Closed kevinlibuit closed 3 years ago
Sorry to hear that and thanks for reporting with example sequences! With the latest version of pangolin and all its components (pangolin v3.1.7, pangoLEARN 2021-07-28, constellations v0.0.11, scorpio v0.3.8), and with this conda list output for usher and faToVcf:
# Name Version Build Channel
usher 0.3.2 h7b4d82f_0 bioconda
ucsc-fatovcf 407 hd50662f_0 bioconda
I am not seeing the failure. Can you try running pangolin with --no-temp and then send the logs/usher.log file to hopefully show the error message?
Oh, wow. Working on our end and I think the error may be due to the FASTA header. I de-identified the sample assemblies in the file I uploaded as Sample_01
and Sample_02
, but the original files had sample names with five humber (e.g. 12345
). Looks like it's only for single-sample fasta files as well. I will post sample files in a second
Looking back at other occurrences of this failure, we're realizing that they all happen on samples with short (<=5) numeric sample names
Phew, this took us down a rabbit hole 😅
FYI, we first noticed this occurring in the latest staphb/pangolin:3.1.7-pangolearn-2021-07-09
docker container, but was able to replicate the error with a standard local conda install on Ubuntu 20.04. I followed the instructions here to install
pangolin --usher
fails when the fasta header is less than 5 characters, but passes with a longer fasta header. So here's the same sequence, just with different fasta header
$ head -n 1 12345.ivar.consensus.fasta
>12345
$ head -n 1 123456.ivar.consensus.fasta
>123456
$ pangolin 12345.ivar.consensus.fasta --usher --outfile 1234567890.pangolin_report.PUSHER-localinstall.csv --min-length 10000 --max-ambig 0.5 --verbose
<output redacted>
[Fri Jul 30 20:14:17 2021]
rule use_usher:
input: /tmp/tmp_bgh_n3w/not_assigned.fasta, /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta, /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangoLEARN/data/lineageTree.pb
output: /tmp/tmp_bgh_n3w/clades.txt
log: /tmp/tmp_bgh_n3w/logs/usher.log
jobid: 2
resources: tmpdir=/tmp
echo "Using UShER as inference engine."
if [ -s /tmp/tmp_bgh_n3w/not_assigned.fasta ]; then
faToVcf <(cat /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta <(echo "") /tmp/tmp_bgh_n3w/not_assigned.fasta) /tmp/tmp_bgh_n3w/sequences.aln.vcf
usher -n -D -i /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangoLEARN/data/lineageTree.pb -v /tmp/tmp_bgh_n3w/sequences.aln.vcf -T 1 -d '/tmp/tmp_bgh_n3w' &> /tmp/tmp_bgh_n3w/logs/usher.log
else
rm -f /tmp/tmp_bgh_n3w/clades.txt
touch /tmp/tmp_bgh_n3w/clades.txt
fi
Using UShER as inference engine.
Waiting at most 3 seconds for missing files.
MissingOutputException in line 242 of /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk:
Job Missing files after 3 seconds:
/tmp/tmp_bgh_n3w/clades.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 2 completed successfully, but some output files are missing. 2
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /tmp/tmp_bgh_n3w/.snakemake/log/2021-07-30T201414.326120.snakemake.log
$ pangolin 123456.ivar.consensus.fasta --usher --outfile 123456.pangolin_report.PUSHER-localinstall.csv --min-length 10000 --max-ambig 0.5 --verbose
<output redacted>
Output file written to: /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/123456.pangolin_report.PUSHER-localinstall.csv
[Fri Jul 30 20:14:52 2021]
Finished job 1.
5 of 6 steps (83%) done
Select jobs to execute...
[Fri Jul 30 20:14:52 2021]
localrule all:
input: /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/123456.pangolin_report.PUSHER-localinstall.csv, /tmp/tmpkag00hos/VOC_report.scorpio.csv
jobid: 0
resources: tmpdir=/tmp
[Fri Jul 30 20:14:52 2021]
Finished job 0.
6 of 6 steps (100%) done
Complete log: /tmp/tmpkag00hos/.snakemake/log/2021-07-30T201446.609171.snakemake.log
conda environment info:
$ conda list
# packages in environment at /home/curtis_kapsak/miniconda3/envs/pangolin:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
amply 0.1.4 py_0 conda-forge
appdirs 1.4.4 pyh9f0ad1d_0 conda-forge
attrs 21.2.0 pyhd8ed1ab_0 conda-forge
biopython 1.74 py38h516909a_0 conda-forge
boost-cpp 1.74.0 h312852a_4 conda-forge
brotlipy 0.7.0 py38h497a2fe_1001 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2021.5.30 ha878542_0 conda-forge
certifi 2021.5.30 py38h578d9bd_0 conda-forge
cffi 1.14.6 py38ha65f79e_0 conda-forge
chardet 4.0.0 py38h578d9bd_1 conda-forge
charset-normalizer 2.0.0 pyhd8ed1ab_0 conda-forge
coincbc 2.10.5 hcee13e7_1 conda-forge
configargparse 1.5.1 pyhd8ed1ab_0 conda-forge
connection_pool 0.0.3 pyhd3deb0d_0 conda-forge
constellations 0.0.12 pypi_0 pypi
cryptography 3.4.7 py38ha5dfef3_0 conda-forge
datrie 0.8.2 py38h497a2fe_2 conda-forge
docutils 0.17.1 py38h578d9bd_0 conda-forge
filelock 3.0.12 pyh9f0ad1d_0 conda-forge
gitdb 4.0.7 pyhd8ed1ab_0 conda-forge
gitpython 3.1.18 pyhd8ed1ab_0 conda-forge
gofasta 0.0.5 h9ee0642_0 bioconda
icu 68.1 h58526e2_0 conda-forge
idna 3.1 pyhd3deb0d_0 conda-forge
importlib-metadata 4.6.1 py38h578d9bd_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
joblib 1.0.1 pypi_0 pypi
jsonschema 3.2.0 pyhd8ed1ab_3 conda-forge
jupyter_core 4.7.1 py38h578d9bd_0 conda-forge
k8 0.2.5 h9a82719_1 bioconda
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libblas 3.9.0 10_openblas conda-forge
libcblas 3.9.0 10_openblas conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 11.1.0 hc902ee8_4 conda-forge
libgfortran-ng 9.3.0 hff62375_19 conda-forge
libgfortran5 9.3.0 hff62375_19 conda-forge
libgomp 11.1.0 hc902ee8_4 conda-forge
liblapack 3.9.0 10_openblas conda-forge
libopenblas 0.3.17 pthreads_h8fe5266_1 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libprotobuf 3.15.8 h780b84a_0 conda-forge
libstdcxx-ng 11.1.0 h56837e0_4 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
mafft 7.487 h779adbc_0 bioconda
minimap2 2.21 h5bf99c6_0 bioconda
mysql-connector-c 6.1.11 h6eb9d5d_1007 conda-forge
nbformat 5.1.3 pyhd8ed1ab_0 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
numpy 1.21.1 py38h9894fe3_0 conda-forge
openssl 1.1.1k h7f98852_0 conda-forge
pandas 1.0.1 pypi_0 pypi
pango-designation 1.2.47 pypi_0 pypi
pangolearn 2021-07-28 pypi_0 pypi
pangolin 3.1.7 pypi_0 pypi
pip 19.3.1 py38_0 conda-forge
psutil 5.8.0 py38h497a2fe_1 conda-forge
pulp 2.4 py38h578d9bd_0 conda-forge
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pyopenssl 20.0.1 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pyrsistent 0.17.3 py38h497a2fe_2 conda-forge
pysam 0.16.0.1 pypi_0 pypi
pysocks 1.7.1 py38h578d9bd_3 conda-forge
python 3.8.10 h49503c6_1_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.8 2_cp38 conda-forge
pytz 2021.1 pypi_0 pypi
pyyaml 5.4.1 py38h497a2fe_0 conda-forge
ratelimiter 1.2.0 py_1002 conda-forge
readline 8.1 h46c0cb4_0 conda-forge
requests 2.26.0 pyhd8ed1ab_0 conda-forge
scikit-learn 0.24.2 pypi_0 pypi
scipy 1.7.0 pypi_0 pypi
scorpio 0.3.8 pypi_0 pypi
setuptools 49.6.0 py38h578d9bd_3 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
smart_open 5.1.0 pyhd8ed1ab_1 conda-forge
smmap 3.0.5 pyh44b312d_0 conda-forge
snakemake-minimal 6.6.1 pyhdfd78af_0 bioconda
sqlite 3.36.0 h9cd32fc_0 conda-forge
stopit 1.1.2 py_0 conda-forge
tabulate 0.8.9 pyhd8ed1ab_0 conda-forge
threadpoolctl 2.2.0 pypi_0 pypi
tk 8.6.10 h21135ba_1 conda-forge
toposort 1.6 pyhd8ed1ab_0 conda-forge
traitlets 5.0.5 py_0 conda-forge
typing_extensions 3.10.0.0 pyha770c72_0 conda-forge
ucsc-fatovcf 407 hd50662f_0 bioconda
urllib3 1.26.6 pyhd8ed1ab_0 conda-forge
usher 0.3.5 h7b4d82f_0 bioconda
wheel 0.36.2 pyhd3deb0d_0 conda-forge
wrapt 1.12.1 py38h497a2fe_3 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yaml 0.2.5 h516909a_0 conda-forge
zipp 3.5.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.5.0 ha95c52a_0 conda-forge
Test files for your reference: 12345.ivar.consensus.fasta.txt 123456.ivar.consensus.fasta.txt
Let us know if we can provide any more information to help!
Curtis
Ah I forgot about the log files you asked for. I re-ran the 12345.ivar.consensus.fasta
assembly through again, but with --no-temp
option this time and it failed with the same error message. Here are the two logs it produced:
$ cat logs/usher.log
Output newick files will have branch lengths equal to the number of mutations of that branch.
Initializing 1 worker threads.
Loading existing mutation-annotated tree object from file /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangoLEARN/data/lineageTree.pb
Completed in 161 msec
Loading VCF file
WARNING: Ignoring sample 12345 as it is already in the tree.
Completed in 0 msec
Found 0 missing samples.
Writing final tree to file /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/12345-with-no-temp-option/final-tree.nh
The parsimony score for this tree is: 65861
Completed in 44 msec
# and
$ cat logs/minimap2_sam.log
[M::mm_idx_gen::0.003*1.33] collected minimizers
[M::mm_idx_gen::0.004*1.24] sorted minimizers
[M::main::0.004*1.23] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.004*1.22] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.004*1.21] distinct minimizers: 2952 (100.00% are singletons); average occurrences: 1.000; average spacing: 10.130; total length: 29903
[M::worker_pipeline::0.015*1.06] mapped 1 sequences
[M::main] Version: 2.21-r1071
[M::main] CMD: minimap2 -a -x asm5 --sam-hit-only --secondary=no -t 1 -o /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/12345-with-no-temp-option/mapped.sam /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/12345-with-no-temp-option/query.post_qc.fasta
[M::main] Real time: 0.016 sec; CPU: 0.017 sec; Peak RSS: 0.005 GB
Thanks Curtis! Yes, here's the key line from the usher.log file:
WARNING: Ignoring sample 12345 as it is already in the tree.
The problem is that usher ignores samples that have already been placed in the tree -- and it also ignores samples whose names happen to be the same as the numeric node IDs that usher uses internally. So as you discovered, numbers less than or equal to the number of internal nodes in lineageTree.pb mysteriously fail to get clade assignments from usher.
I ran into the same problem in the UShER web interface and worked around it by detecting numeric sequence names and adding a prefix to them to prevent them from clashing with internal node names. But arguably usher should be doing that, since this isn't the first time that this issue has come up! I will add an issue in the usher repo.
Aha! That makes total sense now. Thanks for clearing this up and for your help, Angie. We appreciate it.
We'll have to advise our colleagues to use more unique sample IDs in the future 😆
I'll close this issue since it's not really pangolin's problem, and there's a workaround (don't use purely numeric sequence names; add a prefix or suffix). Hopefully we can change usher to handle numeric names in the future.
When using the PUSHER inference engine, we have observed a few assemblies, (two samples attached), that lead to a pangolin failure with the following error message:
These same samples, however, can be assigned a lineage with PLEARN; Sample_01: B.1.1.7 and Sample_02: B.1.617.2.