cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0
426 stars 108 forks source link

Missing clades.txt file with PUSHER inference engine #285

Closed kevinlibuit closed 3 years ago

kevinlibuit commented 3 years ago

When using the PUSHER inference engine, we have observed a few assemblies, (two samples attached), that lead to a pangolin failure with the following error message:

Using UShER as inference engine.
Waiting at most 3 seconds for missing files.
MissingOutputException in line 242 of /miniconda/envs/pangolin/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk:

Job Missing files after 3 seconds:
/tmp/tmpukat2ldm/clades.txt

These same samples, however, can be assigned a lineage with PLEARN; Sample_01: B.1.1.7 and Sample_02: B.1.617.2.

AngieHinrichs commented 3 years ago

Sorry to hear that and thanks for reporting with example sequences! With the latest version of pangolin and all its components (pangolin v3.1.7, pangoLEARN 2021-07-28, constellations v0.0.11, scorpio v0.3.8), and with this conda list output for usher and faToVcf:

# Name                    Version                   Build  Channel
usher                     0.3.2                h7b4d82f_0    bioconda
ucsc-fatovcf              407                  hd50662f_0    bioconda

I am not seeing the failure. Can you try running pangolin with --no-temp and then send the logs/usher.log file to hopefully show the error message?

kevinlibuit commented 3 years ago

Oh, wow. Working on our end and I think the error may be due to the FASTA header. I de-identified the sample assemblies in the file I uploaded as Sample_01 and Sample_02, but the original files had sample names with five humber (e.g. 12345). Looks like it's only for single-sample fasta files as well. I will post sample files in a second

kevinlibuit commented 3 years ago

Looking back at other occurrences of this failure, we're realizing that they all happen on samples with short (<=5) numeric sample names

kapsakcj commented 3 years ago

Phew, this took us down a rabbit hole 😅

FYI, we first noticed this occurring in the latest staphb/pangolin:3.1.7-pangolearn-2021-07-09 docker container, but was able to replicate the error with a standard local conda install on Ubuntu 20.04. I followed the instructions here to install

pangolin --usher fails when the fasta header is less than 5 characters, but passes with a longer fasta header. So here's the same sequence, just with different fasta header

$ head -n 1 12345.ivar.consensus.fasta
>12345
$ head -n 1 123456.ivar.consensus.fasta
>123456

$ pangolin 12345.ivar.consensus.fasta --usher --outfile 1234567890.pangolin_report.PUSHER-localinstall.csv --min-length 10000 --max-ambig 0.5 --verbose
<output redacted>
[Fri Jul 30 20:14:17 2021]
rule use_usher:
    input: /tmp/tmp_bgh_n3w/not_assigned.fasta, /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta, /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangoLEARN/data/lineageTree.pb
    output: /tmp/tmp_bgh_n3w/clades.txt
    log: /tmp/tmp_bgh_n3w/logs/usher.log
    jobid: 2
    resources: tmpdir=/tmp

        echo "Using UShER as inference engine."
        if [ -s /tmp/tmp_bgh_n3w/not_assigned.fasta ]; then
            faToVcf <(cat /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta <(echo "") /tmp/tmp_bgh_n3w/not_assigned.fasta) /tmp/tmp_bgh_n3w/sequences.aln.vcf
            usher -n -D -i /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangoLEARN/data/lineageTree.pb -v /tmp/tmp_bgh_n3w/sequences.aln.vcf -T 1 -d '/tmp/tmp_bgh_n3w' &> /tmp/tmp_bgh_n3w/logs/usher.log
        else
            rm -f /tmp/tmp_bgh_n3w/clades.txt
            touch /tmp/tmp_bgh_n3w/clades.txt
        fi

Using UShER as inference engine.
Waiting at most 3 seconds for missing files.
MissingOutputException in line 242 of /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk:
Job Missing files after 3 seconds:
/tmp/tmp_bgh_n3w/clades.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 2 completed successfully, but some output files are missing. 2
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /tmp/tmp_bgh_n3w/.snakemake/log/2021-07-30T201414.326120.snakemake.log

$ pangolin 123456.ivar.consensus.fasta --usher --outfile 123456.pangolin_report.PUSHER-localinstall.csv --min-length 10000 --max-ambig 0.5 --verbose
<output redacted>
Output file written to: /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/123456.pangolin_report.PUSHER-localinstall.csv
[Fri Jul 30 20:14:52 2021]
Finished job 1.
5 of 6 steps (83%) done
Select jobs to execute...

[Fri Jul 30 20:14:52 2021]
localrule all:
    input: /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/123456.pangolin_report.PUSHER-localinstall.csv, /tmp/tmpkag00hos/VOC_report.scorpio.csv
    jobid: 0
    resources: tmpdir=/tmp

[Fri Jul 30 20:14:52 2021]
Finished job 0.
6 of 6 steps (100%) done
Complete log: /tmp/tmpkag00hos/.snakemake/log/2021-07-30T201446.609171.snakemake.log

conda environment info:

$ conda list
# packages in environment at /home/curtis_kapsak/miniconda3/envs/pangolin:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
amply                     0.1.4                      py_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
biopython                 1.74             py38h516909a_0    conda-forge
boost-cpp                 1.74.0               h312852a_4    conda-forge
brotlipy                  0.7.0           py38h497a2fe_1001    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2021.5.30            ha878542_0    conda-forge
certifi                   2021.5.30        py38h578d9bd_0    conda-forge
cffi                      1.14.6           py38ha65f79e_0    conda-forge
chardet                   4.0.0            py38h578d9bd_1    conda-forge
charset-normalizer        2.0.0              pyhd8ed1ab_0    conda-forge
coincbc                   2.10.5               hcee13e7_1    conda-forge
configargparse            1.5.1              pyhd8ed1ab_0    conda-forge
connection_pool           0.0.3              pyhd3deb0d_0    conda-forge
constellations            0.0.12                   pypi_0    pypi
cryptography              3.4.7            py38ha5dfef3_0    conda-forge
datrie                    0.8.2            py38h497a2fe_2    conda-forge
docutils                  0.17.1           py38h578d9bd_0    conda-forge
filelock                  3.0.12             pyh9f0ad1d_0    conda-forge
gitdb                     4.0.7              pyhd8ed1ab_0    conda-forge
gitpython                 3.1.18             pyhd8ed1ab_0    conda-forge
gofasta                   0.0.5                h9ee0642_0    bioconda
icu                       68.1                 h58526e2_0    conda-forge
idna                      3.1                pyhd3deb0d_0    conda-forge
importlib-metadata        4.6.1            py38h578d9bd_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
joblib                    1.0.1                    pypi_0    pypi
jsonschema                3.2.0              pyhd8ed1ab_3    conda-forge
jupyter_core              4.7.1            py38h578d9bd_0    conda-forge
k8                        0.2.5                h9a82719_1    bioconda
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libblas                   3.9.0               10_openblas    conda-forge
libcblas                  3.9.0               10_openblas    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 11.1.0               hc902ee8_4    conda-forge
libgfortran-ng            9.3.0               hff62375_19    conda-forge
libgfortran5              9.3.0               hff62375_19    conda-forge
libgomp                   11.1.0               hc902ee8_4    conda-forge
liblapack                 3.9.0               10_openblas    conda-forge
libopenblas               0.3.17          pthreads_h8fe5266_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libprotobuf               3.15.8               h780b84a_0    conda-forge
libstdcxx-ng              11.1.0               h56837e0_4    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
mafft                     7.487                h779adbc_0    bioconda
minimap2                  2.21                 h5bf99c6_0    bioconda
mysql-connector-c         6.1.11            h6eb9d5d_1007    conda-forge
nbformat                  5.1.3              pyhd8ed1ab_0    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.21.1           py38h9894fe3_0    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
pandas                    1.0.1                    pypi_0    pypi
pango-designation         1.2.47                   pypi_0    pypi
pangolearn                2021-07-28               pypi_0    pypi
pangolin                  3.1.7                    pypi_0    pypi
pip                       19.3.1                   py38_0    conda-forge
psutil                    5.8.0            py38h497a2fe_1    conda-forge
pulp                      2.4              py38h578d9bd_0    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyrsistent                0.17.3           py38h497a2fe_2    conda-forge
pysam                     0.16.0.1                 pypi_0    pypi
pysocks                   1.7.1            py38h578d9bd_3    conda-forge
python                    3.8.10          h49503c6_1_cpython    conda-forge
python-dateutil           2.8.2                    pypi_0    pypi
python_abi                3.8                      2_cp38    conda-forge
pytz                      2021.1                   pypi_0    pypi
pyyaml                    5.4.1            py38h497a2fe_0    conda-forge
ratelimiter               1.2.0                   py_1002    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.26.0             pyhd8ed1ab_0    conda-forge
scikit-learn              0.24.2                   pypi_0    pypi
scipy                     1.7.0                    pypi_0    pypi
scorpio                   0.3.8                    pypi_0    pypi
setuptools                49.6.0           py38h578d9bd_3    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
smart_open                5.1.0              pyhd8ed1ab_1    conda-forge
smmap                     3.0.5              pyh44b312d_0    conda-forge
snakemake-minimal         6.6.1              pyhdfd78af_0    bioconda
sqlite                    3.36.0               h9cd32fc_0    conda-forge
stopit                    1.1.2                      py_0    conda-forge
tabulate                  0.8.9              pyhd8ed1ab_0    conda-forge
threadpoolctl             2.2.0                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
toposort                  1.6                pyhd8ed1ab_0    conda-forge
traitlets                 5.0.5                      py_0    conda-forge
typing_extensions         3.10.0.0           pyha770c72_0    conda-forge
ucsc-fatovcf              407                  hd50662f_0    bioconda
urllib3                   1.26.6             pyhd8ed1ab_0    conda-forge
usher                     0.3.5                h7b4d82f_0    bioconda
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
wrapt                     1.12.1           py38h497a2fe_3    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
zipp                      3.5.0              pyhd8ed1ab_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.5.0                ha95c52a_0    conda-forge

Test files for your reference: 12345.ivar.consensus.fasta.txt 123456.ivar.consensus.fasta.txt

Let us know if we can provide any more information to help!

Curtis

kapsakcj commented 3 years ago

Ah I forgot about the log files you asked for. I re-ran the 12345.ivar.consensus.fasta assembly through again, but with --no-temp option this time and it failed with the same error message. Here are the two logs it produced:

$ cat logs/usher.log
Output newick files will have branch lengths equal to the number of mutations of that branch.

Initializing 1 worker threads.

Loading existing mutation-annotated tree object from file /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangoLEARN/data/lineageTree.pb
Completed in 161 msec

Loading VCF file
WARNING: Ignoring sample 12345 as it is already in the tree.
Completed in 0 msec

Found 0 missing samples.

Writing final tree to file /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/12345-with-no-temp-option/final-tree.nh
The parsimony score for this tree is: 65861
Completed in 44 msec

# and

$ cat logs/minimap2_sam.log
[M::mm_idx_gen::0.003*1.33] collected minimizers
[M::mm_idx_gen::0.004*1.24] sorted minimizers
[M::main::0.004*1.23] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.004*1.22] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.004*1.21] distinct minimizers: 2952 (100.00% are singletons); average occurrences: 1.000; average spacing: 10.130; total length: 29903
[M::worker_pipeline::0.015*1.06] mapped 1 sequences
[M::main] Version: 2.21-r1071
[M::main] CMD: minimap2 -a -x asm5 --sam-hit-only --secondary=no -t 1 -o /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/12345-with-no-temp-option/mapped.sam /home/curtis_kapsak/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta /home/curtis_kapsak/tests-galore/outputs/2021-07-30-PUSHER-failure-second-try-different-fastaHeaders/12345-with-no-temp-option/query.post_qc.fasta
[M::main] Real time: 0.016 sec; CPU: 0.017 sec; Peak RSS: 0.005 GB
AngieHinrichs commented 3 years ago

Thanks Curtis! Yes, here's the key line from the usher.log file:

WARNING: Ignoring sample 12345 as it is already in the tree.

The problem is that usher ignores samples that have already been placed in the tree -- and it also ignores samples whose names happen to be the same as the numeric node IDs that usher uses internally. So as you discovered, numbers less than or equal to the number of internal nodes in lineageTree.pb mysteriously fail to get clade assignments from usher.

I ran into the same problem in the UShER web interface and worked around it by detecting numeric sequence names and adding a prefix to them to prevent them from clashing with internal node names. But arguably usher should be doing that, since this isn't the first time that this issue has come up! I will add an issue in the usher repo.

kapsakcj commented 3 years ago

Aha! That makes total sense now. Thanks for clearing this up and for your help, Angie. We appreciate it.

We'll have to advise our colleagues to use more unique sample IDs in the future 😆

AngieHinrichs commented 3 years ago

I'll close this issue since it's not really pangolin's problem, and there's a workaround (don't use purely numeric sequence names; add a prefix or suffix). Hopefully we can change usher to handle numeric names in the future.