cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0
Pangolin v4.2 stuck on "Using UShER as inference engine." #500

Closed VinceLiAB closed 1 year ago

VinceLiAB commented 1 year ago

Hi,

Pangolin v4.2 analysis seems to be stuck on the step "Using UShER as inference engine." I have tried analyzing different sets of data ranging from 20 to 90 samples, and they all stop at the same step.

No error messages are given and I didn't encounter this issue prior to the update.

Thank you.

AngieHinrichs commented 1 year ago

Sorry to hear that @VinceLiAB. If you run 'usher --version' what is the output?

VinceLiAB commented 1 year ago

The usher version was definitely the culprit. I updated to v0.6.1 from v0.6.0 and it is working again. Thanks for the quick response!

AngieHinrichs commented 1 year ago

Great, glad it's working for you now!

wm75 commented 1 year ago

@AngieHinrichs v0.6.1 still seems to have a problem with small test input. Simply running 'pangolin pangolin/data/reference.fasta' causes usher-sampled to hang.

This is why the tests for the bioconda recipe update are failing.
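
A hang like this can be caught in an automated test by putting a time limit on the command. The following is a minimal sketch, not part of pangolin or the bioconda recipe: `run_with_timeout` is a hypothetical helper, and the 30-second limit is an arbitrary assumption.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False when the command was still running after timeout_s seconds.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        # Grandchildren (e.g. tools launched by a workflow engine) may survive.
        return False, None
    return True, proc.returncode


if __name__ == "__main__":
    # Stand-in command; a real check might use
    # ["pangolin", "pangolin/data/reference.fasta"] instead.
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
```

With this pattern a hanging run shows up as a failed test after a bounded wait instead of blocking the build indefinitely.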

wm75 commented 1 year ago

see https://github.com/bioconda/bioconda-recipes/pull/38768

AngieHinrichs commented 1 year ago

Oof, thanks @wm75! I tested with tests/test-data/sequence1.fasta, which has a single sequence, but I did not test with reference.fasta! That input leads to a VCF file with no data lines (no mutations), which might be triggering some corner case in usher-sampled. @yceh can you please take a look? Here is the header-only VCF file that is causing usher-sampled to hang:

##fileformat=VCFv4.2
##reference=/data/tmp/tmp1tcryjgq/sequences.withref.fa:outgroup_A
##source=faToVcf /data/tmp/tmp1tcryjgq/sequences.withref.fa /data/tmp/tmp1tcryjgq/sequences.aln.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  5a7f5aa9677f248abcb2bedf90d7f3e2
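
A header-only VCF like the one above can be recognized before it reaches the placement step. The sketch below is only an illustration of that check and is not taken from pangolin or UShER. `vcf_has_data_lines` is a hypothetical helper, and the data line in the example is made up.

```python
def vcf_has_data_lines(vcf_text):
    """Return True if the VCF text contains at least one non-header line.

    Header lines start with '#' (both '##' meta lines and the '#CHROM' line).
    """
    for line in vcf_text.splitlines():
        if line.strip() and not line.startswith("#"):
            return True
    return False


# Shortened header modeled on the file shown above.
header_only = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
)
# Hypothetical data line, for illustration only.
with_mutation = header_only + "ref\t241\t.\tC\tT\t.\t.\t.\tGT\t1\n"

print(vcf_has_data_lines(header_only))    # False
print(vcf_has_data_lines(with_mutation))  # True
```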

wm75 commented 1 year ago

Ah, that makes a lot of sense, thanks! Unfortunately, the bioconda test cannot use the test-data sequence because that's not getting installed. We could introduce a SNP at runtime though for now if that's all that's needed to make the test work, then revert the patch when you have an usher fix.

wm75 commented 1 year ago

> We could introduce a SNP at runtime though for now if that's all that's needed to make the test work, then revert the patch when you have an usher fix.

Yes, it is sufficient to change just a single base in the sequence before using it as a test input!

AngieHinrichs commented 1 year ago

Ah, nice idea with the patch! Something like this should work:

sed -e 's/ACATGGTTTAGTCAGCGTGG/ACATGGTTTAGCCAGCGTGG/' pangolin/data/reference.fasta > $tempFasta

[Edit: NM I see you found your own 😁]
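
The sed one-liner works line by line, so it only matches if the 20-base motif sits on a single line of the FASTA file. The thread does not say whether that holds for reference.fasta. As a hedged alternative, the sketch below does the same single-base substitution on the joined sequence. `introduce_snp` is a hypothetical helper, and it assumes a FASTA file with exactly one record.

```python
def introduce_snp(fasta_text, motif, replacement, width=60):
    """Replace one occurrence of motif in a single-record FASTA and re-wrap it."""
    if len(motif) != len(replacement):
        raise ValueError("motif and replacement must have the same length")
    lines = fasta_text.strip().splitlines()
    header = lines[0]
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Join the sequence so a motif that spans a line break is still found.
    seq = "".join(line.strip() for line in lines[1:])
    if seq.count(motif) != 1:
        raise ValueError("motif must occur exactly once in the sequence")
    mutated = seq.replace(motif, replacement)
    wrapped = [mutated[i:i + width] for i in range(0, len(mutated), width)]
    return header + "\n" + "\n".join(wrapped) + "\n"


# Toy record in which the motif from the sed command is split across two lines.
toy = ">ref\nAAACATGGTTTAG\nTCAGCGTGGAA\n"
print(introduce_snp(toy, "ACATGGTTTAGTCAGCGTGG", "ACATGGTTTAGCCAGCGTGG"))
```

Requiring exactly one match guards against silently changing more than one base or none at all.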

AngieHinrichs commented 1 year ago

@yceh and @yatisht have already fixed it and released usher v0.6.2: https://github.com/yatisht/usher/releases/tag/v0.6.2
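
To check whether an installed usher already includes this fix, the version string can be compared against 0.6.2. The following is a rough sketch: the exact output format of 'usher --version' is not shown in this thread, so the code simply assumes the output contains a dotted x.y.z version somewhere. `parse_version` and `has_fix` are hypothetical helper names.

```python
import re


def parse_version(text):
    """Extract the first x.y.z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no x.y.z version found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def has_fix(version_output, minimum=(0, 6, 2)):
    """Return True if the reported version is at least `minimum`."""
    # Tuples compare element by element, so (0, 10, 0) > (0, 6, 2) as intended.
    return parse_version(version_output) >= minimum


# Assumed example strings; real output may be formatted differently.
print(has_fix("v0.6.1"))
print(has_fix("v0.6.2"))
```

Comparing integer tuples avoids the pitfall of string comparison, where "0.10.0" would sort before "0.6.2".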

wm75 commented 1 year ago

Thanks @AngieHinrichs @yceh @yatisht! The bioconda packages for usher 0.6.2 and for pangolin 4.2 using 0.6.2 of usher are now available.

pangolin 4.2 will also appear on usegalaxy.eu later today, together with pangolin 4.1.3 pinned to the same core dependencies, i.e. both Galaxy tool versions will use:

This way comparisons between usher and usher-sampled should be relatively simple.

AngieHinrichs commented 1 year ago

Great, thanks so much @wm75!

> comparisons between usher and usher-sampled

Just for the record, results should be overall very consistent but not identical, especially when sequences have Ns in lineage-defining positions. usher may place a sequence on a node that starts a lineage even if it has only Ns at the defining mutations (the mutations on the node that starts the lineage). usher-sampled, however, doesn't match all-Ns on the node at the end of the path; it places the sequence on the parent of that node, so in cases like that the sample will be assigned the parental lineage by usher-sampled. Also, usher would find some redundant equally parsimonious placements (EPPs) while usher-sampled is more stringent, so in cases where multiple EPPs would cause different assignments and pangolin takes a vote, the outcomes can be different. [Next on my list: get rid of the voting; with amplicon dropout issues it's looking like a bad idea now, see #492.]
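
The effect of redundant EPPs on a vote can be illustrated with a toy example. This is not pangolin's actual voting code: `vote` is a hypothetical plurality vote, and the lineage names "X" and "X.1" and the placement lists are invented purely to show how dropping redundant placements can flip the outcome.

```python
from collections import Counter


def vote(lineages):
    """Return the most common lineage, or None when the top count is tied."""
    counts = Counter(lineages)
    top = max(counts.values())
    winners = [name for name, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else None


# Invented placements: one set includes redundant EPPs on lineage X.1,
# the other is a stricter set with those redundant placements dropped.
with_redundant_epps = ["X.1", "X.1", "X.1", "X", "X"]
stricter_epps = ["X.1", "X", "X"]

print(vote(with_redundant_epps))  # X.1
print(vote(stricter_epps))        # X
```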