cov-lineages / pangolin-data

Repository for storing latest model, protobuf, designation hash and alias files for pangolin assignments
GNU General Public License v3.0

B.1.118 gets called as B.1 #49

Open KatSteinke opened 8 months ago

KatSteinke commented 8 months ago

With pangolin-data version 1.23.1, one of our positive controls, which has consistently been called as B.1.118, suddenly gets called as B.1. We're running pangolin 4.3 in usher placement mode. The relevant versions are:

 - constellations==0.1.12
 - pangolin==4.3
 - pangolin-data==1.23.1
 - scorpio==0.3.17
 - tabulate<0.9.0
 - usher==0.6.3

Given it's a positive control I should be able to share the sequence if needed, but it looks like this might be a general issue with B.1.118 sequences - UCSC UShER gives the same results for a bunch of B.1.118 genomes from GISAID, while COG-UK (still on 1.22) gives B.1.118 - kudos to Ammar Aziz over on the µbioinfo slack for digging into it.

rmcolq commented 8 months ago

There was a comment on the release of 1.23:

"NOTE: the v1.23 tree provokes a corner-case bug in usher-sampled prior to version 0.6.3 that causes some lineage A samples to be assigned to A.* sublineages or even B or B.* sublineages. If you will be running pangolin on early 2020 sequences that may be lineage A, then it is highly recommended to use the assignment cache (install by running pangolin --add-assignment-cache, run pangolin on input sequences with --use-assignment-cache) and to update the usher package in your pangolin environment to 0.6.3 as soon as it is released."

Are you using the assignment cache mode?

KatSteinke commented 8 months ago

We’re not - we don’t have any A lineages among our control samples, and I’d understood the instructions in the notes as a workaround until Usher 0.6.3 was available and thus assumed it wasn’t relevant now that version was out. I’ll try and see how it looks with assignment cache mode as soon as I can.

rmcolq commented 8 months ago

Perhaps @AngieHinrichs can clarify whether this is still the problem?

KatSteinke commented 8 months ago

The issue seems to persist after running pangolin --add-assignment-cache and then running with --use-assignment-cache.

Running

 pangolin /path/to/positive_control.consensus.fasta --outfile /path/to/pangolin-assignment.csv --threads 6 --analysis-mode usher --use-assignment-cache

in a fresh conda env with the specs given above results in the following output:

 taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,constellation_version,is_designated,qc_status,qc_notes,note
 positive_control,B.1,0.0,,,,,,PUSHER-v1.23.1,4.3,0.3.17,v0.1.12,False,pass,Ambiguous_content:0.02,Usher placements: B.1(1/1)
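
For anyone who wants to catch this kind of change automatically after a pangolin-data update, here is a minimal sketch of a positive-control check. It assumes the report is comma-separated and uses the taxon and lineage columns from the header above. The EXPECTED mapping and the check_controls function are hypothetical names made up for this example, not part of pangolin, and the report text is trimmed to three columns.

```python
import csv
import io

# Hypothetical mapping: control sample name -> lineage it has historically received.
EXPECTED = {"positive_control": "B.1.118"}


def check_controls(csv_text, expected):
    """Return (taxon, expected_lineage, observed_lineage) for every mismatched control."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        taxon = row["taxon"]
        # Only samples listed as controls are checked; everything else is ignored.
        if taxon in expected and row["lineage"] != expected[taxon]:
            mismatches.append((taxon, expected[taxon], row["lineage"]))
    return mismatches


# Report trimmed to three columns for brevity; a real report has more.
REPORT = (
    "taxon,lineage,version\n"
    "positive_control,B.1,PUSHER-v1.23.1\n"
)

for taxon, want, got in check_controls(REPORT, EXPECTED):
    print(f"{taxon}: expected {want}, got {got}")
```

With a real report, the text could be read from the file passed to --outfile instead of the inline string.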

AngieHinrichs commented 8 months ago

Thanks for reporting this. I will fix it in the next release.

Due to a recent shuffling around of the order in which mutations are annotated on successive branches, B.1.118 is annotated on a small branch within the larger branch where it should be annotated, with two extra mutations, one of which is absent from most samples. In previous versions, although B.1.118 was annotated on a branch that had the two extra mutations, the extra mutations were placed on a larger branch, and then one was reverted to reference on a sub-branch that covered most of the samples. The effect was that in the previous version, B.1.118 samples without the extra mutation(s) would be placed on the branch where B.1.118 was annotated (with a reversion on the mutation that shouldn't have been in the path in the first place), but now, with an arguably better structure / order of mutations, B.1.118 is annotated on a sub-branch and I need to fix the annotation.
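
To make the effect concrete, here is a toy sketch of the two tree layouts described above. It is not how UShER places samples: real placement is parsimony-based on the full tree, while this model just picks the node whose accumulated mutations differ least from the sample and then reports the nearest annotated ancestor. The mutation labels (d1, d2 for lineage-defining mutations, x1, x2 for the two extra ones), the node names, and the Node class are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    gained: frozenset = frozenset()    # mutations added on this branch
    reverted: frozenset = frozenset()  # mutations reverted to reference on this branch
    lineage: Optional[str] = None      # lineage annotation, if any

    def mutations(self):
        """Mutations accumulated along the path from the root to this node."""
        inherited = self.parent.mutations() if self.parent else frozenset()
        return (inherited | self.gained) - self.reverted

    def call(self):
        """Lineage of the nearest annotated node, walking up from this node."""
        node = self
        while node is not None:
            if node.lineage:
                return node.lineage
            node = node.parent
        return None


def place(sample, nodes):
    """Toy placement: the node whose mutation set differs least from the sample."""
    return min(nodes, key=lambda n: len(n.mutations() ^ sample))


def old_tree():
    # Extra mutations sit on the larger, annotated branch; x2 is reverted below it.
    root = Node("root", lineage="B.1")
    big = Node("big", root, gained=frozenset({"d1", "d2", "x1", "x2"}), lineage="B.1.118")
    sub = Node("sub", big, reverted=frozenset({"x2"}))
    return [root, big, sub]


def new_tree():
    # Extra mutations sit on a small sub-branch, which also carries the annotation.
    root = Node("root", lineage="B.1")
    big = Node("big", root, gained=frozenset({"d1", "d2"}))
    small = Node("small", big, gained=frozenset({"x1", "x2"}), lineage="B.1.118")
    return [root, big, small]


sample = frozenset({"d1", "d2"})  # a B.1.118 sample without the extra mutations
print(place(sample, old_tree()).call())  # B.1.118
print(place(sample, new_tree()).call())  # B.1
```

In the second layout the sample lands on the larger branch, which has no annotation of its own, so the call falls back to B.1. Moving the annotation up to that larger branch restores the B.1.118 call in this toy model.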