cov-lineages / pangolin-data

Repository for storing latest model, protobuf, designation hash and alias files for pangolin assignments
GNU General Public License v3.0

lineage assignment change from dataset 1.25.1 -> 1.26 #56

Closed: joel-bitscopic closed this issue 6 months ago

joel-bitscopic commented 6 months ago

Hello!

I'm hoping you can help shed some light on why Pangolin is now assigning a lineage of JN.1 to a sample it previously called JN.1.18. I'm attaching the COVID FASTA file as well as the combined raw .csv file output from both samples.

As you will see, Pangolin made 1/2 placements on JN.1 and 1/2 placements on JN.1.18 in both runs. I see some activity around other JN.1 sub-lineages in the release notes but nothing directly affecting either JN.1 or JN.1.18.

PG-238198.fasta.txt

Output combined:

taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,constellation_version,is_designated,qc_status,qc_notes,note
SAMPLE 1,JN.1.18,0.5,,Omicron (BA.2-like),0.84,0.03,scorpio call: Alt alleles 52; Ref alleles 2; Amb alleles 5; Oth alleles 3,PUSHER-v1.25.1,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: JN.1(1/2) JN.1.18(1/2)
SAMPLE 2,JN.1,0.5,,Omicron (BA.2-like),0.84,0.03,scorpio call: Alt alleles 52; Ref alleles 2; Amb alleles 5; Oth alleles 3,PUSHER-v1.26,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: JN.1(1/2) JN.1.18(1/2)
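
For anyone comparing runs like these programmatically, here is a minimal Python sketch that reads the combined output and pulls the per-lineage placement counts out of the `note` column. The regular expression and the helper name `parse_placements` are assumptions based only on the two rows shown above, not on a documented pangolin format, so treat it as illustrative.

```python
import csv
import io
import re

# The combined output from the two runs, copied from this issue.
CSV_TEXT = """\
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,constellation_version,is_designated,qc_status,qc_notes,note
SAMPLE 1,JN.1.18,0.5,,Omicron (BA.2-like),0.84,0.03,scorpio call: Alt alleles 52; Ref alleles 2; Amb alleles 5; Oth alleles 3,PUSHER-v1.25.1,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: JN.1(1/2) JN.1.18(1/2)
SAMPLE 2,JN.1,0.5,,Omicron (BA.2-like),0.84,0.03,scorpio call: Alt alleles 52; Ref alleles 2; Amb alleles 5; Oth alleles 3,PUSHER-v1.26,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: JN.1(1/2) JN.1.18(1/2)
"""

# Assumed pattern: a lineage name immediately followed by "(count/total)".
PLACEMENT_RE = re.compile(r"([^\s(]+)\((\d+)/(\d+)\)")


def parse_placements(note):
    """Return {lineage: (count, total)} parsed from a note string (hypothetical helper)."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in PLACEMENT_RE.finditer(note)
    }


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    # Show the data version, the final call, and the underlying placements side by side.
    print(row["taxon"], row["version"], row["lineage"], parse_placements(row["note"]))
```

Printing the final call next to the placements makes it easy to spot cases like this one, where the placements are tied in both runs and only the reported lineage differs.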

I would sincerely appreciate any insight you can offer. Thanks!

Best, Joel

AngieHinrichs commented 6 months ago

Hi Joel! For this kind of question it can be helpful to upload your sequence to https://usher.bio, which places it on the UCSC UShER tree of SARS-CoV-2 genomes. Since you provided the FASTA, I went ahead and did that, with two changes to the defaults: I switched the tree from the public-only tree of 8M genomes to the public+GISAID tree of 16M genomes, and I raised the subtree size from 50 to 500. Here is the resulting subtree view, zoomed in to show PG-238198 with the most similar sequences: https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/pangolin-data-56?label=id:node_6951933

Your sequence is placed on the branch for JN.1.26, which was designated after the latest pango-designation release (v1.26). Pangolin therefore can't assign JN.1.26 yet, but it should after the next pangolin-data release, which will follow the next pango-designation release (v1.27). In the UShER tree, JN.1.26 is JN.1 > C26894T > G22599C (S:R346T), i.e. it sits on a branch indicating that C26894T was acquired before G22599C (S:R346T), while JN.1.18 is just JN.1 > G22599C (S:R346T). Our current understanding is that S:R346T confers a selective advantage: when that mutation arises the virus tends to be transmitted more often, so branches that independently acquire S:R346T tend to grow faster than other branches. When a branch shows a consistent growth advantage it may be designated as a new lineage, like JN.1.26 in this case.

Internally, pangolin uses a severely pruned version of the full UShER tree (lineageTree.pb in this repository). First, 50 random representatives are selected for each lineage, and then private mutations in those representatives are discarded. This keeps some of the diversity within each lineage, but for really large branches like JN.1 it also discards a lot. So when a large lineage has a handful of branches with an advantageous mutation, different branches might be randomly selected from one release to the next. If the C26894T > G22599C branch happened to be included, it would unambiguously pull your sequence into JN.1, since pangolin can't yet label that branch JN.1.26. But if it's not included, then having some G22599C branches in JN.1, plus JN.1.18 (with no clue about C26894T), can make it a toss-up. Sorry about the instability of assignments, but the good news is that it should be more stable for your particular sequence after the next release, when JN.1.26 is included.
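
To make the random-selection effect concrete, here is a toy Python sketch. It is not pangolin's or UShER's actual pruning code: the function names, the population size of 10,000, and the 100 sequences on the sub-branch are all hypothetical numbers chosen for illustration. It only shows that when 50 representatives are drawn at random from a large lineage, a small sub-branch may or may not be represented depending on the draw.

```python
import random

BRANCH = "C26894T>G22599C"  # label for the sub-branch of interest


def make_toy_lineage(n_total=10_000, n_on_branch=100, branch=BRANCH):
    """Build a toy lineage as (sequence_id, branch_label) pairs. Sizes are hypothetical."""
    return [
        (f"seq{i}", branch if i < n_on_branch else "other")
        for i in range(n_total)
    ]


def pick_representatives(sequences, k=50, seed=None):
    """Pick up to k random representatives, mimicking the idea of per-lineage subsampling."""
    rng = random.Random(seed)
    if len(sequences) <= k:
        return list(sequences)
    return rng.sample(sequences, k)


def branch_is_represented(representatives, branch=BRANCH):
    """True if at least one representative sits on the given sub-branch."""
    return any(label == branch for _, label in representatives)


lineage = make_toy_lineage()
for seed in range(5):
    # Each seed stands in for a different release's random draw.
    reps = pick_representatives(lineage, k=50, seed=seed)
    print(f"draw {seed}: sub-branch represented = {branch_is_represented(reps)}")
```

With these made-up proportions the sub-branch is missed in a sizable fraction of draws, which is the kind of release-to-release variation described above.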

joel-bitscopic commented 6 months ago

Thanks for the quick reply, Angie! And for the very detailed and thorough response. Your explanation makes sense (and I also now have a better understanding of the inner workings of Pangolin's algorithm).

Thanks again for taking the time to respond. I will share this information with my team and mark this issue as resolved. Have a great day!