cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

Pangolin conflict #508

Open carlottaolivero opened 1 year ago

carlottaolivero commented 1 year ago

Hi, we have some questions about the Pango lineage results from https://pangolin.cog-uk.io/.

1) For some samples, the Pangolin web application reports a conflict:

Lineage: BA.2
Note: Usher placements: BA.2(1/2) BA.2.10.1(1/2)

Lineage: BQ.1.19
Note: Usher placements: BQ.1.1(1/2) BQ.1.19(1/2)

In both cases the conflict is 1/2. What is the reason for calling BA.2 instead of BA.2.10.1 for the first sample, and BQ.1.19 instead of BQ.1.1 for the second one?

2) For some samples, the result from the Pangolin website does not match the Pango lineage given by GISAID (https://gisaid.org/), even though the data version is the same. Here I am referring to the "Pango v.4.2 consensus call".
The following table summarizes the results we are referring to.

image

What could be the reason for this difference?

Many thanks for the help and for your amazing work! Carlotta Olivero

AngieHinrichs commented 1 year ago

Hi! For your first question: usher searches for the most parsimonious placement of your sequence in a tree that represents a random sample of the diversity within each Pango lineage as annotated on UCSC's UShER tree. For some sequences, especially those with a lot of N bases (low coverage / no-call), there are multiple branches on the tree that match your sequence equally well. Unfortunately there isn't a good way to get the details of which mutations make the sequence have equally parsimonious placements somewhere in BA.2 and somewhere in BA.2.10.1.
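To spot these ties in bulk, the note field can be parsed into per-lineage counts. This is a minimal sketch, not part of pangolin: the function names are hypothetical, and reading each `(k/n)` as "k of n equally good placements" is an assumption based on the note strings quoted in the question.

```python
import re

# Matches tokens such as "BA.2(1/2)" or "BQ.1.19(1/2)" in a pangolin note string.
PLACEMENT_RE = re.compile(r"([A-Za-z0-9.]+)\((\d+)/(\d+)\)")

def parse_usher_note(note):
    """Return {lineage: (count, total)} for every placement token in the note."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in PLACEMENT_RE.finditer(note)
    }

def is_tied(placements):
    """True when two or more lineages share the highest placement count."""
    counts = [count for count, _ in placements.values()]
    return len(counts) > 1 and counts.count(max(counts)) > 1

placements = parse_usher_note("Usher placements: BA.2(1/2) BA.2.10.1(1/2)")
print(placements, is_tied(placements))
```

Sequences flagged by `is_tied` are the ones worth re-checking against the full tree or inspecting for low coverage.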

One thing that you can try in these cases is the UShER web interface (https://usher.bio) which places your sequence in the full UShER tree of almost 15 million sequences, instead of the much smaller downsampled tree used by pangolin. There is a higher chance of finding sequences that are more similar to your sequence in the full tree, and that may help to resolve where the sequence really belongs. But again if the sequence has many Ns, or many locations where the reference sequence has been used to fill in missing sequence, or is a mixture of genomes from different lineages (e.g. from a co-infection or recombinant), then it may have multiple equally not-great matches in the full tree too.
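Before re-placing a sequence, it can help to check how much of it is no-call. The following is a rough sketch using only the Python standard library; the function names and the file name in the usage comment are hypothetical, and no particular N-fraction cutoff is implied by pangolin or UShER.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def n_fraction(seq):
    """Fraction of positions that are N (no-call); 0.0 for an empty sequence."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0

# Usage (hypothetical file name):
# with open("samples.fasta") as handle:
#     for name, seq in read_fasta(handle):
#         print(name, round(n_fraction(seq), 3))
```

A sequence with a high N fraction is more likely to have several equally parsimonious placements, in the downsampled tree and in the full tree alike.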

For your second question: I'm not sure exactly what GISAID's "consensus call" means, but I think they might look at results from both UShER and pangoLEARN mode, as well as Scorpio (which can override pangoLEARN's result but not usher's in pangolin output), and use some kind of heuristic to resolve differences between them. Best to ask GISAID how they compute the consensus.