cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

Misclassification of Alpha as Omicron in GISAID #366

Closed krobison13 closed 2 years ago

krobison13 commented 2 years ago

GISAID contains a large number of sequences that it tags as Omicron, with a Pango lineage such as BA.1 / BA.1.1, but whose collection and submission dates predate the discovery of Omicron in November 2021.

In the case of hCoV-19/USA/CT-JAX-J000048/2021|EPI_ISL_1203832|2021-02-15 (I have not looked at more examples), the sequence is classified as Alpha by Nextclade, and given its collection date it is much more likely to be Alpha than Omicron.
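A date-based sanity check is one way to surface records like this in bulk. The following is only a minimal sketch: the cutoff date, the function names, and the second (hypothetical) record are assumptions for illustration, and it assumes the sequence name ends in a complete `|YYYY-MM-DD` collection date.

```python
from datetime import date

# Assumption: Omicron was first reported in November 2021, so an Omicron call
# on a sample collected before this date is treated as suspect.
OMICRON_CUTOFF = date(2021, 11, 1)


def collection_date(taxon: str) -> date:
    """Parse the trailing |YYYY-MM-DD field of a GISAID-style sequence name.

    Raises ValueError for incomplete dates such as 2021-02.
    """
    return date.fromisoformat(taxon.rsplit("|", 1)[1])


def is_suspect_omicron(taxon: str, lineage: str) -> bool:
    """True if the lineage is B.1.1.529 / BA.* but the sample predates the cutoff."""
    is_omicron = lineage == "B.1.1.529" or lineage.startswith("BA.")
    return is_omicron and collection_date(taxon) < OMICRON_CUTOFF


records = [
    # Sequence from this report.
    ("hCoV-19/USA/CT-JAX-J000048/2021|EPI_ISL_1203832|2021-02-15", "BA.1.1"),
    # Hypothetical control record, not a real GISAID entry.
    ("hCoV-19/Example/hypothetical-1/2021|EPI_ISL_0000000|2021-12-20", "BA.1.1"),
]

suspects = [name for name, lineage in records if is_suspect_omicron(name, lineage)]
```

Only the first record ends up in `suspects`; the hypothetical December 2021 sample is left alone because its date is plausible for Omicron.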

AngieHinrichs commented 2 years ago

Wow, USA/CT-JAX-J000048/2021 does look like a plain B.1.1.7 sequence (with a few reversions to reference pointed out by nextclade). I would expect Scorpio to have overridden the BA.1.1 call because Scorpio doesn't identify it as Omicron. Strangely Scorpio doesn't call it as Alpha either:

hCoV-19/USA/CT-JAX-J000048/2021|EPI_ISL_1203832|2021-02-15,BA.1.1,0.0,0.8496385542168674,,,,PLEARN-v1.2.123,3.1.19,2022-01-20,v1.2.123,passed_qc,

@krobison13 have you tried pangolin --usher? It is slower than pangoLEARN but determines the lineage by placing your sequences on a phylogenetic tree. Ah, for this sequence, Scorpio does override usher's assignment of B.1.1.7 (so the assignment is None; see the note in the final column about overriding the usher call):

hCoV-19/USA/CT-JAX-J000048/2021|EPI_ISL_1203832|2021-02-15,None,,,,,,PUSHER-v1.2.123,3.1.19,,v1.2.123,passed_qc,usher lineage assignment B.1.1.7 was not supported by scorpio; Usher placements: B.1.1.7(4/4)
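The two result rows above are easier to compare once they are split into named fields. This is a sketch only: the header row is not quoted in this thread, so the column names below are an assumption based on the pangolin 3.x report layout, chosen because they line up with the 13 fields in each row.

```python
import csv
import io

# Assumed column order for a pangolin 3.1.x lineage report; the header is not
# shown in this thread, so treat these names as an assumption.
COLUMNS = [
    "taxon", "lineage", "conflict", "ambiguity_score",
    "scorpio_call", "scorpio_support", "scorpio_conflict",
    "version", "pangolin_version", "pangoLEARN_version",
    "pango_version", "status", "note",
]

# The two rows quoted in this thread (pangoLEARN mode, then --usher mode).
ROWS = (
    "hCoV-19/USA/CT-JAX-J000048/2021|EPI_ISL_1203832|2021-02-15,BA.1.1,0.0,"
    "0.8496385542168674,,,,PLEARN-v1.2.123,3.1.19,2022-01-20,v1.2.123,passed_qc,\n"
    "hCoV-19/USA/CT-JAX-J000048/2021|EPI_ISL_1203832|2021-02-15,None,,,,,,"
    "PUSHER-v1.2.123,3.1.19,,v1.2.123,passed_qc,"
    "usher lineage assignment B.1.1.7 was not supported by scorpio; "
    "Usher placements: B.1.1.7(4/4)\n"
)

reports = list(csv.DictReader(io.StringIO(ROWS), fieldnames=COLUMNS))

for row in reports:
    # An empty scorpio_call means no constellation matched the sequence.
    print(row["version"], row["lineage"], repr(row["scorpio_call"]), row["note"])
```

Read this way, both rows have an empty `scorpio_call`, yet only the usher row carries an override note and a lineage of `None`.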

@rmcolq any idea why Scorpio would override B.1.1.7, but not BA.1.1, when it doesn't match any constellation? (Why Scorpio wouldn't consider this sequence Alpha is another question)

rmcolq commented 2 years ago

This has flagged a few things:

  1. When pangolin checks for false positive VOC/VUI calls, it has been expanding the lineage call, but not expanding the list of VOCs it gets from scorpio. An easy fix is on its way...
  2. This sequence is not being classified as B.1.1.7 by scorpio because it has 4 reference calls and the threshold is currently set to 3. Are these ref calls due to true reversions or to bioinformatics pipelines?
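
The two points above can be illustrated with a small sketch. None of this is pangolin's or scorpio's actual code: the alias table, the VOC list, the function names, and the reading of the threshold as "at most 3 reference calls allowed" are all assumptions. The only fact relied on is that BA is the Pango alias for B.1.1.529.

```python
from typing import Iterable

# Hypothetical alias table; BA is the Pango alias for B.1.1.529.
ALIASES = {"BA": "B.1.1.529"}

# Hypothetical list of VOC lineages as it might come back from scorpio.
VOCS = ["B.1.1.7", "BA.1"]


def expand(lineage: str) -> str:
    """Expand a leading alias, e.g. BA.1.1 -> B.1.1.529.1.1."""
    head, _, rest = lineage.partition(".")
    full = ALIASES.get(head, head)
    return f"{full}.{rest}" if rest else full


def is_voc_call(lineage: str, vocs: Iterable[str], expand_vocs: bool) -> bool:
    """True if the expanded call equals or descends from a listed VOC lineage.

    With expand_vocs=False the VOC names are compared unexpanded, which is
    the mismatch described in point 1.
    """
    call = expand(lineage)
    names = [expand(v) if expand_vocs else v for v in vocs]
    return any(call == v or call.startswith(v + ".") for v in names)


def passes_ref_threshold(ref_calls: int, max_ref: int = 3) -> bool:
    """Assumed reading of point 2: at most max_ref reference calls are allowed."""
    return ref_calls <= max_ref


# B.1.1.7 has no alias, so it is recognised as a VOC call and then checked.
alpha_checked = is_voc_call("B.1.1.7", VOCS, expand_vocs=False)
# BA.1.1 expands to B.1.1.529.1.1, which never matches the unexpanded "BA.1".
omicron_checked_before = is_voc_call("BA.1.1", VOCS, expand_vocs=False)
omicron_checked_after = is_voc_call("BA.1.1", VOCS, expand_vocs=True)
# Four reference calls exceed the threshold of three, so no Alpha call.
alpha_supported = passes_ref_threshold(4)
```

Under these assumptions the B.1.1.7 call is checked and then rejected for having too many reference calls, while the BA.1.1 call escapes the check entirely until both sides of the comparison are expanded.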

krobison13 commented 2 years ago

@AngieHinrichs I have not tried running

Here is a non-Alpha sequence (clade 20A in Nextclade) which is being classified as BA.1.1 (I remembered this time to check it on the Nextclade web app): hCoV-19/Spain/VC-IBV-99034177/2021|EPI_ISL_3925372|2021-06-11

It gets this call despite relatively few amino acid changes (8 substitutions + 3 single amino acid deletions in Spike, 8 substitutions in other genes).

rmcolq commented 2 years ago

These 2 examples are both fixed in the latest release.