cov-lineages / pangoLEARN

Store of the trained model for pangolin to access.
GNU General Public License v3.0

True B.1 sequences from early 2021 falsely classified as BA.1.1 #67

Closed hoelzer closed 2 years ago

hoelzer commented 2 years ago

Hey pango-team!

After the new BA.1.1 model was added to pangolin, several of our German BA.1 sequences got reassigned to this sublineage. Most of them have the S:R346K change so this makes total sense.

However, we also discovered a few sequences from early days (for example, sampled between Feb and Jun 2021) that were previously assigned B.1 and were now assigned BA.1.1 via Pangolin v3.1.17 and PangoLEARN 2022-01-20.

Judging by the sampling dates and the mutation profiles (see below), these are very likely misclassified as BA.1.1. Maybe the PangoLEARN model can or should be refined? These are only a few sequences out of ~24k German BA.1.1, but tools relying on the data might now show quite early BA.1.1 Omicron sequences that are very likely false-positive assignments.

Here are the German GISAID IDs, together with the sampling dates and the older lineage assignment:

ID SAMPLING_DATE OLD_LINEAGE_ASSIGNMENT NEW_LINEAGE_ASSIGNMENT
EPI_ISL_1216704 2021-02-19 B.1 BA.1.1
EPI_ISL_1353756 2021-03-11 B.1 BA.1.1
EPI_ISL_1354034 2021-03-10 B.1 BA.1.1
EPI_ISL_1354069 2021-03-03 B.1 BA.1.1
EPI_ISL_1354191 2021-03-10 B.1 BA.1.1
EPI_ISL_1354195 2021-03-11 B.1 BA.1.1
EPI_ISL_1354221 2021-03-10 B.1 BA.1.1
EPI_ISL_1438731 2021-03-17 B.1 BA.1.1
EPI_ISL_1570185 2021-03-10 B.1 BA.1.1
EPI_ISL_1570237 2021-03-10 B.1 BA.1.1
EPI_ISL_1848596 2021-04-16 B.1 BA.1.1
EPI_ISL_1991614 2021-03-29 B.1 BA.1.1
EPI_ISL_1991621 2021-03-26 B.1 BA.1.1
EPI_ISL_2260658 2021-05-03 B.1 BA.1.1
EPI_ISL_2450266 2021-05-19 B.1 BA.1.1
EPI_ISL_2637014 2021-06-08 B.1 BA.1.1
EPI_ISL_2845485 2021-06-21 B.1 BA.1.1

And here are the amino acid profiles:

accession       aa_profile
EPI_ISL_1216704 ORF1a:A1708D ORF1a:A3730V ORF1b:A1708D ORF1b:A3730V ORF1b:P4715L S:del:68:3 S:N501Y S:D614G S:P681H S:T716I ORF8:Q27* ORF8:R52I ORF8:Y73C N:D3L
EPI_ISL_1353756 ORF1a:A1708D ORF1a:T1754I ORF1a:I2230T ORF1a:del:3675:3 ORF1b:A1708D ORF1b:T1754I ORF1b:I2230T ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:M740V S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1354034 ORF1a:T346I ORF1a:A1708D ORF1a:del:3675:3 ORF1a:N3985S ORF1b:T346I ORF1b:A1708D ORF1b:del:3675:3 ORF1b:N3985S ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1354069 ORF1a:A1708D ORF1a:del:3675:3 ORF1b:A1708D ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1354191 ORF1a:A1708D ORF1a:I2230T ORF1a:del:3675:3 ORF1b:A1708D ORF1b:I2230T ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1354195 ORF1a:A1708D ORF1a:del:3675:3 ORF1b:A1708D ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1354221 ORF1a:T346I ORF1a:A1708D ORF1a:del:3675:3 ORF1b:T346I ORF1b:A1708D ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1438731 ORF1a:A1708D ORF1a:I2230T ORF1a:M2259I ORF1a:L3644F ORF1a:del:3675:3 ORF1a:L3829F ORF1b:A1708D ORF1b:I2230T ORF1b:M2259I ORF1b:L3644F ORF1b:del:3675:3 ORF1b:L3829F ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:A570D S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1570185 ORF1a:L642F ORF1a:T1001I ORF1a:T1637I ORF1a:A1708D ORF1b:L642F ORF1b:T1001I ORF1b:T1637I ORF1b:A1708D ORF1b:P4715L ORF1b:T5941I S:del:68:3 S:del:143:2 S:N501Y S:A570D S:D614G S:P681H S:T716I S:D1118H ORF6:Y31H ORF7b:E3Q ORF8:Q27* ORF8:R52I ORF8:K68* ORF8:Y73C N:D3L
EPI_ISL_1570237 ORF1a:G379E ORF1a:T2021I ORF1b:G379E ORF1b:T2021I ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:A570D S:D614G S:P681H S:T716I S:D1118H ORF3a:G172C ORF3a:T221A ORF8:Q27* ORF8:R52I ORF8:Y73C N:D3L
EPI_ISL_1848596 ORF1a:E347K ORF1a:A1708D ORF1a:del:3675:3 ORF1a:L3829F ORF1b:E347K ORF1b:A1708D ORF1b:del:3675:3 ORF1b:L3829F ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I
EPI_ISL_1991614 ORF1a:T1001I ORF1a:A1708D ORF1a:I2230T ORF1a:del:3675:3 ORF1b:T1001I ORF1b:A1708D ORF1b:I2230T ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I ORF8:Y73C
EPI_ISL_1991621 ORF1b:P4715L S:del:68:3 S:del:143:2 S:A570D S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I ORF8:Y73C N:S235F
EPI_ISL_2260658 ORF1a:A1708D ORF1a:del:3675:3 ORF1b:A1708D ORF1b:del:3675:3 ORF1b:P4715L S:H66Y S:del:68:3 S:del:143:2 S:D614G S:P681H S:T716I ORF8:Q27* ORF8:R52I N:del:208:2
EPI_ISL_2450266 ORF1a:T1638I ORF1a:A1708D ORF1a:T3255I ORF1a:M3655I ORF1b:T1638I ORF1b:A1708D ORF1b:T3255I ORF1b:M3655I ORF1b:P4715L ORF1b:T5941I S:del:68:3 S:V367F S:N501Y S:Q613H S:P681H S:T716I S:D1118H ORF3a:W131C E:I46V ORF7a:T115A ORF8:Q27* ORF8:R52I ORF8:K68* ORF8:Y73C ORF8:E92K N:S202N
EPI_ISL_2637014 ORF1a:T346I ORF1a:A1708D ORF1a:L2105P ORF1a:del:3675:3 ORF1a:F3753V ORF1b:T346I ORF1b:A1708D ORF1b:L2105P ORF1b:del:3675:3 ORF1b:F3753V ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G S:P681H S:A684V S:T716I S:D1118H ORF8:Q27* ORF8:R52I ORF8:del:119:2
EPI_ISL_2845485 ORF1a:T1001I ORF1a:P2046L ORF1a:T3646A ORF1b:T1001I ORF1b:P2046L ORF1b:T3646A ORF1b:P4715L ORF1b:G6173V S:del:68:3 S:del:143:2 S:A570D S:D614G S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:S43A ORF8:R52I ORF8:Y73C N:S235F
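
To make the mutation-profile argument easy to re-check, here is a minimal Python sketch that screens a profile line for two markers mentioned in this thread: S:R346K (carried by most of the reassigned BA.1 sequences) and ORF8:Q27* (the Alpha ORF8 stop). The function names are hypothetical and this is not how pangolin or scorpio classify sequences; it is only a quick sanity check on the table above.

```python
# Markers taken from this thread; using them as a screen is an illustration only.
BA11_MARKER = "S:R346K"        # change seen in most reassigned BA.1 sequences
ALPHA_ORF8_STOP = "ORF8:Q27*"  # Alpha ORF8 stop

def parse_profile(line):
    """Split 'ACCESSION mut1 mut2 ...' into (accession, set of mutations)."""
    accession, *mutations = line.split()
    return accession, set(mutations)

def screen(line):
    """Report whether a profile carries the two markers (hypothetical helper)."""
    accession, mutations = parse_profile(line)
    return {
        "accession": accession,
        "has_R346K": BA11_MARKER in mutations,
        "has_ORF8_stop": ALPHA_ORF8_STOP in mutations,
    }

# Two rows copied verbatim from the amino acid profile table above.
rows = [
    "EPI_ISL_1354069 ORF1a:A1708D ORF1a:del:3675:3 ORF1b:A1708D "
    "ORF1b:del:3675:3 ORF1b:P4715L S:del:68:3 S:del:143:2 S:N501Y S:D614G "
    "S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I",
    "EPI_ISL_1991621 ORF1b:P4715L S:del:68:3 S:del:143:2 S:A570D S:D614G "
    "S:P681H S:T716I S:D1118H ORF8:Q27* ORF8:R52I ORF8:Y73C N:S235F",
]

results = [screen(row) for row in rows]
for result in results:
    print(result)
```

Both example rows lack S:R346K and carry ORF8:Q27*, which fits the suspicion that these are not genuine BA.1.1 sequences.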

I also checked our GISAID data dump quickly and found 144x BA.1.1 sampled between 2021-02-01:2021-07-01 (sorted by Country, also includes the German IDs mentioned above):

accession       zip     date    lineage
EPI_ISL_1718297 Armenia 2021-04-26      BA.1.1
EPI_ISL_2192464 Belgium 2021-05-20      BA.1.1
EPI_ISL_2614271 Brazil  2021-06-18      BA.1.1
EPI_ISL_1787444.2       France  2021-04-29      BA.1.1
EPI_ISL_1936751.2       France  2021-05-06      BA.1.1
EPI_ISL_1936891.2       France  2021-05-06      BA.1.1
EPI_ISL_1936928.2       France  2021-05-06      BA.1.1
EPI_ISL_1937019.2       France  2021-05-06      BA.1.1
EPI_ISL_1988408.2       France  2021-05-10      BA.1.1
EPI_ISL_1988520.2       France  2021-05-10      BA.1.1
EPI_ISL_2095388.2       France  2021-05-14      BA.1.1
EPI_ISL_2095618.2       France  2021-05-14      BA.1.1
EPI_ISL_2135600 France  2021-05-17      BA.1.1
EPI_ISL_2293321 France  2021-05-27      BA.1.1
EPI_ISL_2454239 France  2021-06-08      BA.1.1
EPI_ISL_1216704 Germany 2021-03-11      BA.1.1
EPI_ISL_1353756 Germany 2021-03-25      BA.1.1
EPI_ISL_1354034 Germany 2021-03-25      BA.1.1
EPI_ISL_1354069 Germany 2021-03-25      BA.1.1
EPI_ISL_1354191 Germany 2021-03-25      BA.1.1
EPI_ISL_1354195 Germany 2021-03-25      BA.1.1
EPI_ISL_1354221 Germany 2021-03-25      BA.1.1
EPI_ISL_1438731 Germany 2021-04-01      BA.1.1
EPI_ISL_1570185 Germany 2021-04-13      BA.1.1
EPI_ISL_1570237 Germany 2021-04-13      BA.1.1
EPI_ISL_1848596 Germany 2021-05-03      BA.1.1
EPI_ISL_1991614 Germany 2021-05-10      BA.1.1
EPI_ISL_1991621 Germany 2021-05-10      BA.1.1
EPI_ISL_2260658 Germany 2021-05-25      BA.1.1
EPI_ISL_2450266 Germany 2021-06-08      BA.1.1
EPI_ISL_2637014 Germany 2021-06-22      BA.1.1
EPI_ISL_1970212 India   2021-05-09      BA.1.1
EPI_ISL_2189609 India   2021-05-20      BA.1.1
EPI_ISL_2460436 India   2021-06-09      BA.1.1
EPI_ISL_2460520 India   2021-06-09      BA.1.1
EPI_ISL_2460937 India   2021-06-09      BA.1.1
EPI_ISL_2461034 India   2021-06-09      BA.1.1
EPI_ISL_2461079 India   2021-06-09      BA.1.1
EPI_ISL_2461108 India   2021-06-09      BA.1.1
EPI_ISL_2461225 India   2021-06-09      BA.1.1
EPI_ISL_2461319 India   2021-06-09      BA.1.1
EPI_ISL_2461320 India   2021-06-09      BA.1.1
EPI_ISL_2461323 India   2021-06-09      BA.1.1
EPI_ISL_2461329 India   2021-06-09      BA.1.1
EPI_ISL_2461336 India   2021-06-09      BA.1.1
EPI_ISL_2461347 India   2021-06-09      BA.1.1
EPI_ISL_2461655 India   2021-06-09      BA.1.1
EPI_ISL_2461925 India   2021-06-09      BA.1.1
EPI_ISL_2461978 India   2021-06-09      BA.1.1
EPI_ISL_2504595 India   2021-06-13      BA.1.1
EPI_ISL_2504722 India   2021-06-13      BA.1.1
EPI_ISL_2504730 India   2021-06-13      BA.1.1
EPI_ISL_2504832 India   2021-06-13      BA.1.1
EPI_ISL_2504833 India   2021-06-13      BA.1.1
EPI_ISL_2504834 India   2021-06-13      BA.1.1
EPI_ISL_2504835 India   2021-06-13      BA.1.1
EPI_ISL_2504947 India   2021-06-13      BA.1.1
EPI_ISL_2555728 India   2021-06-16      BA.1.1
EPI_ISL_2555755 India   2021-06-16      BA.1.1
EPI_ISL_2555771 India   2021-06-16      BA.1.1
EPI_ISL_2555775 India   2021-06-16      BA.1.1
EPI_ISL_2555800 India   2021-06-16      BA.1.1
EPI_ISL_2555804 India   2021-06-16      BA.1.1
EPI_ISL_2555832 India   2021-06-16      BA.1.1
EPI_ISL_2555842 India   2021-06-16      BA.1.1
EPI_ISL_2555910 India   2021-06-16      BA.1.1
EPI_ISL_2556008 India   2021-06-16      BA.1.1
EPI_ISL_2556023 India   2021-06-16      BA.1.1
EPI_ISL_2556097 India   2021-06-16      BA.1.1
EPI_ISL_2556181 India   2021-06-16      BA.1.1
EPI_ISL_2556185 India   2021-06-16      BA.1.1
EPI_ISL_2556268 India   2021-06-16      BA.1.1
EPI_ISL_2556465 India   2021-06-16      BA.1.1
EPI_ISL_2556847 India   2021-06-16      BA.1.1
EPI_ISL_2556957 India   2021-06-16      BA.1.1
EPI_ISL_1656493 Israel  2021-04-20      BA.1.1
EPI_ISL_2545260 Israel  2021-06-15      BA.1.1
EPI_ISL_2545291 Israel  2021-06-15      BA.1.1
EPI_ISL_1087406 Italy   2021-02-26      BA.1.1
EPI_ISL_1299047 Italy   2021-03-19      BA.1.1
EPI_ISL_1324028 Italy   2021-03-24      BA.1.1
EPI_ISL_1390531 Italy   2021-03-29      BA.1.1
EPI_ISL_1557653 Italy   2021-04-12      BA.1.1
EPI_ISL_1558337 Italy   2021-04-12      BA.1.1
EPI_ISL_1670898 Italy   2021-04-21      BA.1.1
EPI_ISL_1828784 Italy   2021-05-03      BA.1.1
EPI_ISL_1970576 Italy   2021-05-09      BA.1.1
EPI_ISL_2602652 Kenya   2021-06-18      BA.1.1
EPI_ISL_2362124 Portugal        2021-05-31      BA.1.1
EPI_ISL_2362125 Portugal        2021-05-31      BA.1.1
EPI_ISL_2688275 Puerto Rico     2021-06-25      BA.1.1
EPI_ISL_1713507 Qatar   2021-04-25      BA.1.1
EPI_ISL_2408388 Qatar   2021-06-04      BA.1.1
EPI_ISL_1169867 Spain   2021-03-06      BA.1.1
EPI_ISL_1970393 Sri Lanka       2021-05-09      BA.1.1
EPI_ISL_1197222 Sweden  2021-03-10      BA.1.1
EPI_ISL_1290529 Sweden  2021-03-18      BA.1.1
EPI_ISL_1360895 Switzerland     2021-03-25      BA.1.1
EPI_ISL_1407033 Switzerland     2021-03-31      BA.1.1
EPI_ISL_1658254 Switzerland     2021-04-20      BA.1.1
EPI_ISL_2017203 Switzerland     2021-05-11      BA.1.1
EPI_ISL_1052519 United Kingdom  2021-02-23      BA.1.1
EPI_ISL_2542692 United Kingdom  2021-06-15      BA.1.1
EPI_ISL_1203832 USA     2021-03-11      BA.1.1
EPI_ISL_1306649 USA     2021-03-22      BA.1.1
EPI_ISL_1306697 USA     2021-03-22      BA.1.1
EPI_ISL_1307056 USA     2021-03-22      BA.1.1
EPI_ISL_1307189 USA     2021-03-22      BA.1.1
EPI_ISL_1307450 USA     2021-03-22      BA.1.1
EPI_ISL_1560386 USA     2021-04-12      BA.1.1
EPI_ISL_1561957 USA     2021-04-12      BA.1.1
EPI_ISL_1609299 USA     2021-04-15      BA.1.1
EPI_ISL_1621935 USA     2021-04-16      BA.1.1
EPI_ISL_1680753 USA     2021-04-22      BA.1.1
EPI_ISL_1687096 USA     2021-04-22      BA.1.1
EPI_ISL_1791523 USA     2021-04-29      BA.1.1
EPI_ISL_1801074 USA     2021-04-29      BA.1.1
EPI_ISL_1801315 USA     2021-04-29      BA.1.1
EPI_ISL_1801882 USA     2021-04-29      BA.1.1
EPI_ISL_1839267 USA     2021-05-03      BA.1.1
EPI_ISL_1839467 USA     2021-05-03      BA.1.1
EPI_ISL_1909571 USA     2021-05-05      BA.1.1
EPI_ISL_2045439 USA     2021-05-12      BA.1.1
EPI_ISL_2289174 USA     2021-05-13      BA.1.1
EPI_ISL_2104936 USA     2021-05-16      BA.1.1
EPI_ISL_2190440 USA     2021-05-20      BA.1.1
EPI_ISL_2190699 USA     2021-05-20      BA.1.1
EPI_ISL_2190737 USA     2021-05-20      BA.1.1
EPI_ISL_2190840 USA     2021-05-20      BA.1.1
EPI_ISL_2205140 USA     2021-05-21      BA.1.1
EPI_ISL_2212021 USA     2021-05-21      BA.1.1
EPI_ISL_2225473 USA     2021-05-21      BA.1.1
EPI_ISL_2232216 USA     2021-05-23      BA.1.1
EPI_ISL_2232242 USA     2021-05-23      BA.1.1
EPI_ISL_2232245 USA     2021-05-23      BA.1.1
EPI_ISL_2233340 USA     2021-05-24      BA.1.1
EPI_ISL_2273122.2       USA     2021-05-25      BA.1.1
EPI_ISL_2305122 USA     2021-05-27      BA.1.1
EPI_ISL_2323563 USA     2021-05-28      BA.1.1
EPI_ISL_2534235 USA     2021-06-14      BA.1.1
EPI_ISL_2549328 USA     2021-06-16      BA.1.1
EPI_ISL_2611157 USA     2021-06-18      BA.1.1
EPI_ISL_2686512 USA     2021-06-25      BA.1.1
EPI_ISL_2697699 USA     2021-06-28      BA.1.1
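
For anyone who wants to repeat this check on their own dump, a minimal Python sketch of the date-window filter follows. It assumes a tab-separated file with `date` and `lineage` columns as in the table above and treats both ends of the window as inclusive; the function name `early_calls` is hypothetical.

```python
import csv
import io
from datetime import date

def early_calls(tsv_text, lineage, start, end):
    """Return rows assigned `lineage` whose date lies in [start, end]."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        sampled = date.fromisoformat(row["date"])
        if row["lineage"] == lineage and start <= sampled <= end:
            hits.append(row)
    return hits

# Two rows copied from the table above, with its original header.
sample = (
    "accession\tzip\tdate\tlineage\n"
    "EPI_ISL_1718297\tArmenia\t2021-04-26\tBA.1.1\n"
    "EPI_ISL_2192464\tBelgium\t2021-05-20\tBA.1.1\n"
)

hits = early_calls(sample, "BA.1.1", date(2021, 2, 1), date(2021, 7, 1))
print([row["accession"] for row in hits])
```

Any BA.1.1 call that falls in this window is worth a second look, since the window ends several months before the November 2021 cutoff used later in this thread.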
animesh-workplace commented 2 years ago

This change was observed after the constellation update from 0.1.1 to 0.1.2 where this was introduced

Update to ensure that more lower quality samples that could be classified as sublineages BA.* get a "Probable Omicron (BA.*-like)" call instead of a parent call. Make the parent "Omicron (Unclassified)" and remove mrca_lineage field from it so that pangolin does not call lineage B.1.1.529 (there are no designated sequences)

Might be worth looking into constellation or scorpio for the same @corneliusroemer.

corneliusroemer commented 2 years ago

The first (and only) sequence I looked at is a totally normal Alpha/B.1.1.7

Something must have happened in the latest pangoLEARN release, the designations for BA.1.1 were done directly from my custom Omicron build by @chrisruis so I'm pretty confident they are clean.

This is a spuriously misclassified sequence: hCoV-19/Germany/BY-RKI-I-046610/2021|EPI_ISL_1354034|2021-03-10 image

corneliusroemer commented 2 years ago

Magnitude of the problem: ca. 1 in 5000 Alpha sequences gets misclassified as BA.1.1. Neither BA.1 nor BA.2 has such false positives.

image

https://cov-spectrum.org/explore/World/AllSamples/from=2020-01-06&to=2021-10-29/variants?aaMutations=ORF8%3AQ27*&pangoLineage=BA.*

I queried specifically for the Alpha ORF8 stop, because around 200 sequences from pre Nov 2021 without that stop could just be date entry errors. See, for example, these Italian sequences from the first few days of January 2022: image

rmcolq commented 2 years ago

I've been tracking down the problems here and with the related https://github.com/cov-lineages/pangolin/issues/366 issue. Firstly, it does look like pangoLEARN is overclassifying BA.1.1 sequences. A new model is training. In the meantime, it surprised me that this was not being caught by scorpio, but there appear to be 2 things going on there:

  1. The way the false positive overwrite is currently written in pangolin, it was not expanding the alias for the scorpio VOC/VUI list. An easy fix.
  2. These sequences are not matching the current scorpio definition of B.1.1.7. The one I looked at had too many ambiguous bases and therefore missed the alt allele threshold. Now that scorpio has a way of defining "Probable" sequences, we could add a second definition to capture these if we are confident that they should be Alpha. The examples I've seen that are being misclassified either have lots of ambiguous bases or too many ref calls.
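
To illustrate why point 1 matters, here is a minimal Python sketch of alias expansion before comparing a call against a VOC/VUI list. The alias table, the list contents, and the function names are assumptions made for this example and are not taken from pangolin's code; the only fact relied on is that BA is the Pango alias of B.1.1.529.

```python
# Hypothetical, abbreviated alias table (BA is the alias of B.1.1.529).
ALIASES = {"BA": "B.1.1.529"}

# Illustrative VOC/VUI list written with unaliased names.
VOC_LIST = ["B.1.1.7", "B.1.1.529"]

def expand_alias(lineage, aliases=ALIASES):
    """Replace a leading alias such as 'BA' with its full lineage name."""
    prefix, _, rest = lineage.partition(".")
    full_prefix = aliases.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix

def is_same_or_descendant(lineage, ancestor):
    """Compare two lineages after alias expansion."""
    lineage, ancestor = expand_alias(lineage), expand_alias(ancestor)
    return lineage == ancestor or lineage.startswith(ancestor + ".")

def matching_voc(lineage, voc_list=VOC_LIST):
    """Return the first VOC/VUI entry the lineage belongs to, else None."""
    for voc in voc_list:
        if is_same_or_descendant(lineage, voc):
            return voc
    return None

print(expand_alias("BA.1.1"))  # B.1.1.529.1.1
print(matching_voc("BA.1.1"))  # B.1.1.529
print(matching_voc("B.1"))     # None
```

Without the expansion step, a plain string comparison of "BA.1.1" against "B.1.1.529" finds no match, so a false positive check keyed on the unaliased name would silently skip BA.* calls.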
corneliusroemer commented 2 years ago

Thanks! There is no anti-scorpio that says: this is definitely not an Omicron? So instead of making dodgy Alphas Alphas, we could make definitely not Omicrons None. I mean, what does Scorpio say about this being Omicron?

Or is this the point 1 you mentioned, which failed because alias expansion was not happening? And is BA.* not checked against the B.1.1.529 rule?

rmcolq commented 2 years ago

The expected behaviour (that was not happening) was that scorpio did not think it was omicron, and pangolin ought then to override the lineage assignment with None. So yes, this is my point 1. And the reason it wasn't checked against B.1.1.529 is because we have discontinued lineage assignments of B.1.1.529 as there are no designated sequences and it was causing confusion by being assigned to sequences which just have problems with low quality/ref calls.
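
A minimal Python sketch of that expected behaviour, under stated assumptions: the function `final_call` and its boolean inputs are hypothetical stand-ins and do not reflect pangolin's actual interface.

```python
from typing import Optional

def final_call(model_lineage: str, model_is_voc: bool,
               scorpio_agrees: bool) -> Optional[str]:
    """Override a VOC/VUI call from the model when scorpio does not back it.

    model_is_voc: the model's lineage falls under a VOC/VUI entry
                  (after alias expansion).
    scorpio_agrees: scorpio's constellation call supports that VOC/VUI.
    """
    if model_is_voc and not scorpio_agrees:
        return None  # treated as a false positive
    return model_lineage

print(final_call("BA.1.1", model_is_voc=True, scorpio_agrees=False))  # None
print(final_call("BA.1.1", model_is_voc=True, scorpio_agrees=True))   # BA.1.1
print(final_call("B.1", model_is_voc=False, scorpio_agrees=False))    # B.1
```

The override only fires for VOC/VUI calls, so non-VOC assignments such as B.1 pass through unchanged.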

rmcolq commented 2 years ago

These false positive BA.1.1 get lineage assignment "None" with the latest release

hoelzer commented 2 years ago

Thanks, that's great!