Closed (hoelzer closed 2 years ago)
This change was observed after the constellation update from 0.1.1 to 0.1.2 where this was introduced
Update to ensure that more lower quality samples that could be classified as sublineages BA. get a "Probable Omicron (BA.-like)" call instead of a parent call. Make the parent "Omicron (Unclassified)" and remove mrca_lineage field from it so that pangolin does not call lineage B.1.1.529 (there are no designated sequences)
Might be worth looking into constellation or scorpio for the same @corneliusroemer.
The first (and only) sequence I looked at is a totally normal Alpha/B.1.1.7
Something must have happened in the latest pangoLEARN release, the designations for BA.1.1 were done directly from my custom Omicron build by @chrisruis so I'm pretty confident they are clean.
This is a spuriously misclassified sequence: hCoV-19/Germany/BY-RKI-I-046610/2021|EPI_ISL_1354034|2021-03-10
Magnitude of the problem: ca. 1 in 5000 Alpha sequences gets misclassified as BA.1.1; neither BA.1 nor BA.2 has such false positives.
I queried specifically for the Alpha ORF8 stop, because around 200 sequences from pre Nov 2021 without that stop could just be date entry errors, see here for example Italian sequences from the first few days of January 2022
I've been tracking down the problems here and with the related https://github.com/cov-lineages/pangolin/issues/366 issue. Firstly, it does look like pangoLEARN is overclassifying BA.1.1 sequences. A new model is training. In the meantime, it surprised me that this was not being caught by scorpio, but there appear to be 2 things going on there:
Thanks! There is no anti-scorpio that says: this is definitely not an Omicron? So instead of making dodgy Alphas Alphas, we could make definitely not Omicrons None. I mean, what does Scorpio say about this being Omicron?
Or is this the point 1 you mentioned, which failed because alias expansion was not happening, so BA.* is not checked against the B.1.1.529 rule?
On Fri, Feb 4, 2022, 15:30 Rachel Colquhoun @.***> wrote:
I've been tracking down the problems here and with the related cov-lineages/pangolin#366 (https://github.com/cov-lineages/pangolin/issues/366) issue. Firstly, it does look like pangoLEARN is overclassifying BA.1.1 sequences. A new model is training. In the meantime, it surprised me that this was not being caught by scorpio, but there appear to be 2 things going on there:
- The way the false positive overwrite is currently written in pangolin, it was not expanding the alias for the scorpio VOC/VUI list. An easy fix.
- These sequences are not matching the current scorpio definition of B.1.1.7. The one I looked at had too many ambiguous bases and therefore missed the alt allele threshold. Now that scorpio has a way of defining "Probable" sequences, we could add a second definition to capture these if we are confident that they should be Alpha. The examples I've seen that are being misclassified either have lots of ambiguous bases or too many ref calls.
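To illustrate the second point, here is a minimal sketch of how ambiguous bases can push a sample below an alt allele threshold even though it shows no conflicting ref calls. This is not scorpio's actual code or API: the function names, the site names, and the threshold values are all hypothetical and chosen only for illustration.

```python
def count_calls(site_bases, definition):
    """Tally alt / ref / ambiguous calls at the defining sites.

    definition maps a (hypothetical) site name to (ref_base, alt_base);
    site_bases maps the same site names to the base seen in the sample.
    A missing site is treated as ambiguous ("N").
    """
    counts = {"alt": 0, "ref": 0, "ambig": 0}
    for site, (ref, alt) in definition.items():
        base = site_bases.get(site, "N")
        if base == alt:
            counts["alt"] += 1
        elif base == ref:
            counts["ref"] += 1
        else:
            counts["ambig"] += 1
    return counts


def matches(counts, min_alt, max_ref):
    """A sample matches if it has enough alt calls and few enough ref calls."""
    return counts["alt"] >= min_alt and counts["ref"] <= max_ref


# Hypothetical six-site definition and a low-quality sample.
definition = {f"site{i}": ("A", "G") for i in range(1, 7)}
sample = {"site1": "G", "site2": "G", "site3": "G", "site4": "N", "site5": "N"}

counts = count_calls(sample, definition)
strict = matches(counts, min_alt=4, max_ref=1)    # assumed main definition
probable = matches(counts, min_alt=3, max_ref=1)  # assumed looser "Probable" definition
```

With these assumed thresholds the sample misses the strict definition but would be picked up by a looser "Probable" one, which is the kind of second definition suggested above.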
The expected behaviour (that was not happening) was that scorpio did not think it was omicron, and pangolin ought then to override the lineage assignment with None. So yes, this is my point 1. And the reason it wasn't checked against B.1.1.529 is because we have discontinued lineage assignments of B.1.1.529 as there are no designated sequences and it was causing confusion by being assigned to sequences which just have problems with low quality/ref calls.
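A minimal sketch of the expected behaviour described here, assuming a simplified view of the check: expand aliases on both sides before comparing, and return None when the model calls a VOC lineage that scorpio does not support. This is not pangolin's actual implementation; `expand_alias`, `is_descendant`, `final_lineage`, and the `scorpio_lineage` parameter are hypothetical names, and the alias table holds only the one entry needed for the example (BA is the Pango alias of B.1.1.529).

```python
ALIASES = {"BA": "B.1.1.529"}  # illustrative subset of the alias table


def expand_alias(lineage, aliases=ALIASES):
    """Replace an aliased prefix (e.g. BA) with its full lineage name."""
    prefix, _, rest = lineage.partition(".")
    if prefix in aliases:
        return aliases[prefix] + ("." + rest if rest else "")
    return lineage


def is_descendant(lineage, parent):
    """True if lineage equals parent or sits below it, after alias expansion."""
    full, parent_full = expand_alias(lineage), expand_alias(parent)
    return full == parent_full or full.startswith(parent_full + ".")


def final_lineage(model_lineage, scorpio_lineage, voc_parents):
    """Override the model call with None when scorpio does not back a VOC call.

    scorpio_lineage is the lineage scorpio supports for the sample, or None
    if no constellation matched (a hypothetical simplification).
    """
    for parent in voc_parents:
        if is_descendant(model_lineage, parent):
            if scorpio_lineage is None or not is_descendant(scorpio_lineage, parent):
                return None
    return model_lineage


voc_parents = ["B.1.1.529", "B.1.1.7"]
overridden = final_lineage("BA.1.1", None, voc_parents)
kept = final_lineage("BA.1.1", "B.1.1.529", voc_parents)
```

Without the expansion step, a plain string comparison of "BA.1.1" against "B.1.1.529" finds no relationship, so the override never fires; that is the failure mode in point 1.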
These false positive BA.1.1 get lineage assignment "None" with the latest release
Thanks, that's great!
Hey pango-team!
After the new BA.1.1 model was added to pangolin, several of our German BA.1 sequences got reassigned to this sublineage. Most of them have the S:R346K change so this makes total sense.
However, we also discovered a few sequences from early days (for example, sampled between Feb and Jun 2021) that were previously assigned B.1 and were now assigned BA.1.1 via Pangolin v3.1.17 and PangoLEARN 2022-01-20.
Very likely, these are misclassified as BA.1.1, judging by the sampling date but also the mutation profile (see below). Maybe the PangoLEARN model can/should be further specified? These are only a few sequences out of ~24k German BA.1.1, but tools relying on the data might now show quite early BA.1.1 Omicron sequences that are very likely false-positive assignments.
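For anyone who wants to screen their own data in the meantime, a sampling-date check can be sketched roughly as follows. The record layout, the sample IDs, and the 2021-11-01 cutoff are assumptions made for this example, not part of any pangolin output format.

```python
from datetime import date

OMICRON_CUTOFF = date(2021, 11, 1)  # assumed cutoff, adjust as needed


def flag_early_omicron(records, cutoff=OMICRON_CUTOFF):
    """Return IDs of records assigned a BA.* lineage but sampled before cutoff."""
    return [
        r["id"]
        for r in records
        if r["lineage"].startswith("BA.")
        and date.fromisoformat(r["sampling_date"]) < cutoff
    ]


# Hypothetical records for illustration only.
records = [
    {"id": "sample_A", "lineage": "BA.1.1", "sampling_date": "2021-03-10"},
    {"id": "sample_B", "lineage": "BA.1.1", "sampling_date": "2022-01-15"},
    {"id": "sample_C", "lineage": "B.1.1.7", "sampling_date": "2021-03-10"},
]

suspicious = flag_early_omicron(records)
```

A flagged record is only a candidate: it could be a misclassification or simply a date entry error, so the mutation profile still needs a look.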
Here are the German GISAID IDs, together with the sampling dates and the older lineage assignment:
And here are the amino acid profiles:
I also checked our GISAID data dump quickly and found 144x BA.1.1 sampled between 2021-02-01:2021-07-01 (sorted by Country, also includes the German IDs mentioned above):