cov-lineages / scorpio

serious constellations of reoccurring phylogenetically-independent origin
GNU General Public License v3.0

Pangolin v1.12 as used by GISAID misclassifies a lot of true BA.5* as BA.2* #48

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

Using covSpectrum's advanced queries, I've noticed that the Pango assignments that come from GISAID are quite often wrong. I think GISAID still uses pangoLEARN rather than UShER. They say they are using designation version 1.12.

In Poland as many as 30% of sequences are misclassified as BA.2 even though they are true BA.5. In Germany around 5% are misclassified.
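For illustration, a share like the ones above could be computed from paired assignments roughly as follows. This is only a sketch: the record layout, the function names and the sample data are hypothetical, the denominator is assumed to be all sequences that Nextclade places in BA.5*, and the simple prefix match ignores Pango aliases (for example BE.*, which sits under BA.5).

```python
def is_in_lineage(assigned, parent):
    """True if `assigned` is `parent` or a dotted sublineage of it (aliases ignored)."""
    return assigned == parent or assigned.startswith(parent + ".")


def misclassified_share(records, true_parent="BA.5", wrong_parent="BA.2"):
    """Share of sequences Nextclade puts in `true_parent`* that GISAID puts in `wrong_parent`*."""
    truth = [r for r in records if is_in_lineage(r["nextclade"], true_parent)]
    if not truth:
        return 0.0
    wrong = [r for r in truth if is_in_lineage(r["gisaid"], wrong_parent)]
    return len(wrong) / len(truth)


# Hypothetical sample records, not real GISAID data.
records = [
    {"nextclade": "BA.5.1", "gisaid": "BA.2"},
    {"nextclade": "BA.5", "gisaid": "BA.5"},
    {"nextclade": "BA.5.2", "gisaid": "BA.5.2"},
    {"nextclade": "BA.2.12.1", "gisaid": "BA.2.12.1"},
    {"nextclade": "BA.5.2.1", "gisaid": "BA.2.9"},
]

share = misclassified_share(records)
print(f"{share:.0%} of Nextclade BA.5* sequences are BA.2* according to GISAID")
```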

Is this due to Scorpio or pangoLEARN?

Something I noticed when looking at a sample of misassigned sequences is that many of them are missing the RBD, but that shouldn't stop pangoLEARN/Scorpio from being confident that (most of) these are true BA.5.

Here's the full list of sequences that GISAID calls BA.2 but that are BA.5 by Nextclade: https://lapis.cov-spectrum.org/gisaid/v1/sample/gisaid-epi-isl?region=Europe&dateFrom=2022-05-09&variantQuery=nextcladePangoLineage%3ABA.5*++%26+BA.2*&host=Human&accessKey=9Cb3CqmrFnVjO3XCxQLO6gUnKPd&orderBy=random
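A query of this shape can also be assembled programmatically. The sketch below rebuilds the parameters visible in the link above with Python's standard library; the function name is made up for illustration, the access key is deliberately left out, and the percent-encoding it produces may differ cosmetically from the link (for example `*` becomes `%2A`) while decoding to the same kind of query. It makes no request, and whether the endpoint still accepts these parameters is not checked here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the link in this comment.
LAPIS_ENDPOINT = "https://lapis.cov-spectrum.org/gisaid/v1/sample/gisaid-epi-isl"


def build_mismatch_query(nextclade_lineage="BA.5*", gisaid_lineage="BA.2*",
                         region="Europe", date_from="2022-05-09"):
    """Build a URL for samples Nextclade calls one lineage and GISAID calls another."""
    # Same shape as the variantQuery in the link: a Nextclade lineage AND a GISAID lineage.
    variant_query = f"nextcladePangoLineage:{nextclade_lineage} & {gisaid_lineage}"
    params = {
        "region": region,
        "dateFrom": date_from,
        "variantQuery": variant_query,
        "host": "Human",
        "orderBy": "random",
    }
    return f"{LAPIS_ENDPOINT}?{urlencode(params)}"


url = build_mismatch_query()
print(url)

# Decode the query string again to confirm the variant query survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["variantQuery"][0])
```

An access key, where required, would need to be added as a further parameter.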

Here's a sample screenshot from Nextclade showing the RBD region:

image

Query: (https://cov-spectrum.org/explore/Europe/AllSamples/Past3M/variants?variantQuery=nextcladePangoLineage%3ABA.5*++%26+BA.2*&aaMutations1=S%3A346&pangoLineage1=BA.5*&)

image image

corneliusroemer commented 2 years ago

I ran it locally and this seems to be a Scorpio issue:

image

aineniamh commented 2 years ago

It's been flagged to GISAID that the mode used should be the default UShER mode, which no longer gets overwritten by Scorpio. With the assignment cache @AngieHinrichs prepares, it should be fast enough for all purposes. I'd recommend running in UShER mode if you're seeing these misassignments! If you want to do a constellation PR for the Scorpio issue, I'm happy to take a look!
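As a rough sketch of what running in UShER mode might look like when scripted, the snippet below builds a pangolin command line from Python. It assumes pangolin v4, where the mode is selected with `--analysis-mode`; the file names and helper functions are hypothetical, and nothing is executed unless `run_pangolin` is called on a machine with pangolin installed.

```python
import shutil
import subprocess


def build_pangolin_command(fasta_path, outfile="lineage_report.csv", mode="usher"):
    """Return the argument list for a pangolin run in the given analysis mode."""
    # Assumes pangolin v4, where --analysis-mode accepts "usher".
    return ["pangolin", fasta_path, "--analysis-mode", mode, "--outfile", outfile]


def run_pangolin(fasta_path, **kwargs):
    """Run pangolin if it is available on PATH; raise otherwise."""
    if shutil.which("pangolin") is None:
        raise RuntimeError("pangolin not found on PATH")
    subprocess.run(build_pangolin_command(fasta_path, **kwargs), check=True)


# "sequences.fasta" is a placeholder input file name.
command = build_pangolin_command("sequences.fasta")
print(" ".join(command))
```

Comparing the resulting report against the GISAID assignments for the same sequences would show whether the BA.2*/BA.5* discrepancy persists in this mode.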