Open corneliusroemer opened 2 years ago
Would !A* & !B* be equivalent to unknown?
Almost, the Xs are not covered this way 😬
Also interesting, you seem to have two types of Unknown:
https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?variantQuery=%21A*+%26+%21B*
And there is yet more: unclassifiable.
Do you know what they all are? I have never seen pango report Unknown or unclassifiable.
Are these labels applied by covSpectrum based on some rules?
What is unclassifiable?
Oh sorry I seem to have forgotten the screenshot. Here:
Any idea where these 3 types of non-lineage come from? In particular, the two types of Unknown are puzzling. It would be good to know how they map to pango lineage calls.
In the GISAID dataset, the pango lineage attribute can contain the values "" and "None". LAPIS was mapping "None" to null but did not consider the empty string. When the data reached the frontend, both null and "" were displayed as Unknown.
In the Nextstrain/GenBank dataset, the pango lineage can be "None" or "unclassifiable" - do you know the difference?
Now, LAPIS will map all three values to null, which will then be shown as Unknown on CoV-Spectrum. This will become fully visible tomorrow.
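To make the mapping described above concrete, here is a minimal Python sketch. It is not LAPIS's actual code: the function names `normalize_pango_lineage` and `display_label` are hypothetical, and the set of placeholder values is taken only from what was mentioned in this thread.

```python
from typing import Optional

# Assumption: the placeholder values named in this thread ("" and "None" in
# GISAID, "None" and "unclassifiable" in Nextstrain/GenBank).
MISSING_LINEAGE_VALUES = {"", "None", "unclassifiable"}


def normalize_pango_lineage(raw: Optional[str]) -> Optional[str]:
    """Return None for any placeholder value, otherwise the lineage name."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in MISSING_LINEAGE_VALUES else value


def display_label(lineage: Optional[str]) -> str:
    """Hypothetical frontend helper: show null lineages as 'Unknown'."""
    return "Unknown" if lineage is None else lineage


if __name__ == "__main__":
    for raw in ["", "None", "unclassifiable", "B.1.1.7"]:
        print(repr(raw), "->", display_label(normalize_pango_lineage(raw)))
```

With a single normalization step like this, all three raw values end up under one Unknown label instead of several.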
As a next step, I will think about how to filter for null in LAPIS.
Right now, it does not seem possible to include Unknown lineages in advanced queries. This would be very useful for filtering out sequences that didn't pass pango QC or are otherwise of questionable quality.
For example, most sequences that are neither Delta nor Omicron right now are from Austria. Probably environmental samples that are a mix of Delta, BA.1 and BA.2 which pango rejects.
It would be nice if a query like this worked:
!(B.1.1.529* | B.1.617.2* | Unknown)
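As a rough illustration of the semantics being asked for, the following Python sketch evaluates that query against a handful of lineage values. It is only a sketch under assumptions: the helper names (`lineage`, `unknown`, `any_of`, `negate`) are hypothetical, Unknown is assumed to mean a null lineage, and `*` is treated as plain sublineage matching on the dotted name, so aliases such as BA.1 or AY.4 are not resolved the way CoV-Spectrum would need to.

```python
from typing import Callable, Optional

Predicate = Callable[[Optional[str]], bool]


def lineage(pattern: str) -> Predicate:
    """Match a lineage exactly; a trailing '*' also matches its sublineages."""
    if pattern.endswith("*"):
        base = pattern[:-1]
        return lambda value: value is not None and (
            value == base or value.startswith(base + ".")
        )
    return lambda value: value == pattern


def unknown() -> Predicate:
    """Assumption: 'Unknown' in a query means the lineage is null."""
    return lambda value: value is None


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over predicates."""
    return lambda value: any(p(value) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda value: not pred(value)


# !(B.1.1.529* | B.1.617.2* | Unknown)
query = negate(any_of(lineage("B.1.1.529*"), lineage("B.1.617.2*"), unknown()))

samples = ["B.1.617.2", "B.1.1.529.1", "B.1.1.7", None, "XE"]
kept = [s for s in samples if query(s)]

if __name__ == "__main__":
    print(kept)
```

Here the null sample is excluded along with the Delta and Omicron names, while `B.1.1.7` and the recombinant `XE` remain.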