GenSpectrum / cov-spectrum-website

A web platform to detect and analyze variants of SARS-CoV-2
https://cov-spectrum.org
GNU General Public License v3.0

ENH: Allow querying for `Unknown` lineage in advanced query #425

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

Right now, it does not seem possible to include `Unknown` lineages in advanced queries.

This would be very useful to filter out sequences that didn't pass pango QC or are otherwise of questionable quality.

For example, most sequences that are neither Delta nor Omicron right now are from Austria. These are probably environmental samples containing a mix of Delta, BA.1 and BA.2, which pango rejects.

It would be nice if a query like this worked: `!(B.1.1.529* | B.1.617.2* | Unknown)`

chaoran-chen commented 2 years ago

Would `!A* & !B*` be equivalent to `Unknown`?

corneliusroemer commented 2 years ago

Almost, but the X lineages (recombinants) are not covered this way 😬

Also interesting, you seem to have two types of Unknown:

image

https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?variantQuery=%21A*+%26+%21B*
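To illustrate why `!A* & !B*` returns more than just `Unknown`, here is a minimal Python sketch. It is not how CoV-Spectrum or LAPIS evaluate queries; `ALIASES`, `unalias`, `matches` and `not_a_and_not_b` are hypothetical names, the alias table is truncated to four entries, and treating a missing lineage as "matches nothing" is an assumption based on the query result linked above.

```python
from typing import Optional

# Heavily truncated alias table, for illustration only.
# The real Pango alias key contains many more entries.
ALIASES = {
    "BA": "B.1.1.529",
    "AY": "B.1.617.2",
    "P": "B.1.1.28",
    "C": "B.1.1.1",
}


def unalias(lineage: str) -> str:
    """Expand the leading alias of a lineage name, if it is in ALIASES."""
    head, _, rest = lineage.partition(".")
    full = ALIASES.get(head, head)
    return f"{full}.{rest}" if rest else full


def matches(lineage: Optional[str], pattern: str) -> bool:
    """Match a lineage against a pattern; a trailing '*' includes sub-lineages."""
    if lineage is None:
        # Assumption: a missing lineage matches no lineage pattern.
        return False
    full = unalias(lineage)
    if pattern.endswith("*"):
        root = unalias(pattern[:-1])
        return full == root or full.startswith(root + ".")
    return full == unalias(pattern)


def not_a_and_not_b(lineage: Optional[str]) -> bool:
    """Evaluate the query `!A* & !B*` for one sequence."""
    return not matches(lineage, "A*") and not matches(lineage, "B*")


samples = ["A.2", "B.1.1.7", "AY.4", "BA.2", "P.1", "XA", None]
selected = [s for s in samples if not_a_and_not_b(s)]
print(selected)
```

In this sketch the recombinant `XA` passes the filter together with the missing lineage, because neither descends from `A` or `B`.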

corneliusroemer commented 2 years ago

And there is yet another one: `unclassifiable`.

Do you know what they all are? I have never seen pango report `Unknown` or `unclassifiable`.

Are these labels applied by CoV-Spectrum based on some rules?

chaoran-chen commented 2 years ago

What is unclassifiable?

corneliusroemer commented 2 years ago

Oh sorry, I seem to have forgotten the screenshot. Here:

image

Any idea where these 3 types of non-lineage come from? In particular, the two types of `Unknown` are puzzling. It would be good to know how they map to pango lineage calls.

chaoran-chen commented 2 years ago

In the GISAID dataset, the pango lineage attribute can contain the values "" and "None". LAPIS was mapping "None" to null but did not consider the empty string. When the data reached the frontend, both null and "" were displayed as Unknown, which is why two separate Unknown entries appeared.

In the Nextstrain/GenBank dataset, the pango lineage can be "None" and "unclassifiable". Do you know the difference?

Now, LAPIS will map all three values ("", "None" and "unclassifiable") to null, which will then be shown as Unknown on CoV-Spectrum. This will become fully visible tomorrow.
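The mapping described above can be sketched roughly as follows. This is only an illustration in Python and not the actual LAPIS code; `MISSING_LINEAGE_VALUES`, `normalize_lineage` and `display_lineage` are hypothetical names, and stripping surrounding whitespace is an added assumption.

```python
from typing import Optional

# The three raw values mentioned in this thread that carry no lineage call.
MISSING_LINEAGE_VALUES = {"", "None", "unclassifiable"}


def normalize_lineage(raw: Optional[str]) -> Optional[str]:
    """Map every 'no lineage' value to None (null) on the backend side."""
    if raw is None:
        return None
    value = raw.strip()  # assumption: surrounding whitespace is not meaningful
    if value in MISSING_LINEAGE_VALUES:
        return None
    return value


def display_lineage(lineage: Optional[str]) -> str:
    """Frontend-side label: a null lineage is shown as a single 'Unknown'."""
    return "Unknown" if lineage is None else lineage


raw_values = ["B.1.1.529", "", "None", "unclassifiable", None]
labels = [display_lineage(normalize_lineage(v)) for v in raw_values]
print(labels)
```

With the normalization done in one place, the frontend only ever sees null for a missing lineage, so only one `Unknown` entry should remain.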

As a next step, I will think about how to filter for null in LAPIS.