PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Missing a lot of sequences #528

Closed: ArtPoon closed this issue 2 months ago

ArtPoon commented 3 months ago

Currently we are only displaying 3 samples categorized as BA.2.86 (screenshot not shown).

However, the GISAID database currently holds 776 records with this lineage label.

ArtPoon commented 3 months ago

[covizu@Paphlagon data]$ unxz -c provision.2024-04-13T00\:00\:06.json.xz | grep -c "\"covv_lineage\": \"BA.2.86\""
947
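
The grep count above depends on the exact spacing of the serialized JSON, and the unescaped dots in the pattern act as regex wildcards. A minimal Python sketch of the same count with parsed records is below. It assumes the provision file holds one JSON object per line; the function name `count_lineage` and its parameters are hypothetical and not part of covizu.

```python
import json
import lzma


def count_lineage(path, lineage, field="covv_lineage"):
    """Count records in an xz-compressed JSON-lines file with an exact lineage match.

    Assumption: one JSON object per line, each with a `covv_lineage` key.
    """
    count = 0
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Exact string comparison, so sublineages such as BA.2.86.1 are not counted
            if record.get(field) == lineage:
                count += 1
    return count
```

Calling `count_lineage("provision.2024-04-13T00:00:06.json.xz", "BA.2.86")` should agree with the grep count if the one-record-per-line assumption holds.
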
GopiGugan commented 3 months ago

=# select count(accession) from sequences where lineage='BA.2.86';
 count
-------
   916
(1 row)

It looks like we have 916 records in the database. Checking to see whether these sequences are being filtered out.

GopiGugan commented 3 months ago

Of the 918 BA.2.86 records that made it to the filter_problematic function:

https://github.com/PoonLab/covizu/blob/ca3379dc823bda5b6e849820a861d60047699d84/covizu/utils/gisaid_utils.py#L196-L267

850 sequences were filtered out as outliers and 65 were filtered out for having a lot of missing sites, which leaves only 3 records.

ArtPoon commented 3 months ago

OK, I think we have to turn off molecular clock filtering for now. Let's do the following:

ArtPoon commented 3 months ago

Alternatively, it might be easier to modify QPois to return False for every call to is_outlier when it is initialized with cutoff=0.
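
A rough sketch of that idea, for illustration only. `QPoisSketch` is a hypothetical stand-in and not the actual covizu class: the constructor arguments, the per-day rate, the origin date, and the two-sided Poisson tail test are all assumptions. The only part taken from the comment above is the early return when cutoff is 0.

```python
import math
from datetime import date


class QPoisSketch:
    """Hypothetical stand-in for QPois; not the covizu implementation."""

    def __init__(self, cutoff, rate, origin=date(2019, 12, 1)):
        self.cutoff = cutoff  # tail probability; 0 disables the filter
        self.rate = rate      # expected differences per day (assumed units)
        self.origin = origin  # assumed start of the molecular clock

    @staticmethod
    def _poisson_cdf(k, mu):
        """Return P(X <= k) for X ~ Poisson(mu)."""
        if k < 0:
            return 0.0
        term = math.exp(-mu)
        total = term
        for i in range(1, k + 1):
            term *= mu / i
            total += term
        return min(total, 1.0)

    def is_outlier(self, coldate, ndiffs):
        # Short-circuit: cutoff=0 means clock filtering is switched off
        if self.cutoff == 0:
            return False
        days = max((coldate - self.origin).days, 0)
        mu = self.rate * days
        lower = self._poisson_cdf(ndiffs, mu)            # P(X <= ndiffs)
        upper = 1.0 - self._poisson_cdf(ndiffs - 1, mu)  # P(X >= ndiffs)
        return lower < self.cutoff or upper < self.cutoff
```

With this shape, callers would not need to change: an instance built with cutoff=0 reports every record as a non-outlier, so only the missing-sites filter would still apply.
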

ArtPoon commented 3 months ago

Reprocessing everything in the database to update variants with this change.