PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Error retrieving a record from the database #532

Open GopiGugan opened 1 month ago

GopiGugan commented 1 month ago

The pipeline is failing because it tries to retrieve a database record that doesn't exist.

For example, sequences for lineage XDT were inserted into the database; however, the cluster information was not.

Originally, we thought that the lineage assignment for a sequence had changed in a recent provisions file. However, that does not seem to be the case.

ArtPoon commented 1 month ago

@GopiGugan currently rebuilding database, will investigate whether cluster information is reproducibly failing to be inserted for XDT

GopiGugan commented 1 month ago

> For example, sequences for lineage XDT were inserted into the database; however, the cluster information was not.

I believe there is no cluster record for XDT because the XDT records were previously filtered out by the filter_problematic function: https://github.com/PoonLab/covizu/blob/0631f5e6e46f072b2482e97cad0bb61efcc6b8eb/covizu/utils/gisaid_utils.py#L196-L200

Lineages that appear in by_lineage but were not previously inserted into the clusters table should be processed again:

diff --git a/batch.py b/batch.py
index 878ac70..91641d8 100644
--- a/batch.py
+++ b/batch.py
@@ -435,7 +435,17 @@ if __name__ == "__main__":
         SELECT DISTINCT LINEAGE FROM NEW_RECORDS;
         '''
         CUR.execute(UPDATED_LINEAGES_QUERY)
-        UPDATED_LINEAGES = [row['lineage'] for row in CUR.fetchall()]
+        new_records_lineages = [row['lineage'] for row in CUR.fetchall()]
+
+        by_lineage_list = list(by_lineage.keys())
+        clusters_lineages_query = '''
+        SELECT DISTINCT LINEAGE FROM CLUSTERS;
+        '''
+        CUR.execute(clusters_lineages_query)
+        clusters_lineages = [row['lineage'] for row in CUR.fetchall()]
+        unique_by_lineage = list(set(by_lineage_list) - set(clusters_lineages))
+
+        UPDATED_LINEAGES = list(set(new_records_lineages).union(set(unique_by_lineage)))
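
The set logic in the patch above can be checked in isolation. Below is a minimal sketch, not the code in batch.py: the helper name lineages_to_process and the sample data are hypothetical, and plain lists stand in for the rows that the two SELECT DISTINCT queries would return.

```python
def lineages_to_process(new_records_lineages, by_lineage, clusters_lineages):
    """Return the lineages that should be (re)processed.

    Hypothetical helper mirroring the patch: a lineage qualifies if it has
    new records, or if it appears in by_lineage but has no row in the
    clusters table yet.
    """
    # Lineages present in by_lineage but absent from the clusters table
    missing_clusters = set(by_lineage) - set(clusters_lineages)
    # Union with lineages that received new records; sorted for stable output
    return sorted(set(new_records_lineages) | missing_clusters)


# Made-up example: XDT has sequences in by_lineage but no cluster record
by_lineage = {"B.1.1.7": [], "BA.2": [], "XDT": []}
new_records_lineages = ["BA.2"]
clusters_lineages = ["B.1.1.7", "BA.2"]

updated = lineages_to_process(new_records_lineages, by_lineage, clusters_lineages)
print(updated)  # ['BA.2', 'XDT']
```

In this example XDT is picked up even though it has no new records, which is the case that was failing. B.1.1.7 is skipped because it already has a cluster record and nothing new.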