PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Irregular formatting in lineages.csv file is causing the pipeline to crash #416

Status: Closed (ArtPoon closed 1 year ago)

ArtPoon commented 1 year ago

No jobs currently running on the cluster.

[covizu@BEVi ~]$ ls -lt
total 128468
-rw-rw-r--  1 covizu covizu 100834723 Aug 26 04:47 batch.log
...
[covizu@BEVi ~]$ tail batch.log
🏄 [4:45:19.977236] aligned 7210000 records
🏄 [4:45:44.887383] aligned 7220000 records
🏄 [4:46:12.302827] aligned 7230000 records
🏄 [4:46:37.618633] aligned 7240000 records
🏄 [4:47:05.719435] aligned 7250000 records
🏄 [4:47:30.212193] aligned 7260000 records
🏄 [4:47:48.093177] filtered 10928859 problematic features
🏄 [4:47:48.093250]          4290588 genomes with excess missing sites
🏄 [4:47:48.093260]          657490 genomes with excess divergence
🏄 [4:47:48.093449] Parsing Pango lineage designations

Seems to have stalled on parsing PANGO lineages?

GopiGugan commented 1 year ago

Could this be related to recent updates to the lineages.csv file?

Failed August 23 run:

Entering 'covizu/data/pango-designation'
Updating 2f12f2f..fcad365
Fast-forward
 deduplicate_keeping_last.py      |  38 ++
 lineage_notes.txt                |  15 +
 lineages.csv                     | 737 ++++++++++++++++++++++++++++++++++++---
 pango_designation/__init__.py    |   2 +-
 pango_designation/alias_key.json |   5 +-
 5 files changed, 748 insertions(+), 49 deletions(-)

Failed August 26 run:

Entering 'covizu/data/pango-designation'
Updating fcad365..3c27f23
Fast-forward
 lineage_notes.txt                |   6 +
 lineages.csv                     | 279 +++++++++++++++++++++++++++++++++++++--
 pango_designation/alias_key.json |   4 +-

ArtPoon commented 1 year ago

Maybe related to this commit? https://github.com/cov-lineages/pango-designation/commit/aa9e72bc76df05b1bdab28a94ba9d1c9c6bd3547 (blank line in middle of file)

ArtPoon commented 1 year ago

And a similar error was patched on Aug 23: https://github.com/cov-lineages/pango-designation/commit/60d1cc75a27ebe5d92435762afe4ffddae3ad68b

ArtPoon commented 1 year ago

We might need to write something to catch these edge cases

ArtPoon commented 1 year ago

Current update got past this step, so we should be ok

ArtPoon commented 1 year ago

Runs crashed again. Let's not close this until we can catch this edge case.

ArtPoon commented 1 year ago

Screen lineages.csv file for empty lines and special characters
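
A rough sketch of what such a screen could look like, assuming each valid row is exactly `taxon,lineage`. The function name `screen_lineages` and the `ALLOWED` character whitelist are hypothetical and not part of covizu or pango-designation; the whitelist in particular would need tuning against real taxon names.

```python
import string

# Characters expected in a "taxon,lineage" row. This whitelist is an
# assumption for illustration, not something defined by pango-designation.
ALLOWED = set(string.ascii_letters + string.digits + "_-./|:,")


def screen_lineages(lines):
    """Return (line_number, reason) tuples for rows that look malformed."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            problems.append((lineno, "empty line"))
            continue
        fields = line.split(',')
        if len(fields) != 2 or not all(fields):
            problems.append(
                (lineno, "expected two non-empty comma-separated fields"))
            continue
        # Flag anything outside the whitelist (e.g. non-breaking spaces)
        unexpected = sorted(set(line) - ALLOWED)
        if unexpected:
            problems.append(
                (lineno, "unexpected characters: {!r}".format(unexpected)))
    return problems
```

Something like this could run right after the submodule update, e.g. `screen_lineages(open(path))`, and log the flagged line numbers before the pipeline reaches the parsing step.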

GopiGugan commented 1 year ago

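The line "strain userOrOld date Nextstrain_clade pango_lineage genbank_accession country Nextstrain_clade_usher pango_lineage_usher accession" in lineages.csv is causing this issue. As shown, it contains no comma, so unpacking line.strip().split(',') into taxon and lineage fails.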

diff --git a/covizu/treetime.py b/covizu/treetime.py
index 7e0c7e1..9a83521 100644
--- a/covizu/treetime.py
+++ b/covizu/treetime.py
@@ -287,8 +287,14 @@ if __name__ == '__main__':
         sys.exit()
     lineages = {}
     for line in handle:
-        taxon, lineage = line.strip().split(',')
-        lineages.update({taxon: lineage})
+        try:
+            taxon, lineage = line.strip().split(',')
+            if taxon and lineage:
+                lineages.update({taxon: lineage})
+            else:
+                cb.callback("Warning '{}': taxon or lineage is missing".format(line), level='WARN')
+        except:
+            cb.callback("Warning: There is an issue with the line '{}' in lineages.csv".format(line), level='WARN')

     cb.callback("Identifying lineage representative genomes")
     fasta = retrieve_genomes(by_lineage, known_seqs=lineages, ref_file=args.ref, earliest=args.earliest,
diff --git a/covizu/utils/batch_utils.py b/covizu/utils/batch_utils.py
index 6f54e08..27de99d 100644
--- a/covizu/utils/batch_utils.py
+++ b/covizu/utils/batch_utils.py
@@ -51,8 +51,16 @@ def build_timetree(by_lineage, args, callback=None):
         sys.exit()
     lineages = {}
     for line in handle:
-        taxon, lineage = line.strip().split(',')
-        lineages.update({taxon: lineage})
+        try:
+            taxon, lineage = line.strip().split(',')
+            if taxon and lineage:
+                lineages.update({taxon: lineage})
+            else:
+                if callback:
+                    callback("Warning '{}': taxon or lineage is missing".format(line), level='WARN')
+        except:
+            if callback:
+                callback("Warning: There is an issue with the line '{}' in lineages.csv".format(line), level='WARN')

     if callback:
         callback("Identifying lineage representative genomes")
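
Since the same loop is patched in both files, one option would be to factor it into a shared helper. The sketch below is only an illustration: `parse_lineages` is a hypothetical name, it assumes any header row has already been consumed from `handle`, and it assumes the callback accepts a `level` keyword as in the patch above. It catches only `ValueError` (what the two-value unpacking raises on a wrong field count) so that unrelated errors are not hidden by a bare `except`.

```python
def parse_lineages(handle, callback=None):
    """Build a {taxon: lineage} dict, skipping malformed rows with a warning.

    Hypothetical helper: assumes the header row (if any) was already read
    from `handle` and that `callback` accepts a `level` keyword.
    """
    lineages = {}
    for lineno, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped silently
        try:
            taxon, lineage = line.split(',')
        except ValueError:
            # Raised when the row does not split into exactly two fields
            if callback:
                callback("Warning: line {} of lineages.csv is malformed: "
                         "'{}'".format(lineno, line), level='WARN')
            continue
        if taxon and lineage:
            lineages[taxon] = lineage
        elif callback:
            callback("Warning: line {} '{}': taxon or lineage is "
                     "missing".format(lineno, line), level='WARN')
    return lineages
```

Both call sites could then reduce to `lineages = parse_lineages(handle, callback)`, with `cb.callback` passed in from treetime.py.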