ArtPoon closed this issue 2 years ago.
Maybe we could bump up the compression of identical sequences into variants? The sequences are being converted into their compact representation of mutations and missing sites by `process_feed` anyhow. Is there any way to promote this out of `batch_utils.py` and into the main `batch.py` workflow?
https://github.com/PoonLab/covizu/blob/660c8ebda417b1a0a311aef4586b0705b1594a6d/covizu/utils/batch_utils.py#L108-L122
Alternatively I could just throw more RAM into the head node for now...
I had an idea the other night that we can gather sequences with the same `diffs` key in a nested dictionary.
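A minimal sketch of that idea, assuming a hypothetical record layout (the `name`, `date`, and `diffs` fields and the string encoding of features are made up for illustration, not covizu's actual representation): sequences sharing an identical `diffs` set collapse into one entry, so the shared feature list is stored once.

```python
def compress_by_variant(records):
    """Group records sharing the same diffs under one key of a nested dict,
    so each unique variant stores its feature list once."""
    variants = {}
    for rec in records:
        # diffs must be hashable to serve as a dict key, so use a tuple
        key = tuple(rec['diffs'])
        if key not in variants:
            variants[key] = {'diffs': rec['diffs'], 'samples': []}
        # store only per-sample metadata; the shared diffs appear once
        variants[key]['samples'].append((rec['name'], rec['date']))
    return variants


records = [
    {'name': 'A', 'date': '2022-03-01', 'diffs': ['~|210|G', '-|11288|9']},
    {'name': 'B', 'date': '2022-03-02', 'diffs': ['~|210|G', '-|11288|9']},
    {'name': 'C', 'date': '2022-03-03', 'diffs': ['~|241|T']},
]
compressed = compress_by_variant(records)
```

With two of the three records sharing the same diffs, the compressed dict has two keys instead of three full records.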
`iss387`: modified `gisaid_utils.py` to run locally without database access credentials, and to take a `--debug` option that limits the number of records to process from the input xz file:
```
art@orolo:~/git/covizu$ python3 covizu/utils/gisaid_utils.py --infile data/provision.2022-03-15T00\:00\:13.json.xz --debug 10000 --vcf covizu/data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf temp.json
🏄 [0:00:00.000007] Processing GISAID feed data
🏄 [0:00:00.805388] aligned 0 records
🏄 [0:00:13.672814] filtered 1019 problematic features
🏄 [0:00:13.672863] 1516 genomes with excess missing sites
🏄 [0:00:13.672871] 539 genomes with excess divergence
art@orolo:~/git/covizu$ python3 covizu/utils/gisaid_utils.py --infile data/provision.2022-03-15T00\:00\:13.json.xz --debug 10000 --vcf covizu/data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf temp2.json
🏄 [0:00:00.000007] Processing GISAID feed data
🏄 [0:00:00.800940] aligned 0 records
🏄 [0:00:13.653183] filtered 1019 problematic features
🏄 [0:00:13.653229] 1516 genomes with excess missing sites
🏄 [0:00:13.653237] 539 genomes with excess divergence
art@orolo:~/git/covizu$ ls -lt temp*.json
-rw-rw-r-- 1 art art 3526578 Mar 23 10:29 temp2.json
-rw-rw-r-- 1 art art 5832136 Mar 23 10:13 temp.json
```
Output for 10,000 records is about half the size.
Using the memory size estimator described in this blog post, we have the following results:
| limit | old (bytes) | old (compute time) | new (bytes) | new (compute time) |
|---|---|---|---|---|
| 1000 | 3822984 | 1.486280 | 917308 | 1.499731 |
| 10000 | 30135474 | 13.422491 | 8778758 | 13.487156 |
| 50000 | 146433692 | 1:06.123691 | 39913945 | 1:06.697020 |
| 100000 | 311645394 | 2:14.006433 | 80385762 | 2:13.576230 |
| 200000 | 637039239 | 3:36.692794* | 159228509 | 3:34.614973* |
| 500000 | 1666955993 | 9:15.649669* | 365465623 | 8:56.101607* |
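For reference, a common recursive approach to this kind of byte count (a sketch only, since the blog post's link was not preserved and its exact code may differ) is to walk containers and sum `sys.getsizeof`, tracking object ids so shared objects are counted once:

```python
import sys

def total_size(obj, seen=None):
    """Rough recursive memory estimate: sum sys.getsizeof over an object
    and its container contents, counting each object id only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size
```

Note this only approximates real usage, since `sys.getsizeof` excludes memory shared through interning and ignores allocator overhead.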
Ok, just need to modify downstream functions to take the new `by_lineage` data structure.
@GopiGugan can you please run a test on a small xz file (say, half a million genomes) using both `dev` and `iss387` branches, and compare the outputs (`clusters.json`, etc.)?

@GopiGugan - I realized the other day that I forgot to cast `pos` as an integer in `unpack_records`, just pushed a fix to `iss387`.
```
Traceback (most recent call last):
  File "batch.py", line 10, in <module>
    from covizu.utils.batch_utils import *
  File "/home/covizu/covizu/covizu/utils/batch_utils.py", line 3, in <module>
    from covizu import clustering, treetime, beadplot
  File "/home/covizu/covizu/covizu/treetime.py", line 16, in <module>
    from covizu.utils.batch_utils import unpack_records
ImportError: cannot import name 'unpack_records'
```
Circular import fixed by doing the following:
```diff
diff --git a/covizu/treetime.py b/covizu/treetime.py
index 65d9ddd..786481e 100644
--- a/covizu/treetime.py
+++ b/covizu/treetime.py
@@ -13,7 +13,6 @@ from Bio import Phylo
 import covizu
 from covizu.utils.seq_utils import *
 from covizu.utils.progress_utils import Callback
-from covizu.utils.batch_utils import unpack_records

 def fasttree(fasta, binpath='fasttree2', seed=1, gtr=True, collapse=True):
@@ -192,7 +191,7 @@ def retrieve_genomes(by_lineage, known_seqs, ref_file, earliest=True, callback=N
     # retrieve unaligned genomes from database
     for lineage, records in by_lineage.items():
-        records = unpack_records(records)
+        records = covizu.utils.batch_utils.unpack_records(records)

         # filter records for lineage-defining genomes
         curated = filter(
```
Ran into another issue:
```
Traceback (most recent call last):
  File "batch.py", line 150, in <module>
    timetree, residuals = build_timetree(by_lineage, args, cb.callback)
  File "/home/covizu/covizu/covizu/utils/batch_utils.py", line 59, in build_timetree
    earliest=True)
  File "/home/covizu/covizu/covizu/treetime.py", line 195, in retrieve_genomes
    records = covizu.utils.batch_utils.unpack_records(records)
  File "/home/covizu/covizu/covizu/utils/batch_utils.py", line 31, in unpack_records
    alt = int(alt)  # number of nucleotides in indel
ValueError: invalid literal for int() with base 10: 'TTT'
```
Try replacing `if typ == '+' or typ == '-':` with `if typ == '-':`
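The traceback makes the reason clear: for an insertion (`+`) the `alt` field carries the inserted nucleotides themselves (e.g. `'TTT'`), while a deletion (`-`) stores a length, so only deletions should be cast to int. A standalone sketch of the corrected parsing (the helper name and `(type, pos, alt)` feature layout are assumptions for illustration, not the actual `unpack_records` code):

```python
def unpack_feature(feature):
    """Parse a (type, position, alt) mutation feature.

    Deletions ('-') encode alt as the number of deleted nucleotides,
    so it is safe to cast to int; insertions ('+') carry the inserted
    nucleotide string, which would raise ValueError under int().
    """
    typ, pos, alt = feature
    pos = int(pos)       # genome coordinate is always numeric
    if typ == '-':       # was: if typ == '+' or typ == '-'
        alt = int(alt)   # deletion length
    return typ, pos, alt
```

With this change an insertion feature like `('+', '11288', 'TTT')` passes through untouched instead of raising `ValueError`.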
500,000 genomes:

1) Memory usage: used memory-profiler to record memory usage.

For the `dev` branch: *(memory-profiler plot not reproduced here)*

For the `iss387` branch: *(memory-profiler plot not reproduced here)*

Need to look into why the profiler is showing that there isn't a huge difference in memory usage.
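One thing to keep in mind is that memory-profiler samples the resident memory of the whole Python process, so a saving in one data structure can be masked by allocations elsewhere in the pipeline. To attribute bytes to Python objects specifically, the stdlib `tracemalloc` module is an alternative; a minimal sketch (the `by_lineage` stand-in below is fabricated for illustration):

```python
import tracemalloc

tracemalloc.start()

# stand-in for building a large nested structure like by_lineage
by_lineage = {f'lineage{i}': [('~', j, 'G') for j in range(100)]
              for i in range(100)}

# bytes currently allocated, and the peak since start()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'current={current} bytes, peak={peak} bytes')
```

Unlike a whole-process profile, this only counts allocations made by the Python allocator between `start()` and the snapshot.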
2) Running time:
- `dev` branch: 3h 57m
- `iss387` branch: 6h 8m
Could it be because the phase where `by_lineage` is being generated is only in the first 2000 seconds?
I think that another explanation for the discrepancy is that I was only measuring the memory size of the `by_lineage` dictionary, whereas this is measuring the entire memory allocation associated with Python.
@GopiGugan can you please run another experiment with 1M genomes instead of 500K and capture the same statistics?
Based on outputs from @GopiGugan, the results are the same.

1M genomes:

From the `dev` branch: *(memory-profiler plot not reproduced here)*

From the `iss387` branch: *(memory-profiler plot not reproduced here)*

Running time:
- `dev` branch: 5h 42min
- `iss387` branch: 5h 47min
I think at this point it is safe to merge `iss387` into `dev`, thanks for testing @GopiGugan
While analyzing #385, I realized that the `by_lineage` object that we generate in `batch.py` is getting enormous:

https://github.com/PoonLab/covizu/blob/660c8ebda417b1a0a311aef4586b0705b1594a6d/batch.py#L147

meaning that it swamped the 32GB RAM on my workstation. I'm not sure how we're going to solve this, because we need all the data contained in this dictionary for subsequent steps in the pipeline. We even used to write out this dict as a JSON file. Anyhow, something to keep an eye on.