PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Memory consumption by `by_lineage` dict #387

Closed · ArtPoon closed this 2 years ago

ArtPoon commented 2 years ago

While analyzing #385, I realized that the by_lineage object that we generate in batch.py is getting enormous: https://github.com/PoonLab/covizu/blob/660c8ebda417b1a0a311aef4586b0705b1594a6d/batch.py#L147

It got big enough to swamp the 32GB of RAM on my workstation. I'm not sure how we're going to solve this, because we need all of the data contained in this dictionary for subsequent steps in the pipeline. (We even used to write this dict out as a JSON file.) Anyhow, something to keep an eye on.

ArtPoon commented 2 years ago

Maybe we could bump up the compression of identical sequences into variants? The sequences are being converted into their compact representation of mutations and missing sites by process_feed anyhow.
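
For illustration, a compressed record might look something like this (field names here are hypothetical; the (type, position, alt) shape of the diffs is inferred from the tracebacks further down this thread):

```python
# Hypothetical example of one compressed genome record (not actual repo code):
record = {
    'covv_virus_name': 'hCoV-19/example/1',   # assumed metadata fields
    'covv_collection_date': '2022-03-01',
    # differences from the reference genome as (type, position, alt) tuples,
    # where '+' marks an insertion, '-' a deletion and '~' a substitution
    'diffs': [('~', 240, 'T'), ('-', 11287, 9), ('+', 22204, 'GAGCCAGAA')],
    'missing': [(0, 54), (29836, 29903)],     # uncalled/masked intervals
}
```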

ArtPoon commented 2 years ago

Is there any way to promote this out of batch_utils.py and into the main batch.py workflow? https://github.com/PoonLab/covizu/blob/660c8ebda417b1a0a311aef4586b0705b1594a6d/covizu/utils/batch_utils.py#L108-L122

ArtPoon commented 2 years ago

Alternatively I could just throw more RAM into the head node for now...

ArtPoon commented 2 years ago

I had an idea the other night that we can gather sequences with the same diffs key in a nested dictionary.
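
Something like this, as a rough sketch (illustrative code only, not the actual patch; it assumes records shaped like the example above):

```python
# Illustrative sketch: within each lineage, group records under a hashable
# key derived from their diffs, so identical variants are stored once with
# a list of the genomes that carry them.
def compress_by_variant(by_lineage):
    compressed = {}
    for lineage, records in by_lineage.items():
        variants = {}
        for record in records:
            # serialize the feature list into a hashable dictionary key
            key = ','.join('|'.join(map(str, diff)) for diff in record['diffs'])
            # keep only per-genome metadata; the diffs are implied by the key
            variants.setdefault(key, []).append(
                (record['covv_virus_name'], record['covv_collection_date'])
            )
        compressed[lineage] = variants
    return compressed
```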

ArtPoon commented 2 years ago

art@orolo:~/git/covizu$ python3 covizu/utils/gisaid_utils.py --infile data/provision.2022-03-15T00\:00\:13.json.xz --debug 10000 --vcf covizu/data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf temp2.json
🏄 [0:00:00.000007] Processing GISAID feed data
🏄 [0:00:00.800940] aligned 0 records
🏄 [0:00:13.653183] filtered 1019 problematic features
🏄 [0:00:13.653229]          1516 genomes with excess missing sites
🏄 [0:00:13.653237]          539 genomes with excess divergence
art@orolo:~/git/covizu$ ls -lt temp*.json
-rw-rw-r-- 1 art art 3526578 Mar 23 10:29 temp2.json
-rw-rw-r-- 1 art art 5832136 Mar 23 10:13 temp.json

Output for 10,000 records is roughly 60% of the previous size (3.5 MB vs. 5.8 MB).

ArtPoon commented 2 years ago

Using the memory size estimator described in this blog post, we have the following results:

| limit | old (bytes) | old (compute time) | new (bytes) | new (compute time) |
|-------|-------------|--------------------|-------------|--------------------|
| 1000 | 3822984 | 1.486280 | 917308 | 1.499731 |
| 10000 | 30135474 | 13.422491 | 8778758 | 13.487156 |
| 50000 | 146433692 | 1:06.123691 | 39913945 | 1:06.697020 |
| 100000 | 311645394 | 2:14.006433 | 80385762 | 2:13.576230 |
| 200000 | 637039239 | 3:36.692794* | 159228509 | 3:34.614973* |
| 500000 | 1666955993 | 9:15.649669* | 365465623 | 8:56.101607* |
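
For reference, estimators of this kind typically recurse through containers and sum sys.getsizeof over distinct objects; a generic sketch (the blog post's exact version may differ):

```python
import sys

def total_size(obj, seen=None):
    """Recursively estimate the memory footprint of obj in bytes,
    counting each distinct object only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size
```
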
ArtPoon commented 2 years ago

OK, just need to modify the downstream functions to take the new by_lineage data structure

ArtPoon commented 2 years ago

@GopiGugan can you please run a test on a small xz file (say, half a million genomes) using both dev and iss387 branches, and:

  1. record memory usage
  2. record computing time
  3. confirm that the outputs (clusters.json, etc.) are identical?
ArtPoon commented 2 years ago

@GopiGugan - I realized the other day that I forgot to cast pos to an integer in unpack_records; just pushed a fix to iss387

GopiGugan commented 2 years ago
Traceback (most recent call last):
  File "batch.py", line 10, in <module>
    from covizu.utils.batch_utils import *
  File "/home/covizu/covizu/covizu/utils/batch_utils.py", line 3, in <module>
    from covizu import clustering, treetime, beadplot
  File "/home/covizu/covizu/covizu/treetime.py", line 16, in <module>
    from covizu.utils.batch_utils import unpack_records
ImportError: cannot import name 'unpack_records'

Fixed the circular import (batch_utils imports treetime, which in turn imported unpack_records back from batch_utils) by deferring the lookup to call time:

diff --git a/covizu/treetime.py b/covizu/treetime.py
index 65d9ddd..786481e 100644
--- a/covizu/treetime.py
+++ b/covizu/treetime.py
@@ -13,7 +13,6 @@ from Bio import Phylo
 import covizu
 from covizu.utils.seq_utils import *
 from covizu.utils.progress_utils import Callback
-from covizu.utils.batch_utils import unpack_records

 def fasttree(fasta, binpath='fasttree2', seed=1, gtr=True, collapse=True):
@@ -192,7 +191,7 @@ def retrieve_genomes(by_lineage, known_seqs, ref_file, earliest=True, callback=N

     # retrieve unaligned genomes from database
     for lineage, records in by_lineage.items():
-        records = unpack_records(records)
+        records = covizu.utils.batch_utils.unpack_records(records)

         # filter records for lineage-defining genomes
         curated = filter(
GopiGugan commented 2 years ago

Ran into another issue:

Traceback (most recent call last):
  File "batch.py", line 150, in <module>
    timetree, residuals = build_timetree(by_lineage, args, cb.callback)
  File "/home/covizu/covizu/covizu/utils/batch_utils.py", line 59, in build_timetree
    earliest=True)
  File "/home/covizu/covizu/covizu/treetime.py", line 195, in retrieve_genomes
    records = covizu.utils.batch_utils.unpack_records(records)
  File "/home/covizu/covizu/covizu/utils/batch_utils.py", line 31, in unpack_records
    alt = int(alt)  # number of nucleotides in indel
ValueError: invalid literal for int() with base 10: 'TTT'
ArtPoon commented 2 years ago

Try replacing `if typ == '+' or typ == '-':` with `if typ == '-':`. For insertions, alt holds the inserted nucleotide sequence itself (e.g. 'TTT'), so only deletions encode alt as an integer length.
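
A sketch of the corrected logic in context (paraphrased, not the literal repo code):

```python
def normalize_diff(typ, pos, alt):
    """Paraphrased sketch of the corrected unpack_records handling."""
    pos = int(pos)       # genome coordinate (the earlier iss387 fix)
    if typ == '-':
        alt = int(alt)   # deletions: alt is the number of deleted nucleotides
    # insertions ('+') keep alt as the inserted sequence (e.g. 'TTT');
    # substitutions ('~') keep alt as the replacement base
    return typ, pos, alt
```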

GopiGugan commented 2 years ago

500,000 genomes:

1) Memory Usage:

Used memory-profiler to record memory usage (one possible harness is sketched at the end of this comment).

For the dev branch: [memory-profiler plot: dev_branch]

For the iss387 branch: [memory-profiler plot: iss387_branch]

Need to look into why the profiler shows so little difference in memory usage between the two branches.

2) Running time

dev branch: 3h 57m
iss387 branch: 6h 8m
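
For context, a minimal sketch of the kind of harness that could produce the memory traces above with memory-profiler (hypothetical; the actual invocation is not shown in this thread):

```python
from memory_profiler import memory_usage

def run_batch():
    # stand-in for invoking the batch.py pipeline on the test file
    pass

# sample resident memory (in MiB) every 10 seconds while run_batch() executes
trace = memory_usage((run_batch, (), {}), interval=10)
print(f"peak memory: {max(trace):.1f} MiB across {len(trace)} samples")
```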

ArtPoon commented 2 years ago

Could it be because the phase where by_lineage is generated occupies only the first ~2000 seconds of the run?

ArtPoon commented 2 years ago

I think another explanation for the discrepancy is that I was only measuring the memory size of the by_lineage dictionary itself, whereas memory-profiler measures the memory allocated by the entire Python process.
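
The two numbers answer different questions; roughly (illustrative snippet, reusing the total_size sketch from above):

```python
import resource

# deep size of the data structure alone, in bytes
# (by_lineage and total_size as in the earlier sketches)
dict_bytes = total_size(by_lineage)

# peak resident set size of the whole Python process (kilobytes on Linux),
# which also counts the interpreter, imported modules and other live objects
process_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```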

ArtPoon commented 2 years ago

@GopiGugan can you please run another experiment with 1M genomes instead of 500K and capture the same statistics?

ArtPoon commented 2 years ago

Based on the outputs from @GopiGugan, the results are the same.

GopiGugan commented 2 years ago

1M genomes:

  1. Memory Usage:

From the dev branch: [memory-profiler plot: dev]

From the iss387 branch: [memory-profiler plot: iss387]

  2. Running Time:

dev branch: 5h 42min
iss387 branch: 5h 47min

ArtPoon commented 2 years ago

I think at this point it is safe to merge iss387 into dev. Thanks for testing, @GopiGugan!