Closed: nick-youngblut closed this issue 3 years ago
Hi Nick,
Yes, at the moment `inStrain compare` does not scale well at all. I am working on a greedy clustering implementation that should speed things up considerably, and I will let you know when it's ready.
That's also a great point about the ability to input a list of inStrain objects- I will implement this as well.
Thanks, Matt
Compare in inStrain version 1.4 is orders of magnitude faster than previous versions. Most comparisons now take less than a minute instead of hours.
Hope this helps and please don't hesitate to reach out with other issues.
Which version/commit are you referring to? Do you mean that version 1.4 is faster, or that a subsequent commit/version is faster? At least for version 1.4, `compare` on 100s of 95% dRep genomes and a few hundred samples would take months, based on my experience.
Yeah, 1.4 has the faster algorithm. The number of samples is the big factor, since the comparisons are pairwise. I have done up to 120 samples, which took a few days, but doing more than that will take a lot longer.
I do have plans to implement a greedy algorithm that avoids pairwise comparisons, to work with larger numbers of samples, but it will not be public for a while.
-Matt
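The quadratic growth Matt describes can be made concrete with a quick back-of-the-envelope calculation (this snippet is illustrative only, not part of inStrain):

```python
def n_pairwise(n: int) -> int:
    """Number of unordered sample pairs for n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# 120 samples (Matt's largest run so far) vs. 360 samples (the run reported below)
print(n_pairwise(120))  # 7140 pairwise comparisons
print(n_pairwise(360))  # 64620 pairwise comparisons -- ~9x the work for 3x the samples
```

Tripling the number of samples roughly multiplies the runtime by nine, which is why a greedy (non-all-pairs) strategy is attractive.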
On Feb 5, 2021, at 10:16 PM, Nick Youngblut notifications@github.com wrote:

> Which version/commit are you referring to? Do you mean that version 1.4 is faster or a subsequent commit/version is faster? At least for version 1.4, compare on 100's of 95% drep genomes and a few hundred samples would take months, based on my experience.
It appears that the groups are processed serially:

```python
groups = len(self.scaffold_comparison_groups)
for i, SCgroup in enumerate(self.scaffold_comparison_groups):
    logging.info(f'Running group {i+1} of {groups}')
    SCgroup.load_cache()
    results = inStrain.compare_utils.run_compare_multiprocessing(
        SCgroup.cmd_queue, SCgroup.result_queue,
        self.null_model, num_to_run=len(SCgroup.scaffolds),
        **self.kwargs)
    for result in results:
        if result is not None:
            Cdb, Mdb, pair2mm2covOverlap, scaffold = result
            for item, lis in zip([Cdb, Mdb, pair2mm2covOverlap, scaffold],
                                 [cdbs, mdbs, pair2mm2covOverlaps, order]):
                lis.append(item)
    SCgroup.purge_cache()
```
...so if you allow the user to process each group in a separate command, then the user could easily parallelize this work across multiple cluster jobs. This could be done by printing the number of comparisons to the user in a first round of `inStrain compare`; the user then provides a number for the particular comparison, which would be passed to `self.run_comparisons()`; and the data is compiled in a final run of `inStrain compare` in which the user provides all pickled comparison output files.
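The external batching suggested above could look something like the following sketch. The `batch_pairs` helper is hypothetical (it is not part of inStrain); it simply splits the full set of pairwise comparisons into roughly equal chunks, one per cluster job:

```python
import itertools

def batch_pairs(samples, n_batches):
    """Split all unordered sample pairs into n_batches roughly equal
    batches, e.g. one batch per cluster job. Hypothetical helper,
    not part of inStrain."""
    pairs = list(itertools.combinations(samples, 2))
    # round-robin assignment keeps batch sizes within 1 of each other
    return [pairs[i::n_batches] for i in range(n_batches)]

samples = [f"sample{i}" for i in range(6)]  # 6 samples -> 15 pairs
batches = batch_pairs(samples, 4)
```

Each batch could then be written to disk and handed to a separate job, with a final compilation step merging the per-batch results.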
Alternatively, you could integrate gridmap (or similar) for distributed parallelization.
Discussion continued at https://github.com/MrOlm/inStrain/pull/39
Besides the greedy clustering algorithm (how's the progress?), one method to speed up `inStrain compare` is to allow the user to set which pairwise comparisons to make. This would help in situations where the user wants inter-group comparisons (e.g., parent to child), while intra-group comparisons are a lower priority (e.g., child-child or parent-parent). It would also allow the user to batch the comparisons (e.g., 1/4 of all comparisons in job 1, another 1/4 in job 2, etc.).
Progress is slow- my priority number 1 right now is working on stability fixes, and after that I'll work on feature additions like greedy clustering.
Adding the ability to specify which comparisons to make is an idea I've bounced around as well. My main hesitation is not wanting to make the help message too messy with lots of different options, and I also haven't come up with an input framework that I like and that would be applicable to the most people possible. I suppose a table of "sample1", "sample2", and "genome" could be a good input format, though? It would also make it easier to batch the comparisons. Is a format like that what you have in mind?
Thanks for the quick response! A table of `sampleX<tab>sampleY<tab>genome` sounds like a good solution.
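For illustration, parsing the table format proposed above could be as simple as the following. Note that this format is only being discussed in this thread, so the function name and layout are hypothetical:

```python
import csv
import os
import tempfile

def read_comparison_table(path):
    """Parse a sample1<TAB>sample2<TAB>genome table into a list of
    (sample1, sample2, genome) tuples. The format is hypothetical --
    it is only a proposal at this point."""
    with open(path, newline="") as fh:
        return [tuple(row[:3]) for row in csv.reader(fh, delimiter="\t") if row]

# demo: write a small table to a temp file and read it back
rows = [("sampleA", "sampleB", "genome1"), ("sampleA", "sampleC", "genome1")]
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows(rows)
    path = fh.name
parsed = read_comparison_table(path)
os.remove(path)
```

Each row names one comparison to run, so batching falls out naturally: split the table into chunks and submit one chunk per job.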
OK- I will work on this and post back here when ready
For those interested in the scaling rate, here's an `inStrain compare` job with one reference genome and 360 metagenomes (~5 million reads per sample):
The scaffold-to-bin mapping was made using an .stb file.
```
***************************************************
..:: inStrain compare Step 1. Load data ::..
***************************************************
Loading Profiles into RAM: 100%|██████████| 360/360 [09:34<00:00, 1.60s/it]
44 of 44 scaffolds are in at least 2 samples
***************************************************
..:: inStrain compare Step 2. Run comparisons ::..
***************************************************
Running group 1 of 1
Comparing scaffolds: 98%|█████████▊| 43/44 [209:14:28<19:57:51, 71871.21s/it]
```
The job has been running for ~10 days. The resource usage:

```
usage 1: cpu=1289:16:35, mem=91857760.30032 GB s, io=571.36228 GB, vmem=293.730G, maxvmem=300.865G
```
inStrain version: 1.5.1
I'm running `inStrain compare` (v1.2.8) with the following params:

```
inStrain compare --min_cov 5 --min_freq 0.05 --fdr 1e-06 -p 8 -o $OUTDIR -i $INDIRS
```

with ~40 genome references and ~280 samples (~1 million reads per sample). According to the progress report for the command, each of the 8736 iterations is taking ~5.5 hours. More threads will help a bit, but I'm guessing there will still be a huge time requirement. Is there a good way to substantially increase the speed, or does `inStrain compare` just not scale well to many samples?

Also, requiring one to list all paths to inStrain objects as the input can lead to total command-length limitations, at least on some operating systems. It would be great to be able to provide a file that lists all objects instead of actually listing them all in the command.
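A file-based input like the one requested above could be parsed with a few lines of Python. This is only a sketch of the idea (inStrain does not currently accept such a file; the helper and file layout are assumptions):

```python
import os
import tempfile

def read_profile_list(path):
    """Read one inStrain profile path per line, ignoring blank lines
    and '#' comments. Sketch of the requested file-based input --
    not an existing inStrain feature."""
    with open(path) as fh:
        return [line.strip() for line in fh
                if line.strip() and not line.lstrip().startswith("#")]

# demo: write a small profile list and read it back
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("# profiles for the compare run\n"
             "/data/sample1_profile\n"
             "\n"
             "/data/sample2_profile\n")
    path = fh.name
profiles = read_profile_list(path)
os.remove(path)
```

This sidesteps the OS limit on total command length (`ARG_MAX` on Linux), which an `-i` flag with hundreds of long paths can hit.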