MrOlm / inStrain

Bioinformatics program inStrain
MIT License

inStrain compare not scaling #14

Closed nick-youngblut closed 3 years ago

nick-youngblut commented 4 years ago

I'm running instrain compare (v1.2.8) with the following params: inStrain compare --min_cov 5 --min_freq 0.05 --fdr 1e-06 -p 8 -o $OUTDIR -i $INDIRS, with ~40 genome references and ~280 samples (~1 mil reads per sample). According to the progress report for the command, each of the 8736 iterations is taking ~5.5 hours. More threads will help a bit, but I'm guessing that there still will be a huge time requirement. Is there a good way to substantially increase the speed, or does instrain compare just not scale well to many samples?

Also, requiring one to list all paths to inStrain objects as the input can exceed the total command length limit, at least on some operating systems. It would be great to be able to provide a file that lists all objects instead of listing them all in the command.
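A minimal sketch of the file-of-paths idea, assuming a hypothetical rule that a single `-i` argument pointing at a regular file is read as a list of profile paths, one per line. The function names and the rule are illustrative only; this is not inStrain's actual interface.

```python
import argparse
from pathlib import Path


def expand_inputs(values):
    """Return profile paths, reading them from a list file when one is given.

    Hypothetical rule: one argument that is a regular file is treated as a
    text file with one profile path per line. Anything else is taken as the
    profile paths themselves.
    """
    if len(values) == 1 and Path(values[0]).is_file():
        lines = Path(values[0]).read_text().splitlines()
        # Skip blank lines and '#' comment lines.
        return [ln.strip() for ln in lines
                if ln.strip() and not ln.lstrip().startswith("#")]
    return list(values)


def parse_args(argv=None):
    # Illustrative parser; only the -i option is modelled here.
    parser = argparse.ArgumentParser(description="input-list sketch")
    parser.add_argument("-i", "--input", nargs="+", required=True)
    args = parser.parse_args(argv)
    args.input = expand_inputs(args.input)
    return args
```

With this approach the command line carries only the name of the list file, so its length no longer grows with the number of samples.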

MrOlm commented 4 years ago

Hi Nick,

Yes, at the moment inStrain compare does not scale well at all. I am working on a greedy clustering implementation that should speed things up considerably, and will let you know when it's ready.

That's also a great point about the ability to input a list of inStrain objects- I will implement this as well.

Thanks, Matt

MrOlm commented 3 years ago

Compare in inStrain version 1.4 is orders of magnitude faster than previous versions. Most comparisons now take less than a minute instead of hours.

Hope this helps and please don't hesitate to reach out with other issues.

nick-youngblut commented 3 years ago

Which version/commit are you referring to? Do you mean that version 1.4 is faster or a subsequent commit/version is faster? At least for version 1.4, compare on 100's of 95% drep genomes and a few hundred samples would take months, based on my experience.

MrOlm commented 3 years ago

Yeah, 1.4 has the faster algorithm. The number of samples is the big factor since the comparisons are pairwise. I have done up to 120 samples, which took a few days, but doing more than that will take a lot longer.

I do have plans to implement a greedy algorithm that can avoid pairwise comparisons to work with larger numbers of samples, but it will not be public for a while.

-Matt
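To make the pairwise scaling concrete, here is a small sketch that counts sample pairs for the sample numbers mentioned in this thread. It assumes every sample is compared with every other sample, which is an upper bound for any one scaffold; `n_pairwise` is an illustrative name.

```python
from math import comb


def n_pairwise(n_samples: int) -> int:
    """Number of unordered sample pairs: n * (n - 1) / 2."""
    return comb(n_samples, 2)


# Sample counts quoted in the thread.
for n in (120, 280, 360):
    print(f"{n} samples -> {n_pairwise(n)} pairs")
```

Going from 120 to 360 samples triples the sample count but raises the number of pairs from 7140 to 64620, roughly a ninefold increase.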


nick-youngblut commented 3 years ago

It appears that the groups are processed serially:

        groups = len(self.scaffold_comparison_groups)
        for i, SCgroup in enumerate(self.scaffold_comparison_groups):
            logging.info(f'Running group {i+1} of {groups}')
            SCgroup.load_cache()
            results = inStrain.compare_utils.run_compare_multiprocessing(SCgroup.cmd_queue, SCgroup.result_queue,
                                                                         self.null_model, num_to_run=len(SCgroup.scaffolds),
                                                                         **self.kwargs)
            for result in results:
                if result is not None:
                    Cdb, Mdb, pair2mm2covOverlap, scaffold = result
                    for item, lis in zip([Cdb, Mdb, pair2mm2covOverlap, scaffold], [cdbs, mdbs, pair2mm2covOverlaps, order]):
                        lis.append(item)

            SCgroup.purge_cache()

...so if you allowed the user to process each group in a separate command, the user could easily parallelize this work across multiple cluster jobs. This could be done in three steps: a first round of inStrain compare prints out the number of comparison groups; the user then runs one job per group, providing the group number, which is passed through to self.run_comparisons(); and a final run of inStrain compare compiles the data from all of the pickled comparison output files, which the user provides.

Alternatively, you could integrate gridmap (or similar) for distributed parallelization.
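A rough sketch of the per-group workflow proposed above, assuming each job pickles its results as a list and a final step merges them. All function names and the `group_<i>.pkl` file layout are hypothetical and not part of inStrain.

```python
import pickle
from pathlib import Path


def groups_for_job(n_groups, n_jobs, job_index):
    """0-based group indices handled by one job, assigned round-robin."""
    return list(range(job_index, n_groups, n_jobs))


def save_group_result(results, out_dir, group_index):
    """Pickle the list of results for one group (hypothetical file layout)."""
    path = Path(out_dir) / f"group_{group_index}.pkl"
    with open(path, "wb") as fh:
        pickle.dump(results, fh)
    return path


def merge_group_results(out_dir):
    """Concatenate all pickled group results in numeric group order."""
    paths = sorted(Path(out_dir).glob("group_*.pkl"),
                   key=lambda p: int(p.stem.split("_")[1]))
    merged = []
    for path in paths:
        with open(path, "rb") as fh:
            merged.extend(pickle.load(fh))
    return merged
```

Each cluster job would call `groups_for_job` with its own index, run those groups, and write one pickle per group; the merge step only needs the output directory. This only helps when a run has more than one group.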

MrOlm commented 3 years ago

Discussion continued at https://github.com/MrOlm/inStrain/pull/39

nick-youngblut commented 3 years ago

Besides the greedy clustering algorithm (how's the progress?), one way to speed up inStrain compare is to let the user set which pairwise comparisons to make. This would help when the user wants inter-group comparisons (e.g., parent to child), while intra-group comparisons are a lower priority (e.g., child-child or parent-parent). It would also let the user batch the comparisons (e.g., 1/4 of all comparisons in job 1, another 1/4 in job 2, etc.).
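As an illustration of the two ideas above, this sketch builds only the inter-group pairs and then splits them into batches for separate jobs. The sample names and both helper functions are made up for the example; nothing here is an existing inStrain option.

```python
from itertools import product


def inter_group_pairs(group_a, group_b):
    """All (a, b) pairs with one sample from each group; no intra-group pairs."""
    return list(product(group_a, group_b))


def batch(items, n_batches):
    """Split items into n_batches contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


parents = ["P1", "P2"]            # hypothetical sample names
children = ["C1", "C2", "C3"]
pairs = inter_group_pairs(parents, children)
jobs = batch(pairs, 4)            # e.g. one batch per cluster job
```

Here the five samples would give 10 all-vs-all pairs, but only the 6 parent-child pairs are generated, spread over 4 batches.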

MrOlm commented 3 years ago

Progress is slow- my priority number 1 right now is working on stability fixes, and after that I'll work on feature additions like greedy clustering.

Adding the ability to specify specific comparisons to make is an idea I've bounced around as well. My main hesitation is not wanting to make the help message too messy with lots of different options, and I also haven't come up with an input framework that I like and that would be applicable to the most possible people. I suppose a table of "sample1", "sample2", and "genome" could be a good input format though? It would also make it easier to batch the comparisons. Is a format like that what you have in mind?

nick-youngblut commented 3 years ago

Thanks for the quick response! A table of sampleX<tab>sampleY<tab>genome sounds like a good solution.
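A minimal sketch of reading such a table, assuming a headerless tab-separated file with exactly three columns (sample1, sample2, genome) and `#` comment lines. The function name and these format details are assumptions, since the thread does not settle them.

```python
import csv
from collections import defaultdict


def read_comparison_table(path):
    """Map genome -> list of (sample1, sample2) pairs from a 3-column TSV."""
    by_genome = defaultdict(list)
    with open(path, newline="") as fh:
        for line_no, row in enumerate(csv.reader(fh, delimiter="\t"), start=1):
            # Skip blank lines and '#' comment lines.
            if not row or row[0].startswith("#"):
                continue
            if len(row) != 3:
                raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
            sample1, sample2, genome = (field.strip() for field in row)
            by_genome[genome].append((sample1, sample2))
    return dict(by_genome)
```

Grouping by genome keeps the requested pairs for each genome together, and splitting the table file into several smaller files is enough to batch the work across jobs.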

MrOlm commented 3 years ago

OK- I will work on this and post back here when ready

nick-youngblut commented 3 years ago

For those interested in the scaling rate, here's an inStrain compare job with one reference genome and 360 metagenomes (~5 mil reads per sample):

Scaffold to bin was made using .stb file
***************************************************
    ..:: inStrain compare Step 1. Load data ::..
***************************************************

Loading Profiles into RAM: 100%|██████████| 360/360 [09:34<00:00,  1.60s/it]
44 of 44 scaffolds are in at least 2 samples
***************************************************
..:: inStrain compare Step 2. Run comparisons ::..
***************************************************

Running group 1 of 1
Comparing scaffolds:  98%|█████████▊| 43/44 [209:14:28<19:57:51, 71871.21s/it]

The job has been running for ~10 days. The resource usage:

usage         1:            cpu=1289:16:35, mem=91857760.30032 GB s, io=571.36228 GB, vmem=293.730G, maxvmem=300.865G

inStrain version: 1.5.1
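A back-of-the-envelope calculation from the progress line above (43 scaffolds in 209:14:28, 360 samples). It assumes every scaffold is present in all 360 samples, so the per-pair figure is a lower bound on the true average time per comparison.

```python
from math import comb


def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


elapsed = hms_to_seconds("209:14:28")   # wall time from the progress bar
scaffolds_done = 43                     # scaffolds finished so far
pairs_per_scaffold = comb(360, 2)       # assumes all 360 samples share each scaffold

sec_per_scaffold = elapsed / scaffolds_done
sec_per_pair = sec_per_scaffold / pairs_per_scaffold

print(f"{sec_per_scaffold / 3600:.1f} h per scaffold on average")
print(f"{sec_per_pair:.2f} s per sample pair (lower bound)")
```

Under that assumption the run averages roughly 4.9 hours per scaffold, or about 0.27 seconds per sample pair, so the total is dominated by the 64620 pairs per scaffold rather than by any single slow comparison.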