BenKaehler / q2-makarsa

A QIIME 2 plugin to generate and visualise microbial networks.
BSD 3-Clause "New" or "Revised" License

Run time and memory requirements #50

Open LenaFloerl opened 1 year ago

LenaFloerl commented 1 year ago

Hi,

thanks for putting this together! 🚀

I've been trying to apply it to my FeatureTable of 323 samples, ~5000 features, and ~2,900,000 total frequency. When I ran it with 8 cores, it crashed after 3 days, once memory usage reached 60 GB. Do you have any recommendations for the number of cores to use, and maybe an estimate of the corresponding run time and memory requirements?

Thanks a lot! 🙌

Best, Lena

BenKaehler commented 1 year ago

Thanks @LenaFloerl!

I'm not an expert on scaling SpiecEasi. @zdk123, would you be able to offer any advice for a data set of this shape?

@zdk123, please let me know if I need to move Issue #18 up the priority list to handle data sets like this one. (Not all of SpiecEasi's batch processing options have been exposed in q2-makarsa yet.)

In the meantime, one option may be to filter your features, say by minimum frequency or by the number of samples in which each is observed, to make the problem more computationally tractable.
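As a rough illustration of what that kind of filtering does, here is a minimal sketch in plain Python. It is not q2-makarsa or QIIME 2 code: the function name `filter_features`, the toy table, and the thresholds are all hypothetical and chosen only for the example.

```python
from typing import Dict

# table[feature_id][sample_id] = count; a toy stand-in for a feature table
Table = Dict[str, Dict[str, int]]


def filter_features(table: Table, min_frequency: int = 0,
                    min_samples: int = 0) -> Table:
    """Keep features that meet both a total-count and a prevalence threshold."""
    kept = {}
    for feature_id, counts in table.items():
        total = sum(counts.values())                            # total frequency
        prevalence = sum(1 for c in counts.values() if c > 0)   # samples observed in
        if total >= min_frequency and prevalence >= min_samples:
            kept[feature_id] = counts
    return kept


# Hypothetical toy data: three features across three samples
toy = {
    "featA": {"s1": 120, "s2": 80, "s3": 45},
    "featB": {"s1": 3, "s2": 0, "s3": 0},
    "featC": {"s1": 0, "s2": 60, "s3": 0},
}

filtered = filter_features(toy, min_frequency=50, min_samples=2)
print(sorted(filtered))  # only featA passes both thresholds
```

On a real artifact, the `filter-features` action of the q2-feature-table plugin offers the same two criteria through its `--p-min-frequency` and `--p-min-samples` parameters; sensible threshold values depend on the data.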

zdk123 commented 1 year ago

Unfortunately, I don't have good general advice, because these things are so data-dependent.

Batch processing would definitely ease the memory requirements though, even on a single machine, because intermediate results are stored on disk. Reducing rep.num and/or nlambda would reduce processing time and memory requirements, and increasing lambda.min.ratio forces models to be sparser, which saves memory. Running in bounded StARS mode uses a statistical observation to restrict how much sub-sampling is done on the full lambda path.
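To make the effect of those parameters a little more concrete, here is a back-of-the-envelope sketch in Python. It is not SpiecEasi code, and the helper names are hypothetical. It assumes that the penalty path is log-spaced between a maximum lambda and lambda.min.ratio times that maximum, and that StARS fits one model per subsample per lambda; the parameter values are purely illustrative.

```python
import math


def lambda_path(lambda_max: float, lambda_min_ratio: float, nlambda: int) -> list:
    """Log-spaced penalties from lambda_max down to lambda_min_ratio * lambda_max.

    Assumption: this mirrors how a log-spaced regularisation path is
    typically built; it is not taken from the SpiecEasi source.
    """
    if nlambda == 1:
        return [lambda_max]
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * lambda_min_ratio)
    step = (log_min - log_max) / (nlambda - 1)
    return [math.exp(log_max + i * step) for i in range(nlambda)]


def subsample_fits(rep_num: int, nlambda: int) -> int:
    """Model fits on subsamples, assuming one fit per subsample per lambda."""
    return rep_num * nlambda


def dense_matrix_bytes(n_features: int, bytes_per_value: int = 8) -> int:
    """Size of one dense features-by-features matrix of doubles (an upper bound)."""
    return n_features ** 2 * bytes_per_value


# Illustrative values only; 5000 matches the feature count reported above
path = lambda_path(lambda_max=1.0, lambda_min_ratio=1e-2, nlambda=3)
print(path)                      # a larger lambda.min.ratio raises the smallest penalty
print(subsample_fits(20, 20))    # halving either argument halves the number of fits
print(dense_matrix_bytes(5000))  # bytes for one dense 5000 x 5000 matrix
```

Under these assumptions the work scales with rep.num times nlambda, and a single dense 5000 x 5000 matrix of doubles takes about 0.2 GB. Sparse storage should need less, which is consistent with sparser models saving memory.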

Generally, I recommend starting out by filtering aggressively, at least to see what the networks look like on trusted subsets, without having to wait 3 days for results. You can always scale back up once you're happy and have a sense of the computational requirements of the jobs.

LenaFloerl commented 1 year ago

Great - thanks a lot! I'll try to filter and rerun with a smaller dataset. 💪