franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

Query about SMETANA 'detailed' mode runtimes and parallel processing #64

Closed shreyanshumale closed 3 years ago

shreyanshumale commented 3 years ago

Hello Francisco!

I obtained around 93 MAGs from the metaGEM workflow and ran SMETANA on all 93 organisms. Even though I assigned 30 cores to this job, CPU utilization shows that only one core is in use. Is there a way to make parallel processing work for SMETANA? The runtime is already upwards of 4 days for a single medium (M11). I also tried running SMETANA in 'global' mode on its own, with a communities.tsv file containing several communities of different sizes; in that case CPU utilization shows active parallel processing and runtimes are much lower.

I know that detailed mode is expected to take longer, but:

  1. Is there any data on community size vs. runtime? If runtimes are too high, we could break a large community down into several smaller communities, process them in parallel, and finally combine the results using various centrality measures (similar to the 'global' mode approach).
  2. Is there a way to ensure that parallel processing takes place when 'detailed' mode is run?

Please let me know!

Thanks, Shreyansh

franciscozorrilla commented 3 years ago

Hi Shreyansh,

Yes, that is normal: the --global mode in SMETANA should run much quicker than the --detailed mode.

  1. You may want to check out this post in the SMETANA repo, which discusses how to deal with large communities. Indeed, simulations can take a very long time for communities of more than 50 species; I believe the largest I have simulated had ~60 species and took ~1 week, if I recall correctly.
  2. In my experience, SMETANA with the --detailed flag does not parallelize efficiently, so running with 1 or 2 cores should be enough (with 30 cores you would likely be wasting or under-using resources). I do not think there is a way to ensure parallelization, due to the implementation, but @cdanielmachado is probably the best person to ask.
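The splitting idea from the question can be sketched roughly as follows. This is only an illustration, not part of metaGEM or SMETANA: it draws random sub-communities from the full list of GEM IDs and writes them as a two-column, tab-separated table (community ID, organism ID), which is the layout I understand the SMETANA communities file to use. The organism IDs (`bin_1` ... `bin_93`), the sub-community size, the number of samples, and both function names are hypothetical. Whether results from sub-communities approximate those of the full community is not established here.

```python
import csv
import io
import random


def sample_subcommunities(organisms, size, n_samples, seed=0):
    """Draw n_samples random sub-communities of a fixed size.

    Returns (community_id, organism_id) rows, one per member.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    rows = []
    for i in range(n_samples):
        members = rng.sample(organisms, size)  # sampling without replacement
        for organism in sorted(members):
            rows.append((f"sub_{i + 1}", organism))
    return rows


def write_communities_tsv(rows, handle):
    """Write rows as a two-column tab-separated table without a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical GEM IDs; in practice these would match the model file names.
organisms = [f"bin_{i}" for i in range(1, 94)]
rows = sample_subcommunities(organisms, size=20, n_samples=5)

buffer = io.StringIO()  # stand-in for open("communities.tsv", "w", newline="")
write_communities_tsv(rows, buffer)
```

Since detailed mode reportedly uses only 1 or 2 cores effectively, each sub-community could then be submitted as its own small job, so that the parallelism comes from running many jobs side by side.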

I am curious: Are your 93 MAGs/GEMs from the same sample? What environment was the sample generated from?

Best, Francisco

shreyanshumale commented 3 years ago

Hello Francisco!

Thanks for your reply!

I see! I will try this approach out!!

Yeah, all 93 MAGs are from the same sample. This sample is from a deep-sea hydrothermal vent.

I had another related question. When I use the carve rule, all my models have really high growth rates (30-50), which I found out is because the models are not initialized with any medium. Should I use the 'initialized' models or the unconstrained ones that metaGEM gives out of the box for my SMETANA analyses? Or does it not make a difference?

Please let me know!

franciscozorrilla commented 3 years ago

Hi Shreyansh,

I suspect that the -i flag would not make a difference in your case, since SMETANA constrains/initializes the models based on the desired input simulation media. If you want to check for yourself, you could carve a set of MAGs with and without the -i flag, then simulate each community to verify that the interactions are the same. Note that there is some stochasticity in the carving process, so you may get slightly different GEMs for the same MAG due to multiple equivalent solutions.
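One possible way to do that comparison is sketched below, assuming each simulation produces a tab-separated table of predicted interactions with `receiver`, `donor` and `compound` columns. Those column names, the example rows, and the helper functions are assumptions made for illustration and should be checked against the actual SMETANA output. The sketch reduces each table to a set of (receiver, donor, compound) triples and reports the overlap.

```python
import csv
import io

# Assumed column names identifying one interaction; adjust to the real output.
KEY_COLUMNS = ("receiver", "donor", "compound")


def load_interactions(handle, key_columns=KEY_COLUMNS):
    """Read a tab-separated table and return the set of interaction keys."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {tuple(row[column] for column in key_columns) for row in reader}


def compare_interactions(first, second):
    """Split two interaction sets into shared and run-specific parts."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


# Made-up example tables standing in for the two simulation outputs.
with_init = io.StringIO(
    "receiver\tdonor\tcompound\n"
    "bin_1\tbin_2\tM_ac_e\n"
    "bin_3\tbin_1\tM_h2s_e\n"
)
without_init = io.StringIO(
    "receiver\tdonor\tcompound\n"
    "bin_1\tbin_2\tM_ac_e\n"
    "bin_2\tbin_3\tM_nh4_e\n"
)

first = load_interactions(with_init)
second = load_interactions(without_init)
result = compare_interactions(first, second)

# Jaccard similarity: shared interactions divided by all distinct interactions.
jaccard = len(result["shared"]) / len(first | second)
```

Because carving is somewhat stochastic, a similarity slightly below 1 would not by itself show that the -i flag matters; repeating the comparison on two carvings made with identical settings would give a baseline for that noise.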

Apologies for the super late response! Best, Francisco