franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

crossMapSeries vs crossMapParallel #60

Closed by slambrechts 3 years ago

slambrechts commented 3 years ago

Hi,

I tried bash metaGEM.sh -t crossMap -c 48 --local, but it didn't work. I see that there are now two different versions of crossMap in the metaGEM core workflow instead. I'm not entirely sure what the difference is between these two methods, and was wondering whether there is any documentation detailing the differences?

Kind regards, Sam

franciscozorrilla commented 3 years ago

Hi Sam,

Thanks for raising the issue. This topic does indeed need some additional documentation in the wiki; I will add it to the to-do list!

At the moment, you can see some documentation in the Snakefile comment/message sections, for example:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/Snakefile#L390-L406

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/Snakefile#L538-L552

In short, the crossMapSeries rule launches one job per sample, and each of those jobs runs a for loop that maps each set of qfiltered reads against one assembly. The crossMapParallel rule, on the other hand, submits one job per mapping operation (e.g. mapping reads from sample X against assembly Y). On a more practical note, crossMapSeries requires you to temporarily store as many bam files as there are samples in your dataset before generating the contig coverage across samples. This can quickly become impractical for large datasets.

As you can see, crossMapSeries is the default option because it can generate contig coverage for all three binners, whereas crossMapParallel scales better for large datasets but only generates contig coverage across samples for CONCOCT. In the metaGEM paper we used crossMapParallel for the TARA Oceans dataset of 246 paired-end samples. If you have not already, I would recommend reading the methods section of the metaGEM preprint, in particular the Contig coverage estimation and binning subsection, which describes the differences between the methods.
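
The difference in job layout between the two rules can be sketched in a few lines of Python. This is only an illustration of the scheduling idea, not code from the metaGEM Snakefile; the function names and sample IDs are hypothetical.

```python
from itertools import product


def series_jobs(samples):
    """crossMapSeries-style layout: one job per assembly.

    Each job loops over every sample's reads and maps them
    against that job's single assembly.
    """
    return {
        assembly: [(reads, assembly) for reads in samples]
        for assembly in samples
    }


def parallel_jobs(samples):
    """crossMapParallel-style layout: one job per mapping operation."""
    return [(reads, assembly) for reads, assembly in product(samples, repeat=2)]


# Hypothetical sample IDs, for illustration only.
samples = ["sampleA", "sampleB", "sampleC"]

series = series_jobs(samples)
parallel = parallel_jobs(samples)

print(len(series))    # 3 jobs, each performing 3 mappings
print(len(parallel))  # 9 jobs, each performing 1 mapping
```

Both layouts perform the same N x N mapping operations; they differ only in how those operations are grouped into jobs.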

You may also find the discussion in issue #57 relevant. Let me know if you have further questions regarding this topic!

Best wishes, Francisco

slambrechts commented 3 years ago

Hi Francisco,

Ok I understand now.

On a more practical note, crossMapSeries requires you to temporarily store bam files equal to the number of samples in your dataset before generating the contig coverage across samples, this can quickly become impractical for large datasets.

=> This was going to be my next question. In the scratch intermediate files folder I noticed these temporary crossMap files take up 226 GB for the one sample that has already finished. And indeed, since I have 43 samples, this will become impractical. After crossMapSeries has finished running for a specific sample, can I delete the corresponding folder for that sample from the scratch intermediate files folder?

Best Wishes, Sam

franciscozorrilla commented 3 years ago

Absolutely, you can (and probably should) delete most folders in scratch/ after jobs have finished running. This is especially true for the crossMap folder since it will be storing N^2 sorted bam files after finishing, where N = number of samples in your dataset.
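
As a rough back-of-the-envelope estimate using the numbers from this thread (43 samples, 226 GB for the one finished sample), the arithmetic looks like this. It assumes every sample needs about as much space as the finished one, which may not hold in practice; the function name is hypothetical.

```python
def crossmap_storage(n_samples, gb_per_sample):
    """Estimate the crossMap footprint if nothing is deleted.

    Assumes each sample's job leaves n_samples bam files behind and
    uses roughly gb_per_sample of disk space (an assumption).
    """
    n_bam = n_samples ** 2                # N^2 sorted bam files in total
    total_gb = n_samples * gb_per_sample  # N jobs, each using gb_per_sample
    return n_bam, total_gb


n_bam, total_gb = crossmap_storage(43, 226)
print(n_bam)     # 1849
print(total_gb)  # 9718
```

Under that assumption, the crossMap folder would grow to roughly 9.7 TB for 43 samples if no intermediate files were removed.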

Best, Francisco

slambrechts commented 3 years ago

ok great, thanks.

So I can also safely delete the assemblies folder from the scratch/ intermediate files folder?

franciscozorrilla commented 3 years ago

Yes, you can safely delete assemblies/ and all other subfolders in scratch/ after the jobs have finished running. The files remaining in the scratch/ subfolders are mostly useful for troubleshooting if jobs fail, or if you want to extract any intermediate result files that are not used directly by metaGEM.
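
A minimal sketch of such a cleanup in Python is shown below. It is not part of metaGEM: the function name is hypothetical, and the folder names you pass in must match your own scratch/ layout. It defaults to a dry run that only reports how much space would be freed.

```python
import shutil
from pathlib import Path


def remove_scratch_subfolder(scratch_root, subfolder, dry_run=True):
    """Delete scratch_root/subfolder and return the bytes it occupied.

    With dry_run=True (the default) nothing is deleted; only the
    size is reported. Returns 0 if the subfolder does not exist.
    """
    root = Path(scratch_root).resolve()
    target = (root / subfolder).resolve()
    # Refuse to touch anything outside the scratch folder.
    if root not in target.parents:
        raise ValueError(f"{target} is not inside {root}")
    if not target.is_dir():
        return 0
    size = sum(f.stat().st_size for f in target.rglob("*") if f.is_file())
    if not dry_run:
        shutil.rmtree(target)
    return size


# Example call (paths are assumptions about your setup):
# freed = remove_scratch_subfolder("scratch", "crossMap", dry_run=False)
```

Only run it with dry_run=False once the jobs that wrote to the folder have finished, since the remaining files are what you would need for troubleshooting a failed job.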

slambrechts commented 3 years ago

ok clear, thank you