franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

Questions from /r/bioinformatics #92

Closed: franciscozorrilla closed this issue 2 years ago

franciscozorrilla commented 2 years ago

Questions from /r/bioinformatics user:

What was your smallest input? (If I use short reads, it would be from a MiSeq using 300x2, but these only put out around 15 GB. I know other SBS systems are more popular; more popular platforms have shorter reads [e.g. 150x2, maybe 300x1 for GEM work] but much larger outputs of at least 40 GB.)

That should be plenty, I have generated MAGs from much smaller samples. The smallest dataset analyzed in the metaGEM publication was the 6-species artificial lab community of 48 samples taken across 12 time points. The samples were 78 bp single-end reads ranging between 102 Mbp and 2.5 Gbp, and they yielded between 1 and 6 MAGs per sample, as expected from such small artificial communities (see fig 2D, source). Note: the community is supposed to contain 7 species, but B. thetaiotaomicron flatlines at close to 0 relative abundance, as you can see in fig 2D. The more realistic human gut microbiome dataset had 100 bp paired-end reads, with samples ranging between ~0.5 and 5 Gbp, and yielded between 2 and 65 MAGs per sample.

I am curious about your residential/industrial water samples: how complex are they? How many unique species do you expect to find? Are you only interested in bacteria and archaea, or also in eukaryotic genomes, if any are present? You might even consider shallower sequencing to get more samples, depending on the complexity of your community of interest.

How much time on how many nodes did metaGEM take to complete a smaller input?

Runtime will depend on the size and complexity of the samples/communities, and the computational resources required also depend on the size of your dataset (i.e. number of samples). The samples are largely processed in parallel, and you don't need any pre-determined number of nodes, although the more cores + RAM you have, the better.

The largest computational hurdle is usually the assembly step, whose runtime also varies with the parameters and resources (number of cores + RAM) provided. If my memory serves, the assemblies for the artificial lab communities took between a few minutes and an hour or two and did not need much in the way of computational resources (e.g. 6 cores and 40 GB RAM per sample). Note that you will probably need more RAM for more complex microbiomes like the human gut or environmental samples. You can always submit a few jobs with your lower estimate of required resources and monitor them to see if they finish in a reasonable time or need more resources (see the sketch below).
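For concreteness, here is roughly how that submit-and-monitor loop could look with the metaGEM wrapper on a SLURM cluster. The task name and flags are as I remember them, so treat this as a sketch and check the wrapper's usage message rather than taking it verbatim.

```bash
# Rough sketch: submit the assembly jobs with a conservative resource request,
# then monitor them and resubmit with more RAM if they fail or time out.
# Task and flag names are as recalled from the metaGEM wrapper; check the
# usage message of metaGEM.sh for the exact options on your installation.

# e.g. one MEGAHIT assembly job per sample: 6 cores, 40 GB RAM, 24 h walltime
bash metaGEM.sh -t megahit -j 10 -c 6 -m 40 -h 24

# monitor on a SLURM cluster
squeue -u "$USER"                                    # still queued/running?
sacct --format=JobID,JobName,Elapsed,MaxRSS,State    # peak RAM of finished jobs

# if assemblies are killed for exceeding memory, resubmit with a larger
# request (e.g. -m 128) for more complex gut or environmental samples
```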

After that, another big computational hurdle can be the cross-mapping step, where each set of short reads is mapped against each sample's assembly, i.e. n² mapping operations where n = number of samples. Naturally, if you have a large number of samples, as with the TARA oceans dataset of 246 samples, the cross-mapping will take some time. For datasets with a large number of samples you can always break them into more manageable subsets of e.g. 50 samples as an alternative.
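To make the scaling explicit, the cross-mapping is conceptually a nested loop over samples: every sample's reads get aligned against every sample's assembly, giving n × n alignment jobs. A minimal sketch of that idea; the directory layout and the bwa/samtools tool choice are illustrative, since metaGEM handles this step through its own Snakemake rules.

```bash
# Illustrative sketch of the n^2 cross-mapping: align every sample's reads
# against every sample's assembly. Paths and tools are placeholders.
SAMPLES=$(ls assemblies)   # assumes one directory per sample, e.g. assemblies/<sample>/contigs.fasta

for target in $SAMPLES; do
    bwa index "assemblies/$target/contigs.fasta"
    for reads in $SAMPLES; do
        bwa mem -t 8 "assemblies/$target/contigs.fasta" \
            "qfiltered/$reads/R1.fastq.gz" "qfiltered/$reads/R2.fastq.gz" |
            samtools sort -@ 4 -o "crossmap/${target}_vs_${reads}.bam"
    done
done
# with n samples this is n*n alignments, which is why splitting very large
# datasets (e.g. TARA oceans) into subsets of ~50 samples keeps it tractable
```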

How much RAM did the pipeline saturate? (Using a high or low RAM node on your HPC? I don't know if the constituents of metaGEM tend to be CPU or RAM bound.) (I'd like to see how this runs on the EPYC 7443P workstation I arranged for a coworker to have starting January. But not if it may take weeks due to 64 GB ECC or 24/48 core limitations, he'd murder me!)

Just to clarify, metaGEM does not run on an entire node; rather, it uses the HPC + Snakemake to submit a job for each task/step in the pipeline for each sample, thus benefiting from maximum parallelization. With that in mind, the most RAM-consuming step in the pipeline was actually GTDB-Tk, used for taxonomic classification of MAGs; if I recall correctly it tries to load ~250 GB into RAM. At least this step only needs to run once to classify all the MAGs in one go. As mentioned above, the assembly step can also be very RAM-hungry for large/deeply sequenced metagenomes.
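Since the memory peak comes from GTDB-Tk loading its reference data, the practical upside is that a single run over the complete MAG set is enough. A hedged sketch of that one-shot classification; directory names are placeholders, and metaGEM wraps this step in its own rule.

```bash
# One-shot classification of all MAGs so the large GTDB reference
# (~250 GB of RAM at the time) only needs to be loaded once.
# Paths are placeholders; confirm flags against your GTDB-Tk version.
gtdbtk classify_wf \
    --genome_dir final_MAGs/ \
    --extension fa \
    --out_dir gtdbtk_output/ \
    --cpus 16
```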

If you expect small/low-complexity metagenomes like the ones in the small artificial community dataset, then I think you should have no problem running on a workstation with 64 GB RAM + 24/48 cores. However, you would likely have to run your samples/jobs in series rather than in parallel as on the cluster, so it may take an unreasonably long time to process your samples depending on the size of your dataset. Unless you can get substantially more cores + RAM into the workstation to run multiple samples in parallel, I would consider using an HPC cluster for large-scale analysis.
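If you do try it on the workstation, Snakemake itself can act as the local scheduler and queue the jobs within a fixed core/RAM budget rather than submitting to a cluster. A hedged sketch, assuming the workflow's rules declare a `mem_mb` resource (worth verifying in metaGEM's Snakefile before relying on the cap):

```bash
# Run the workflow locally, letting Snakemake queue jobs within a fixed budget
# instead of submitting them to a cluster scheduler.
# The mem_mb cap only has an effect if the rules declare mem_mb, so treat it
# as an assumption to verify against metaGEM's Snakefile.
snakemake --snakefile Snakefile \
    --cores 24 \
    --resources mem_mb=60000 \
    --keep-going
```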

Did you use fastp just for barcode removal, or both barcodes and a quality threshold (Q30?)? Just curious what the results were with lower-quality DNA included. Q scores can go down after the first N Gb (depending on flow cell load density), so you could be missing out on a good chunk of pipeline input if fastp parameters are cutting out DNA that is "bad for SNV work, fine for GEM work".

We use fastp with default settings; you can see what all the defaults are in the fastp documentation. Regarding your loss-of-good-data concern, I would be inclined to agree with you, but with short-read assembly you really want the cleanest input possible to get the best quality assemblies. Also, if you have a look at the quality filtering plot for the small artificial community below, you can see that most of the information is retained after filtering (only the little green part at the end of the bars gets thrown out), at least in this example of a low-complexity dataset.

[Figure: quality filtering plot for the small artificial community dataset]

Out of curiosity I had a look at the same plot for the gut microbiome dataset, and there is much more filtering of data here. In fact it looks like ~10-50% of the information can get filtered out of a given sample.

[Figure: quality filtering plot for the human gut microbiome dataset]

It would indeed be interesting to see how lowering the quality filtering threshold affects the assembly of contigs, the binning into MAGs, and the final quality of the reconstructed metabolic models.
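For anyone who wants to test that, the relevant fastp knobs are the per-base quality cutoff and the unqualified-base percentage (Q15 and 40% by default). A hedged sketch of a more permissive run, with placeholder file names:

```bash
# Relax fastp's filtering relative to its defaults (Q15 per base, up to 40%
# unqualified bases allowed per read) to see how much extra input survives
# and how that propagates to assembly, binning, and model quality.
# File names are placeholders.
fastp \
    -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
    -o filtered_R1.fastq.gz -O filtered_R2.fastq.gz \
    --qualified_quality_phred 10 \
    --unqualified_percent_limit 50 \
    --json sample_fastp.json --html sample_fastp.html
```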

Have you tested with long reads? (I prefer nanopore cost & output, plus q-scores improving rapidly)

I have not personally analyzed any long-read datasets yet, but it is definitely an interesting idea! You could assemble bacterial MAGs from long reads using your own pipeline and then feed those to metaGEM for generating and simulating communities of GEMs. You are more than welcome to contribute to the development of metaGEM by pushing new Snakemake rules for processing long reads.
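As a rough starting point, here is a hedged sketch of what that hand-off could look like for nanopore reads, using Flye's metagenome mode for the assembly. Flye is my suggestion here, not something the current metaGEM workflow wraps, and the binning step is left out for brevity.

```bash
# Assemble nanopore reads outside metaGEM with Flye in metagenome mode,
# then bin the assembly with your tool of choice and hand the resulting MAGs
# to metaGEM's downstream model reconstruction/simulation steps.
flye --nano-raw sample_nanopore.fastq.gz \
    --meta \
    --out-dir flye_assembly/ \
    --threads 16
```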