low number of species after binning for metaGEM model reconstruction

Zhaoju-Deng commented 1 year ago

Hi all, I followed the metaGEM workflow (except that I used metaSPAdes and DAS Tool instead of MEGAHIT and metaWRAP). When I used GTDB-Tk to classify the bins that DAS Tool selected from the MetaBAT2 and MaxBin2 binning results, I obtained only 8-12 species in most of my samples (fecal samples from dairy cows). Is this an underestimate of the microbiome composition? If I use those 8-12 genomes per sample to construct GEMs and then simulate the interactions among those species, would I miss most of the interactions among species? I found many more species by running Kraken2 and MetaPhlAn4 on the clean reads. Should I instead take the species list from Kraken2 or MetaPhlAn4 and download the reference genomes of those species to reconstruct GEMs (or directly use the models in AGORA2: Genome-scale metabolic reconstruction of 7,302 human microorganisms for personalized medicine, https://www.vmh.life/files/reconstructions/AGORA2/version2.01/), in order to include the potential interactions among species in the microbiome?

Many thanks, Zhaoju DENG

franciscozorrilla commented 1 year ago

Hey Zhaoju,

The numbers you describe sound pretty normal to me: I would expect the assembly-free approach used by short-read profilers like Kraken/MetaPhlAn/mOTUs to return many more "hits" for genomes than the assembly-based approach used by metaGEM and similar workflows.

However, I would warn that many of the low relative abundance hits from short-read profilers may be false positives from closely related species. I would expect that if you use a relative abundance cutoff to filter out low abundance species from the short-read profiler output, then the number of species will start to approach the number obtained from the assembly-based approach. This reflects the fact that assembly-based approaches work great for high coverage or high abundance genomes, but not so well for low abundance/coverage genomes.
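To make the cutoff idea concrete, here is a minimal sketch of filtering a profiler-style species table by relative abundance. The species names, read counts, and the 1% threshold are all made up for illustration; they are not taken from Kraken2/MetaPhlAn4 output or from any recommended setting, so pick a cutoff that suits your own data.

```python
def filter_by_relative_abundance(counts, min_rel_abundance=0.01):
    """Return {species: relative abundance} for species at or above the cutoff.

    `counts` maps species name -> assigned read count (hypothetical input format).
    `min_rel_abundance` is a fraction of total assigned reads (0.01 = 1%, an
    arbitrary example value, not a recommendation).
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total read count must be positive")
    kept = {}
    for species, n_reads in counts.items():
        rel = n_reads / total
        if rel >= min_rel_abundance:
            kept[species] = rel
    return kept


# Hypothetical profile: three abundant species and two low-abundance hits.
example_counts = {
    "species_A": 5000,
    "species_B": 3000,
    "species_C": 1900,
    "species_D": 60,
    "species_E": 40,
}

kept = filter_by_relative_abundance(example_counts, min_rel_abundance=0.01)
print(sorted(kept))  # the two species below 1% are dropped
```

Comparing the number of species that survive a few different cutoffs against the number of bins you recovered should show roughly where the two approaches start to agree.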

That said, 8-12 genomes is on the lower side. Did you use coverage across multiple samples for binning? This extra mapping information should allow you to get more out of your samples. As an example, consider this publication (https://doi.org/10.1016/j.cell.2019.01.001), where they reconstruct 154,723 genomes from 9,428 human gut metagenomes = ~16 genomes/sample; note that they did not use coverage across multiple samples for binning. In the metaGEM paper (https://doi.org/10.1093/nar/gkab815), we reconstructed 4,133 genomes from 137 human gut metagenomes = ~30 genomes/sample; note that this was using coverage across samples.
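For reference, the per-sample averages above are just the genome totals divided by the number of metagenomes. A quick sketch of the arithmetic (the variable names are only labels for the two studies cited above):

```python
def genomes_per_sample(n_genomes, n_samples):
    """Average number of reconstructed genomes per metagenome."""
    return n_genomes / n_samples


# Totals quoted above for the two studies.
without_cross_sample_coverage = genomes_per_sample(154_723, 9_428)
with_cross_sample_coverage = genomes_per_sample(4_133, 137)

print(f"{without_cross_sample_coverage:.1f}")  # 16.4
print(f"{with_cross_sample_coverage:.1f}")     # 30.2
```

Keep in mind these are averages over whole datasets that differ in sequencing depth and sample complexity, so they are only a rough benchmark for your own samples.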

Note also that the sequencing depth and complexity of your samples will play a big role in the number of genomes reconstructed: if your samples are sequenced very shallowly and are complex, then you will recover a low number of genomes. If possible, try increasing sequencing depth in your next experiment, or search for a dataset with higher sequencing depth.

I think that the approach you mention regarding the usage of short-read profilers to select AGORA models for simulation is understandable, but not very elegant. The whole point of metaGEM is to enable direct reconstruction of metabolic models from metagenomes in order to capture the context- and strain-specific information available in your sequencing samples, which is missing from reference genomes and reference-genome-based metabolic models (e.g. AGORA). Consider the following text from the metaGEM paper:

Pangenome analysis of the human gut microbiome demonstrated that the functional repertoire of gut species differ significantly, with a median core genome proportion of only 66% [14], revealing differences in metabolic potentials of individual microbiomes.

There is significant variation in the functional repertoire of the same species across humans, and I would expect the differences in metabolism of the same microbial species between humans and cows to be even greater.

Hope this helps, let me know if you have further questions! Best, Francisco

franciscozorrilla / metaGEM

low number of species after binning for metaGEM model reconstruction #125