franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License
189 stars 41 forks source link

MAG abundance across samples #66

Closed zoey-rw closed 1 year ago

zoey-rw commented 3 years ago

Hi! It seems the 'abundance' rule maps each sample's reads against the bins generated from that same sample. Reads are not mapped against bins generated from other samples. Is this understanding correct?

If so - why not create one output directory of dereplicated bins from the whole analysis (i.e. flatten the output from reassembled_bins/{sampleID}/reassembled_bins), and map each sample's reads against that directory? Otherwise, how else could you evaluate the abundance of a single species among samples?

I am testing this with the outputs from a relatively unsuccessful analysis (still running the pipeline with better assembly parameters). I input 48 bins (which were mostly very low quality, minimum completeness only 15% to ensure every sample would pass through the metaGEM pipeline). These 48 were combined into only 2 genome clusters. I do expect the results to change when I input better genomes.

Here is how I just tried implementing this using anvi'o. This R code creates the anvi'o input file (with genome names and file paths):

b_directories <- list.files("./reassembled_bins/", pattern = "HARV", full.names = T)
out <- data.frame(name = NULL, path = NULL)
for (sample in b_directories){
    samp.bins <- list.files(paste0(sample, "/reassembled_bins"), pattern=".fa", full.names = T)
    bin_names <- paste0(basename(sample), ".", basename(samp.out))
    samp.bin.info <- cbind(name = bin_names, path = samp.out)
    out <- rbind(out, samp.bin.info)
}
write_tsv(out, "./external-genomes.tsv")

Then, from within the same metaGEM directory:

module load fastani
module load miniconda 
module load anvio 
conda activate anvio-7.0
anvi-dereplicate-genomes -f external-genomes.tsv -o derep_genomes/ —-skip-fasta-report --program fastANI --similarity-threshold 0.85 —-representative-method Qscore

The two bins selected as "representative" for each cluster could then be used for mapping. Do you think this is a sound approach?

franciscozorrilla commented 3 years ago

Hi Zoey,

That is correct, the abundance rule will only quantify the abundance of MAGs within the sample they are generated from. The assumption/rationale is that users will also assign taxonomy to each MAG, and so you can evaluate the abundance of species across samples. This is how we generated sup. fig. 5 in the paper, where we look at the abundance of community members across a time series:

image

However, I can see how this may lead to issues for novel/unknown species or simply even MAGs that do not get species level assignment. I think that your approach defintely makes sense, and would likely be the best course of action for such cases. I have created an issue (#68) for adding dRep to the workflow for MAG dereplication.

Please let me know if you have further questions! Best, Francisco