bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

Strategy for processing more than 100 samples with metaWRAP #181

Open · liangjinsong opened this issue 5 years ago

liangjinsong commented 5 years ago

Hi, I have metagenomic data (paired-end, 150 bp per read, two ~7 Gb fastq files per sample) from 114 samples. The samples come from the surface water of a river, collected across four seasons. My end goal is to obtain the abundance of each bin (at the strain level) in all samples.

Several strategies exist for binning these samples with metaWRAP:

1. Process each data set (114 in total) separately (assembly, binning, refinement, and so on), then remove redundant (identical) bins from the pooled bin sets of all samples (see the sketch below).
2. Process the samples of each season (~28 samples per season) as one data set, then remove redundant bins from the pooled bin sets of the four seasons.
3. Process all samples as one data set.
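For concreteness, option 1 would look roughly like the per-sample loop below, following the module usage shown in the metaWRAP tutorial. The `samples.txt` list, the `READS/` layout, and the thread/memory flags are placeholders, not part of the original question:

```bash
# Rough sketch of option 1: assemble, bin, and refine each sample separately.
# samples.txt holds one sample ID per line; READS/<ID>_1.fastq and
# READS/<ID>_2.fastq are the cleaned paired reads for that sample.
while read S; do
    metawrap assembly -1 READS/${S}_1.fastq -2 READS/${S}_2.fastq \
        -m 200 -t 24 --megahit -o ASSEMBLY_${S}

    metawrap binning -o BINNING_${S} -t 24 \
        -a ASSEMBLY_${S}/final_assembly.fasta \
        --metabat2 --maxbin2 --concoct READS/${S}_*.fastq

    metawrap bin_refinement -o REFINED_${S} -t 24 \
        -A BINNING_${S}/metabat2_bins/ \
        -B BINNING_${S}/maxbin2_bins/ \
        -C BINNING_${S}/concoct_bins/ \
        -c 70 -x 10   # keep bins >=70% complete, <=10% contaminated
done < samples.txt
```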

Which strategy do you believe is most reasonable? Thank you in advance!

ursky commented 5 years ago

This is a complicated and important question; see my detailed response here: https://github.com/bxlab/metaWRAP/issues/169. In short, it is usually better to co-assemble when possible, but not for all applications. In your case, I would first try option 3 with MEGAHIT and see if it is even computationally feasible. If it is not, I would go with option 2, but with random subsets of samples (for example, ~100 Gb of sequence data at a time) rather than seasons. The reason I would avoid assembling the seasons separately is that your bins will be biased toward the strains present in the samples from which they were recovered, which will skew your abundance estimates. For example, a Cyanobacterium bin extracted from the Spring samples will be slightly different from the same species extracted from the Summer samples, and each bin's estimated abundance will be slightly higher in the samples from which it was assembled.
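To make the random-subset variant concrete, a minimal shell sketch; the pool size, `samples.txt` list, file layout, and resource flags are assumptions, while the assembly command itself follows the metaWRAP tutorial:

```bash
# Shuffle the sample list and split it into pools, then co-assemble each pool.
# At two ~7 Gb fastq files per sample, 7 samples per pool is ~100 Gb of data.
shuf samples.txt | split -l 7 - pool_

for P in pool_*; do
    # Concatenate forward and reverse reads for every sample in this pool.
    while read S; do
        cat READS/${S}_1.fastq >> ${P}_1.fastq
        cat READS/${S}_2.fastq >> ${P}_2.fastq
    done < "$P"

    # Co-assemble the pooled reads with MEGAHIT via metaWRAP.
    metawrap assembly -1 ${P}_1.fastq -2 ${P}_2.fastq \
        -m 500 -t 48 --megahit -o ASSEMBLY_${P}
done
```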

liangjinsong commented 5 years ago

@ursky Thank you very much for your professional and quick response! I now have a much better understanding of binning with metaWRAP. ^o^

liangjinsong commented 5 years ago

I found that the following paper offers an opinion from another perspective.

The highlighted viewpoint from the paper: "individual assembly and de-replication would generate more and higher quality genomes than co-assembly of the read datasets."

dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. The ISME Journal (2017) 11, 2864–2868
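A minimal sketch of that individual-assembly-plus-de-replication route, assuming per-sample refined bins like those in the loop sketched earlier in the thread; `dRep dereplicate` and its `-g`, `-sa`, `-comp`, and `-con` options are documented dRep usage, while the directory names and thresholds here are placeholders:

```bash
# Pool the refined bins from every per-sample run, prefixing each file with
# its sample directory to avoid name clashes (bins are often all "bin.1.fa").
mkdir -p ALL_BINS
for D in REFINED_*; do
    for B in ${D}/metawrap_70_10_bins/*.fa; do
        cp "$B" ALL_BINS/${D}_$(basename "$B")
    done
done

# De-replicate the pooled bins. -sa is the ANI cutoff for secondary
# clustering (0.99 is a common choice for roughly strain-level clusters);
# -comp/-con filter on minimum completeness / maximum contamination.
dRep dereplicate DREP_OUT -g ALL_BINS/*.fa -sa 0.99 -comp 70 -con 10
```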

ursky commented 5 years ago

Yep, I am well aware of this approach. The caveat is that the individual samples need to have a relatively high minimum read count to make this worthwhile, and you will be unable to reconstruct the low-abundance genomes. Ideally, you would do both approaches. I discuss this in depth in this paper: https://www.mdpi.com/2073-4425/10/3/220
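Circling back to the original goal (per-sample bin abundances), a hedged sketch of the final quantification step with metaWRAP's quant_bins module, following the usage shown in the metaWRAP tutorial; the bin directory below assumes the dRep sketch above, and all paths are placeholders:

```bash
# Estimate the abundance of each final bin in every sample; quant_bins uses
# Salmon under the hood and reports a bin-by-sample abundance table.
# If a co-assembly exists, the tutorial also passes it via -a.
metawrap quant_bins -b DREP_OUT/dereplicated_genomes -o QUANT_BINS \
    -t 48 READS/*_1.fastq READS/*_2.fastq
```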