bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

Is metaWRAP suitable for large datasets? #246

Open liangyong1991 opened 4 years ago

liangyong1991 commented 4 years ago

I have 100 samples with about 3 TB of data in total. Is metaWRAP suitable for this project, or are there other solutions?

ursky commented 4 years ago

Yes, but use --megahit during assembly and bin with only metabat2 for speed. I answered questions like this in depth in some of the other issue threads.
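
For reference, a minimal sketch of what that could look like with the metaWRAP modules; the sample names, thread counts, and memory limit are placeholders, so check `metawrap assembly -h` and `metawrap binning -h` for the exact options in your installed version:

```bash
# Assemble with MEGAHIT instead of the default metaSPAdes (faster, lower memory)
metawrap assembly -1 sample_1.fastq -2 sample_2.fastq \
    -m 200 -t 48 --megahit -o ASSEMBLY

# Bin with metaBAT2 only (skip maxbin2/concoct for speed on large datasets)
metawrap binning -o INITIAL_BINNING -t 48 \
    -a ASSEMBLY/final_assembly.fasta --metabat2 sample_1.fastq sample_2.fastq
```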

liangyong1991 commented 4 years ago

Thanks a lot for your advice, and I will try it. Do you know how long it would probably take if I used all three binning tools?

ursky commented 4 years ago

I personally wouldn't try - it might take weeks. Maxbin and concoct do not scale very well with enormous datasets. One trick you can use to speed up the binning process (with any binner) is to throw away contigs smaller than 2kb, 3kb, or even 5kb - this will make the bins somewhat less complete, but significantly reduce the search space.
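
One way to apply that length cutoff before binning (this is not a metaWRAP feature; seqkit is just one convenient option, and the 2000 bp threshold is only an example):

```bash
# Keep only contigs >= 2 kb (raise -m to 3000 or 5000 for a stricter cutoff)
seqkit seq -m 2000 ASSEMBLY/final_assembly.fasta > ASSEMBLY/final_assembly.min2kb.fasta
```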

neptuneyt commented 4 years ago

I have the same question about large-scale data: about 200 samples with 6 TB in total, so how should I assemble it? Can I split it into 4 groups of 50 samples (about 400 Gb per subgroup) and assemble each group separately? Do the number of samples and the total data size per group affect the assembly results? I am looking forward to your reply. Thanks a lot!

ursky commented 4 years ago

With the depth you have, you could just assemble and bin all 200 samples individually and then use dRep to get unique bins/MAGs, but then you would miss rarer species that didn't have enough coverage in any one sample yet might have come up if you concatenated some of the data. Grouping a lot of data has the opposite problem: some of the very high-abundance species won't assemble well because of the extra strain heterogeneity. If you decide to process very large chunks of data, I would recommend megahit for assembly and metabat2 for binning; the other methods don't scale well. You could also do both approaches and then use a combination of dRep and manual curation to cherry-pick the best MAGs of each species, depending on which protocol assembled/binned them best.
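
A rough sketch of the per-sample route followed by dereplication with dRep; the sample names, thread counts, and bin directory layout are assumptions based on metaWRAP's standard output structure, so verify them against your own runs:

```bash
# Assemble and bin each sample on its own
for S in sample01 sample02 sample03; do
    metawrap assembly -1 ${S}_1.fastq -2 ${S}_2.fastq -t 48 --megahit -o ASSEMBLY_${S}
    metawrap binning -o BINNING_${S} -t 48 \
        -a ASSEMBLY_${S}/final_assembly.fasta --metabat2 ${S}_1.fastq ${S}_2.fastq
done

# Collect all bins and collapse them into one set of unique MAGs
mkdir -p ALL_BINS && cp BINNING_*/metabat2_bins/*.fa ALL_BINS/
dRep dereplicate DREP_OUT -g ALL_BINS/*.fa
```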

Ultimately I can't say what will give you the best results. It really depends on the complexity and sequencing depth of the individual samples, the priorities of your study (do you care more about the major species or the rare ones?), and your available resources. I would start by experimenting a bit with a few samples or groups to find what works best in your case.

neptuneyt commented 4 years ago

Thanks a lot for your comment; these are very deep and reasonable insights into the metagenomic assembly problem!