bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

dereplicating merged sample assemblies for binning #383

Open redekarnr opened 3 years ago

redekarnr commented 3 years ago

I am interested in generating metagenome assemblies and bins from 650 samples with a total data size of 4.2 TB. I am running into out-of-memory errors at the merged assembly step, even with 900 GB of memory (my current limit). Therefore, I am thinking of generating individual sample assemblies (using both --metaspades and --megahit) rather than one merged assembly.

Similar topics have been discussed before in multiple threads, but I am not clear about the order of steps after the sample assemblies are generated. Should I merge the individual sample assemblies, run dRep on the merged assembly, and then do binning? Or should I run binning on the individual sample assemblies first, merge all sample bins, and then run dRep?

Also, how essential is it to run dRep? I ask because dRep dereplicate by default keeps only genomes at least 50 kb long with at least 75% completeness and at most 25% contamination, while the assembly scaffolds generated by metaWRAP are at least 1.5 kb long. I worry dRep might eliminate smaller scaffolds. How would dRep affect the binning process?

I would really appreciate some insights. Thank you.

ursky commented 3 years ago

Process individual samples and get the bins, then run dRep. This will remove tiny bins, but that's just a limitation of dRep. If you don't run it, you will be left with a bunch of duplicated bins. It's ultimately up to you to decide how you want to handle the duplicates (if at all).
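
To illustrate the "tiny bins" point, here is a minimal Python sketch of a dRep-style pre-filter applied to bins pooled from several samples. It is not dRep's code. The thresholds are the defaults quoted in the question (50 kb, 75% completeness, 25% contamination); the `Bin` class, the `passes_filter` helper, the bin names, and all numbers are hypothetical, and treating the thresholds as inclusive is an assumption. The sketch assumes the length cutoff applies to the total length of a bin rather than to each scaffold, so a bin made only of 1.5 kb scaffolds can still pass if it is large enough overall.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bin:
    """Hypothetical record for one bin produced from a single sample."""
    name: str
    scaffold_lengths: List[int]  # length of each scaffold in bp
    completeness: float          # percent, e.g. from a CheckM-style estimate
    contamination: float         # percent

    @property
    def total_length(self) -> int:
        # The size filter looks at the whole bin, not at single scaffolds.
        return sum(self.scaffold_lengths)


def passes_filter(b: Bin,
                  min_length: int = 50_000,
                  min_completeness: float = 75.0,
                  max_contamination: float = 25.0) -> bool:
    """Return True if the bin survives the size and quality cutoffs.

    Inclusive comparisons are an assumption made for this sketch.
    """
    return (b.total_length >= min_length
            and b.completeness >= min_completeness
            and b.contamination <= max_contamination)


# Made-up bins pooled from three per-sample binning runs.
bins = [
    Bin("sampleA.bin.1", [1500] * 40, 82.0, 3.1),        # 60 kb total: kept
    Bin("sampleB.bin.7", [1500] * 20, 90.0, 1.0),        # 30 kb total: too small
    Bin("sampleC.bin.3", [250_000, 120_000], 61.0, 2.0), # completeness too low
]

kept = [b.name for b in bins if passes_filter(b)]
print(kept)  # ['sampleA.bin.1']
```

Under these assumptions, only whole bins are dropped, so small scaffolds inside a bin that passes are left in place.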