bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

concatenate after assembly #206

Open yjiakang opened 4 years ago

yjiakang commented 4 years ago

Hi, I noticed that you concatenate all the paired-end raw reads into two files, _1.fastq and _2.fastq, and then run the assembly and all further analysis on the pooled reads. I am wondering whether it would also be OK to concatenate the contigs from the assemblies I have already done, then concatenate the corresponding raw reads, and use those for the downstream steps (i.e. binning, bin_refinement, etc.). Thanks in advance.
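For readers following along, the read pooling described above can be sketched in a few lines of Python. This is only an illustration, not metaWRAP's own code: the function name `pool_paired_reads` and the file names are hypothetical, and it assumes uncompressed FASTQ files that each end with a newline. The one thing that matters is that the forward and reverse files are appended in the same sample order, so mates stay in sync.

```python
import shutil
from pathlib import Path


def pool_paired_reads(samples, out_prefix):
    """Append per-sample FASTQ files into <out_prefix>_1.fastq and <out_prefix>_2.fastq.

    `samples` is a list of (forward_path, reverse_path) tuples. Both output
    files receive the samples in the same order, so read pairs stay aligned.
    Assumes uncompressed FASTQ files that end with a newline.
    """
    out_1 = Path(f"{out_prefix}_1.fastq")
    out_2 = Path(f"{out_prefix}_2.fastq")
    with open(out_1, "wb") as fwd_out, open(out_2, "wb") as rev_out:
        for fwd, rev in samples:
            with open(fwd, "rb") as fh:
                shutil.copyfileobj(fh, fwd_out)  # stream, do not load into memory
            with open(rev, "rb") as fh:
                shutil.copyfileobj(fh, rev_out)
    return out_1, out_2


# Hypothetical usage:
# pool_paired_reads([("s1_1.fastq", "s1_2.fastq"), ("s2_1.fastq", "s2_2.fastq")], "ALL_READS")
```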

ursky commented 4 years ago

Concatenating contigs from individual assemblies would not work because you will have a large number of duplicate sequences. There are tools out there that can de-replicate and combine multiple assemblies, but I cannot personally recommend this unless it is not possible to co-assemble. Additionally you will lose any low-abundance organisms that could have been assembled if you used all the reads to begin with.
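To get a rough feel for the duplication problem before concatenating assemblies, something like the following Python sketch could be used. It is not one of the de-replication tools mentioned above and the function names are made up for illustration. It only reports contigs whose sequence is exactly identical in more than one assembly, ignoring reverse complements and near-identical contigs, so the real redundancy is likely to be considerably higher than what it finds.

```python
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shared_sequences(assembly_paths):
    """Return {sequence: [files]} for sequences present in more than one assembly.

    Exact, case-insensitive matches only; this is a lower bound on redundancy.
    """
    seen = defaultdict(set)
    for path in assembly_paths:
        for _, seq in read_fasta(path):
            seen[seq.upper()].add(str(path))
    return {seq: sorted(files) for seq, files in seen.items() if len(files) > 1}
```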

yjiakang commented 4 years ago

@ursky Thank you very much for your professional answer. I will try this on my 12 samples.

TJrogers86 commented 3 years ago

> Concatenating contigs from individual assemblies would not work because you will have a large number of duplicate sequences. There are tools out there that can de-replicate and combine multiple assemblies, but I cannot personally recommend this unless it is not possible to co-assemble. Additionally you will lose any low-abundance organisms that could have been assembled if you used all the reads to begin with.

This worries me now. I have 40 metagenomes that I assembled individually with metaSPAdes. I then concatenated all the assemblies into a single file named 'final_assemblies.fasta' and passed it to metaWRAP's binning and bin refinement modules. In the end I was left with 406 MAGs with completeness ≥ 70% and contamination ≤ 5%. Was this the wrong way to do it? Should I redo everything from scratch by concatenating the reads first?

ursky commented 3 years ago

It's not ideal, but if you got results you are happy with, it is still OK. The issue is that you can have contigs from different samples (and therefore different taxa) in the same cluster. Ideally, you should have processed the samples independently and then used dRep to get a unique set of MAGs.
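One way to check how often contigs from different samples ended up in the same bin is sketched below in Python. It relies on an assumption that is not part of metaWRAP: that each contig header was prefixed with its sample ID and a separator (here `|`, e.g. `>sampleA|NODE_1`) before the assemblies were concatenated. The function name `samples_in_bin` is hypothetical.

```python
from collections import Counter


def samples_in_bin(bin_fasta, sep="|"):
    """Count contigs per sample in one bin FASTA file.

    Assumes headers look like '>sampleID|contig_name' (a hypothetical naming
    convention applied before the assemblies were concatenated).
    """
    counts = Counter()
    with open(bin_fasta) as fh:
        for line in fh:
            if line.startswith(">"):
                sample = line[1:].strip().split(sep, 1)[0]
                counts[sample] += 1
    return counts


# Hypothetical usage: a bin whose Counter has more than one key mixes samples.
# print(samples_in_bin("bin.1.fa"))
```

A bin drawing contigs from many samples is not necessarily wrong, but it is worth a closer look.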

TJrogers86 commented 3 years ago

Awesome, thanks! I will keep that in mind and concatenate at the read level in the future, as you recommended above. I appreciate the clarification.