DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0

Suggestion- more efficient contaminant read removal #59

Closed. rotifergirl closed this issue 6 years ago

rotifergirl commented 6 years ago

First of all, let me say that blobtools has been an excellent program to work with. However, I was initially a little disappointed with the results of the contaminant cleaning: my second and third re-assemblies still had a significant portion of contamination being assembled. I did find a very easy way around this, though. I didn't want to label too many contigs in my initial assembly as contaminants and risk losing useful information, but this meant I ended up with a lot of small "no-hit" contigs that were contaminants.

But I figured most of my contaminants would be of similar origin, so instead of extracting reads from my whole-genome mapping, I re-mapped my reads to ONLY the contigs I had identified as contaminants. From this mapping, I took only the UNMAPPED reads for my next assemblies and got much nicer results. In effect, I used the contigs I was pretty sure were contaminant assemblies as a magnet for contaminant reads. I'm now really happy with how my reassembled genomes look, and I figured this might help someone who is struggling to get the decontamination result they are after. I didn't even see the need for a second round of cleaning with this approach, whereas with the initial approach I was still unhappy with my assemblies after the third iteration.
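For anyone who wants to try the "magnet" idea, here is a rough sketch of the read-selection step only, assuming the reads have already been mapped to the contaminant contigs and the alignments are available as SAM text. The function names (`keep_read`, `clean_read_names`) are made up for illustration and are not part of blobtools; the only thing taken as given is the meaning of the FLAG bits in the SAM format. The choice to keep a pair only when BOTH mates are unmapped is my assumption, made so that a pair with one mate attracted by a contaminant contig is dropped as a whole.

```python
from typing import Iterable, Iterator

# FLAG bits as defined in the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_read(flag: int) -> bool:
    """Return True if a record looks 'clean', i.e. not attracted by the magnet.

    Assumption: for paired reads, both mates must be unmapped.
    """
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False  # extra alignment records only exist for mapped reads
    if not flag & UNMAPPED:
        return False  # the read itself mapped to a contaminant contig
    if flag & PAIRED and not flag & MATE_UNMAPPED:
        return False  # the mate mapped, so drop the whole pair
    return True


def clean_read_names(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield each read name once if it should go into the next assembly."""
    seen = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if keep_read(flag) and name not in seen:
            seen.add(name)
            yield name


# Toy records (hypothetical read and contig names; only the first
# columns matter for this sketch).
example = [
    "@SQ\tSN:contam_contig_1\tLN:5000",
    "clean1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "clean1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "contam1\t99\tcontam_contig_1\t10\t60\t4M\t=\t50\t44\tACGT\tIIII",
    "contam1\t147\tcontam_contig_1\t50\t60\t4M\t=\t10\t-44\tTTGA\tIIII",
    "half1\t73\tcontam_contig_1\t10\t60\t4M\t=\t10\t0\tACGT\tIIII",
    "half1\t133\tcontam_contig_1\t10\t0\t*\t=\t10\t0\tTTGA\tIIII",
    "single_clean\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

names = list(clean_read_names(example))
print(names)
```

The resulting names could then be used to pull the matching records out of the original FASTQ files. On real data it is probably simpler to let an alignment toolkit do this filtering on the flags directly; the sketch is only meant to make the selection logic explicit.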

Hope this helps someone!

DRL commented 6 years ago

Hi rotifergirl,

yes, this is a common issue, since many assemblers do not use ALL reads when constructing contigs, especially if the metagenomic complexity of the dataset is high.

Regarding your solution, that is essentially Workflow B in the BlobTools paper. So kudos for figuring it out yourself!

cheers,

dom