bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

question regarding output #237

Open mattg386 opened 4 years ago

mattg386 commented 4 years ago

Hello,

I have a question about my output. Running through the metaWRAP pipeline, we got initial bins (87 from CONCOCT, 5 from MetaBAT, and 7 from MaxBin), but when we ran bin_refinement we had to relax the completeness and contamination thresholds considerably to get any bins at all: with 30 samples, 50% completeness / 10% contamination gave 1 bin, and 30% completeness / 15% contamination gave 6 bins. When we then ran the quant_bins module, the bin abundance table contained many zeros (see below), so we were unable to make heat maps, etc. Looking back at the read_qc output after bin refinement, the main thing I notice is a lot of overrepresented sequences and a high sequence duplication level.

Bin abundance table
| Genomic bins | CCB569 | CCB568 | CCB561 | CCB560 | ... | CCB330 | CCB106 | CCB105 |
|---|---|---|---|---|---|---|---|---|
| bin.1 | 0 | 0 | 0 | 0 | ... | 202.578917 | 0 | 0 |
| bin.4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 55.7605215 |
| bin.2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 |
| bin.5 | 0 | 0 | 0 | 0 | ... | 98.770757 | 0 | 0 |
| bin.7 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 64.147188 |
| bin.3 | 0 | 0 | 0 | 275.234852 | ... | 0 | 0 | 0 |
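To quantify how sparse a table like this is, a short script can report the fraction of zero-abundance samples per bin. This is a minimal sketch, not part of metaWRAP; it assumes a tab-separated table with bins as rows and samples as columns, and the toy data below just mirrors the shape of the table above:

```python
import csv
import io

def zero_fraction(table_text, delimiter="\t"):
    """Return, for each bin, the fraction of samples where its abundance is zero."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    next(reader)  # skip the header row of sample names
    fractions = {}
    for row in reader:
        values = [float(v) for v in row[1:]]
        zeros = sum(1 for v in values if v == 0.0)
        fractions[row[0]] = zeros / len(values)
    return fractions

# Toy two-sample example shaped like the table above
demo = "Genomic bins\tCCB569\tCCB106\nbin.1\t0\t202.578917\nbin.2\t0\t0\n"
print(zero_fraction(demo))  # bin.2 is zero in every sample -> 1.0
```

To run it on the real output, read the abundance table file into a string first; a bin that is zero in nearly all samples is only being recovered from one or two replicates.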

Has anyone else run into this or a similar issue, and what might be the cause? Do you think we may simply have sequenced a lot of contamination/low-quality DNA?

Thanks Matt

ursky commented 4 years ago

There could be a few problems here, and I am just taking a stab at it based on what I see; I would need to look at the actual data to know for sure.

It is possible that the contigs cannot be effectively binned because there is not enough information to bin them. However, seeing that you have quite a few replicates for differential coverage information, I doubt this is your issue.

More likely, your problem stems from a poor metagenomic assembly, which in turn likely stems from poor experimental/sequencing design. The metagenomes were probably sequenced too shallowly, resulting in very few long (3 kb+) contigs. What is the size of your assembly when only counting contigs >1 kb? >3 kb? >10 kb? If it drops off too sharply, that is alarming. Do you have any 100 kb+ contigs?

Keep in mind that sequencing depth is effectively a ratio between the number of reads you generated and the complexity of the microbial community (the more diverse the community, the deeper you need to sequence). So, assuming the worst-case scenario in which every replicate has its own unique strains/species, you need to count how many reads you have per replicate, not per experiment. For example, if I have 100 GB of sequencing data spread over 100 samples, that's not very good...
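The length-cutoff check suggested above is easy to script. A minimal sketch with plain-Python FASTA parsing; the file path and cutoff values are placeholders to adjust for your own assembly, not metaWRAP conventions:

```python
import os
import tempfile

def assembly_size_at_cutoffs(fasta_path, cutoffs=(1000, 3000, 10000, 100000)):
    """Total assembled bases contributed by contigs at or above each length cutoff."""
    lengths = []
    current = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
        if current:  # flush the last contig
            lengths.append(current)
    return {c: sum(n for n in lengths if n >= c) for c in cutoffs}

# Tiny demo: two contigs of 1,500 bp and 500 bp
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">c1\n" + "A" * 1500 + "\n>c2\n" + "A" * 500 + "\n")
    path = tmp.name
sizes = assembly_size_at_cutoffs(path, cutoffs=(1000,))
os.remove(path)
print(sizes)  # {1000: 1500}
```

Comparing the totals at 1 kb, 3 kb, and 10 kb against the full assembly size shows how sharply the assembly drops off, which is the symptom of shallow sequencing described above.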