BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Best way to deal with mixed bins? #163

Closed fwhelan closed 8 years ago

fwhelan commented 8 years ago

Good morning,

First, thank you for the organization and capacity of CONCOCT; it's been a very easy and fun tool to use! I have a question about following up on the complete example. I have 5 samples (I know the numbers are lower than suggested...) which CONCOCT splits into 25 bins. When I validate this against LMAT taxonomic assignments, I get decent precision:

N     M     TL          S   K   Rec.      Prec.     NMI       Rand      AdjRand
7642  7064  4.1047e+07  18  26  0.872287  0.947376  0.899887  0.965589  0.833922
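
For readers unfamiliar with the Rec. and Prec. columns, here is a minimal sketch of how clustering recall and precision are commonly computed from a bin-versus-label contingency table. It assumes the usual definitions (each bin is credited with its majority label, each label with its majority bin) and counts contigs rather than base pairs, so it may not match exactly how the numbers above were produced. The function name `precision_recall` and its arguments are hypothetical.

```python
from collections import Counter, defaultdict


def precision_recall(bin_of, label_of):
    """Return (recall, precision) of a contig clustering against reference labels.

    bin_of:   dict mapping contig ID -> bin ID
    label_of: dict mapping contig ID -> reference label (e.g. a species name)
    Only contigs present in both mappings are counted.
    """
    # Contingency tables, indexed both ways: bin -> label and label -> bin.
    by_bin = defaultdict(Counter)
    by_label = defaultdict(Counter)
    for contig, label in label_of.items():
        if contig in bin_of:
            by_bin[bin_of[contig]][label] += 1
            by_label[label][bin_of[contig]] += 1

    total = sum(sum(c.values()) for c in by_bin.values())
    # Precision: fraction of contigs carrying the majority label of their bin.
    precision = sum(max(c.values()) for c in by_bin.values()) / total
    # Recall: fraction of contigs sitting in the majority bin of their label.
    recall = sum(max(c.values()) for c in by_label.values()) / total
    return recall, precision
```

A bin holding two species lowers precision, while a species split across two bins lowers recall.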

I further wanted to check the contigs assigned to each bin with a tool to validate completeness/contamination. I'm aware that each bin isn't guaranteed to be complete; however, I am hoping to have my bins roughly equivalent to genomes and thus am hoping to decrease contamination.

For a few bins, I have high levels of contamination as calculated by CheckM, and judging by LMAT taxonomic assignments, I think these bins may consist of 2 genomes.

Anyway, this is basically a long-winded way of asking: how do you deal with bins with >1 organism? Is it kosher to re-run CONCOCT on a subset of contigs from the originating samples? Is there another workflow/protocol that you would suggest?

Thanks!!

alneberg commented 8 years ago

Good evening, ;)

thanks for the feedback! I've never heard of LMAT before, it looks interesting.

I would say the statistics look pretty good as well. Depending a little on your sample, I would assume there is a certain error rate associated with the taxonomic assignment too. That said, we do see quite frequently that some bins contain more than one genome, and we are working on tools to improve this.

We have tried running CONCOCT again on a subset, just as you are suggesting, with some success, but it's a little tedious and only works in some cases.
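
Preparing such a subset mostly means pulling the contigs of one mixed bin out of the inputs. A rough sketch, assuming the clustering output is a two-column table of contig ID and cluster ID, and that the coverage table is tab-separated with a header line and the contig ID in the first column. Those file layouts and all three helper names are assumptions for illustration, not part of CONCOCT.

```python
def contigs_in_bin(clustering_rows, bin_id):
    """Collect contig IDs assigned to bin_id from (contig_id, cluster_id) rows."""
    return {row[0] for row in clustering_rows if len(row) >= 2 and row[1] == bin_id}


def filter_fasta(lines, keep):
    """Yield the FASTA lines of records whose ID (first token after '>') is in keep."""
    writing = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Switch output on or off at every record header.
            writing = bool(fields) and fields[0] in keep
        if writing:
            yield line


def filter_coverage(lines, keep, sep="\t"):
    """Yield the header line plus every row whose first column is in keep."""
    for i, line in enumerate(lines):
        if i == 0 or line.split(sep, 1)[0] in keep:
            yield line
```

The rows can come from `csv.reader` over the clustering file, and the filtered FASTA and coverage lines can be written out and used as input for a second run on that bin alone.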

What we usually do is simply ignore the contaminated and/or incomplete bins and continue with the bins we judge to be good from the SCG (single-copy core gene) analysis.
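
In code, that kind of triage could look roughly like the sketch below. It assumes you already have per-bin completeness and contamination percentages (for example from CheckM or an SCG count) in a dictionary. The function name and the default cutoffs are arbitrary placeholders for illustration, not values recommended in this thread.

```python
def select_good_bins(quality, min_completeness=90.0, max_contamination=5.0):
    """Return the sorted IDs of bins that pass both cutoffs.

    quality maps bin ID -> (completeness %, contamination %).
    The default cutoffs are illustrative placeholders; pick your own.
    """
    return sorted(
        bin_id
        for bin_id, (completeness, contamination) in quality.items()
        if completeness >= min_completeness and contamination <= max_contamination
    )
```

Bins that fail either cutoff are simply left out of downstream analysis rather than repaired.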

Johannes

fwhelan commented 8 years ago

Hahaha, then, I think, a Good afternoon is in order now?

Thanks so much for your suggestions. I visualized some of the mixed bins with anvi'o and it makes a lot of sense why CONCOCT mixed them. For example, one bin contains what I estimate to be 3 Streptococcus species which are present at very similar abundances in all 5 of my input samples. I don't think this is a case of CONCOCT not performing well but instead of an under-powered input.

Thanks again! Fiona

alneberg commented 8 years ago

Indeed it is!

Yes, that would be a difficult situation for the algorithm. I hope you find a way to deal with it.

Cheers