mooreryan closed this issue 8 years ago
Completeness and contamination are percentages. Contamination >100% indicates that the recovered bin likely contains multiple organisms. For example, contamination of 800% indicates that, on average, each single-copy marker gene was observed 8 times! The 0, 1, 2, ..., 5+ columns reported by CheckM indicate the number of times each marker gene was observed.
Strain heterogeneity is more easily viewed as an index between 0 (no strain heterogeneity) and 100 (all markers present more than once appear to be from closely related organisms).
Thank you for the quick response...yes that makes sense!
Hello Mooreryan. What would an acceptable level of strain heterogeneity be?
The strain heterogeneity (SH) index indicates the proportion of the contamination that appears to be from the same or similar strains (as determined with an AAI threshold). As such, the primary concern is the amount of contamination and the SH index gives an indication of the source of the contamination (i.e., highly similar or more divergent organisms).
Hello Dparks. Do we need to consider the SH index when doing downstream analysis after binning? For example, if we get a bin with completeness over 90% and contamination below 10%, what should we do if the SH index is over 50%?
The SH index is worth considering, but isn't nearly as critical as the estimated percentage of contamination. If the SH index is high (ideally 100%), it suggests the majority of contamination is from very similar species, and thus any contamination is likely from the pangenome of the species being considered. Alternatively, if the SH index is very low (ideally 0%), this indicates all the contamination is likely from other species (perhaps very divergent species), and thus you may be able to identify it and remove it from the genome. If you wish to try to remove contamination, you can look at my companion tool RefineM (https://github.com/dparks1134/RefineM).
what is the minimum contamination to consider?
I generally consider MAGs with contamination < 10% or where completeness - 5*contamination > 50. See the following: https://www.nature.com/articles/s41564-017-0012-7 https://www.nature.com/articles/nbt.3893
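For anyone who wants to apply the two criteria above to a table of bins, here is a minimal sketch. The function names, the bin names, and the example values are hypothetical; only the two thresholds come from the reply above. Completeness and contamination are assumed to be CheckM percentages (so 2.0 means 2%).

```python
def quality_score(completeness, contamination):
    """Score from the reply above: completeness - 5 * contamination."""
    return completeness - 5.0 * contamination


def passes_filter(completeness, contamination):
    """True if a bin meets either criterion quoted above.

    Both inputs are assumed to be percentages as reported by CheckM.
    """
    low_contamination = contamination < 10.0
    good_score = quality_score(completeness, contamination) > 50.0
    return low_contamination or good_score


# Hypothetical bins: name -> (completeness %, contamination %)
bins = {
    "bin_a": (95.0, 2.0),
    "bin_b": (70.0, 12.0),
    "bin_c": (98.0, 870.28),
    "bin_d": (40.0, 8.0),
}

kept = [name for name, (comp, cont) in bins.items() if passes_filter(comp, cont)]
print(kept)
```

Note that bin_d passes only because of the contamination < 10% rule. If you want the stricter behaviour, filter on `quality_score(...) > 50` alone.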
tq for ur quick response dparks
Hi Donovan! What would you make of a mid-level SH, say 50? Would this mean that half of the contamination is from closely related organisms while the other half is from unrelated organisms?
On the surface, yes. However, you might want to play with the AAI threshold used to establish whether a duplicate gene should be considered to be from a closely related organism, to see how this impacts the results.
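To illustrate why the threshold matters, here is a toy sketch. It is not CheckM's actual implementation: it simply treats the SH index as the percentage of duplicated marker genes whose copies meet an AAI threshold. The function name, the example AAI values, and the 0.9 default are assumptions made for illustration.

```python
def strain_heterogeneity_index(aai_values, aai_threshold=0.9):
    """Toy SH index: percentage of duplicated markers whose copies
    have an AAI at or above the threshold (a simplification)."""
    if not aai_values:
        return 0.0
    similar = sum(1 for aai in aai_values if aai >= aai_threshold)
    return 100.0 * similar / len(aai_values)


# Hypothetical AAI values, one per duplicated marker gene in a bin
example_aai = [0.99, 0.97, 0.92, 0.85, 0.60, 0.45]

for threshold in (0.80, 0.90, 0.95):
    sh = strain_heterogeneity_index(example_aai, threshold)
    print(f"AAI threshold {threshold:.2f}: SH index {sh:.1f}")
```

With these made-up values, the same bin gives an SH index of about 67, 50, or 33 depending on the threshold, so a mid-level SH is worth re-checking at a couple of thresholds before drawing conclusions.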
Thank you that makes sense
Hi Donovan!
I have recently annotated a genome using PGAP, and its output also includes a CheckM result. In the contamination section, it shows 0.59. Does that mean a contamination of 0.6% or 60%?
Thank you in advance!
PGAP here. I see that our internal SQL view for bacterial assemblies, V_ChecmStats, contains values greater than 1 in the Contamination column, so the column holds percentages rather than fractions. I am almost sure that we do not renormalize any CheckM output parameters in our processing, so this value must be about 0.6%.
I had a question about contamination and strain heterogeneity. I ran CheckM on around 500 bins from a metagenome. The contamination ranges from 0 to 870.28, and the strain heterogeneity ranges from 0 to 100.
Are these statistics percentages, and if so, what does > 100% contamination mean? Also, is there any way you typically determine what an acceptable level of strain heterogeneity would be?