Ecogenomics / CheckM

Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes
https://ecogenomics.github.io/CheckM/
GNU General Public License v3.0

What is the scale of Contamination and Strain heterogeneity stats #65

Closed mooreryan closed 8 years ago

mooreryan commented 8 years ago

I had a question about contamination and strain heterogeneity. I ran CheckM on around 500 bins from a metagenome. The contamination ranges from 0 to 870.28, and the strain heterogeneity ranges from 0 to 100.

Are these statistics percentages, and if so, what does > 100% contamination mean? Also, is there any way you typically determine what an acceptable level of strain heterogeneity would be?

donovan-h-parks commented 8 years ago

Completeness and contamination are percentages. Contamination >100% indicates the recovered bin likely contains multiple organisms. For example, contamination of 800% indicates that, on average, each single-copy marker gene was observed 8 times! The 0, 1, 2, ..., 5+ columns reported by CheckM indicate the number of times each marker gene was observed.

Strain heterogeneity is more easily viewed as an index between 0 (no strain heterogeneity) and 100 (all markers present >1 appear to be from closely related organisms).

mooreryan commented 8 years ago

Thank you for the quick response...yes that makes sense!

XiaowuBioinformatics commented 6 years ago

Hello Mooreryan. What would an acceptable level of strain heterogeneity be?

donovan-h-parks commented 6 years ago

The strain heterogeneity (SH) index indicates the proportion of the contamination that appears to be from the same or similar strains (as determined with an AAI threshold). As such, the primary concern is the amount of contamination and the SH index gives an indication of the source of the contamination (i.e., highly similar or more divergent organisms).

XiaowuBioinformatics commented 6 years ago

Hello Dparks. Do we need to consider the SH index when doing downstream analysis after binning? For example, if we get a bin with completeness over 90% and contamination below 10%, what should we do if the SH is over 50%?

donovan-h-parks commented 6 years ago

The SH index is worth considering, but isn't nearly as critical as the estimated percentage of contamination. If the SH index is high (ideally 100%), it suggests the majority of contamination is from very similar species and thus any contamination is likely from the pangenome of the species being considered. Alternatively, if the SH index is very low (ideally 0%), this indicates all the contamination is likely from other species (perhaps very divergent species) and thus you may be able to identify it and remove it from the genome. If you wish to try to remove contamination, you can look at my companion tool RefineM (https://github.com/dparks1134/RefineM).

Devadas07 commented 5 years ago

what is the minimum contamination to consider?

donovan-h-parks commented 5 years ago

I generally consider MAGs with contamination<10% or where completeness - 5*contamination>50. See the following: https://www.nature.com/articles/s41564-017-0012-7 https://www.nature.com/articles/nbt.3893
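As a minimal sketch of the rule of thumb above, the check can be written as a small Python function. The function name and the example values are hypothetical; both inputs are assumed to be the percentages CheckM reports, and the two conditions are joined with "or" exactly as worded in the comment.

```python
def passes_quality_filter(completeness: float, contamination: float) -> bool:
    """Apply the rule of thumb from the thread to one MAG.

    Both arguments are percentages as reported by CheckM
    (e.g. 92.5 means 92.5%). The function name is hypothetical.
    """
    # Quality score as stated in the thread: completeness - 5 * contamination
    quality_score = completeness - 5 * contamination
    # Keep the bin if either condition from the comment holds
    return contamination < 10 or quality_score > 50


# Made-up example bins: (completeness %, contamination %)
for completeness, contamination in [(95.0, 2.0), (60.0, 12.0)]:
    print(completeness, contamination,
          passes_quality_filter(completeness, contamination))
```

A stricter pipeline might require both conditions at once; that would be a different policy from the one stated here.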

Devadas07 commented 5 years ago

tq for ur quick response dparks

luuuuuuuke commented 4 years ago

Hi Donovan! What would you make of a mid-level SH, say 50? Would this mean that half of the contamination is from closely related organisms while the other half is from unrelated organisms?

donovan-h-parks commented 4 years ago

On the surface, yes. However, you might want to play with the AAI threshold used to establish whether a duplicate gene should be considered to be from a closely related organism, to see how this impacts the results.
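To illustrate how the threshold can shift the index, here is a simplified sketch in Python. It is not CheckM's implementation: it assumes you already have one amino acid identity (AAI) value per pair of duplicated marker genes, and it treats the SH index as the percentage of those pairs at or above the threshold. The function name, the 0.9 default, and the AAI values are all made up for illustration.

```python
def strain_heterogeneity(pair_aai, aai_threshold=0.9):
    """Percentage of duplicated-marker pairs with AAI >= aai_threshold.

    Simplified, hypothetical stand-in for the SH index described in the
    thread. `pair_aai` holds AAI values between 0 and 1.
    """
    if not pair_aai:
        # No duplicated markers, so no contamination to classify
        return 0.0
    similar = sum(1 for aai in pair_aai if aai >= aai_threshold)
    return 100.0 * similar / len(pair_aai)


# Made-up AAI values for six duplicated marker pairs
aai_values = [0.99, 0.97, 0.92, 0.85, 0.60, 0.45]
for threshold in (0.8, 0.9, 0.95):
    print(threshold, round(strain_heterogeneity(aai_values, threshold), 1))
```

With these made-up values, a looser threshold counts more duplicates as closely related and raises the index, while a stricter one lowers it, which is why a mid-level SH is worth re-checking at more than one threshold.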

luuuuuuuke commented 4 years ago

Thank you that makes sense

RayanaFeltrin commented 1 year ago

Hi Donovan!

I have recently annotated a genome using PGAP, and its output also includes a CheckM result. In the contamination section, it shows 0.59. Does that mean a contamination of 0.6% or 60%?

Thank you in advance!

azat-badretdin commented 1 year ago

PGAP here. I see that our internal SQL view for bacterial assemblies, V_ChecmStats, contains values greater than 1 in the Contamination column, so the column is on a percentage scale rather than a 0 to 1 fraction. I am almost sure that we do not renormalize any CheckM output parameters in our processing, so this value must be 0.6%.