Ecogenomics / CheckM

Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes
https://ecogenomics.github.io/CheckM/
GNU General Public License v3.0

Contamination higher 100% #107

Closed fjell-dev closed 7 years ago

fjell-dev commented 7 years ago

Hi,

I would like to ask about some genomes we have analyzed (assembled from Illumina sequencing) for which the contamination percentage is over 100%, or even 200%. What does that mean in those cases? Does it mean that CheckM identified more than two species in the samples?

Thank you,

Best regards,

LC

donovan-h-parks commented 7 years ago

There are many cases that can lead to extremely high contamination. The simplest would be that the bin is a mixture of 2 or 3 species as you indicated. In this case the "strain heterogeneity" should be close to 100%. Alternatively, the bin may be a mixture of many (potentially divergent) genomes that are erroneously binned together. These "superbins" are not uncommon and generally occur for genomes with low coverage.

fjell-dev commented 7 years ago

Thanks a lot for your explanation. I partly understand how CheckM works after reading the paper, but I would like to ask you some questions for clarification:

The reason for these questions is that we are trying to interpret the CheckM output for our bioprospecting samples. In some cases CheckM reports a high level of contamination, and we are wondering whether we can identify the nature of that contamination: for example, whether a sample contains two strains that are very similar to each other or very different from each other. We would also like to know whether the CheckM results can be used to identify novel strains that are not yet in the reference database.

Thank you.

donovan-h-parks commented 7 years ago

Q1: Assuming complete genomes were assembled for both populations and placed into the same bin/genome, CheckM would report 100% contamination and 0% strain heterogeneity. The ratio of genetic material makes no difference to this calculation; what matters is the degree to which the two populations produce assembled contigs that are ultimately placed into the same bin/genome.
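
To make the "strain heterogeneity" figure more concrete, here is a minimal sketch, not CheckM's actual code. It assumes, based on my reading of the CheckM documentation, that strain heterogeneity is the share of multi-copy marker gene pairs whose amino acid identity (AAI) meets a threshold, taken here as 90%. The function name and the input values are hypothetical.

```python
def strain_heterogeneity(pair_identities, aai_threshold=0.90):
    """Percentage of multi-copy marker pairs that look like the same strain/species.

    pair_identities: amino acid identities (0..1), one per pair of copies of a
    marker gene that was found more than once in the bin (hypothetical input).
    aai_threshold: assumed cutoff above which two copies count as closely related.
    """
    if not pair_identities:
        return 0.0  # no multi-copy markers, so nothing to attribute to strains
    close = sum(1 for aai in pair_identities if aai >= aai_threshold)
    return 100.0 * close / len(pair_identities)


# Two divergent populations in one bin: duplicated markers share little identity.
divergent = strain_heterogeneity([0.55, 0.60, 0.40])
# Two near-identical strains in one bin: duplicated markers are almost the same.
related = strain_heterogeneity([0.99, 0.98])
```

Under these assumptions, the divergent case gives 0% and the closely related case gives 100%, independent of how much contamination is reported.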

Q2: This could indicate a number of different scenarios and does not necessarily imply a mixture of just two strains. For example, it could be a mixture of 3 or more strains all with similar genomic properties and coverage profiles. Quantitatively, contamination of 50% indicates that half the genes expected to be single-copy were identified twice in this bin.

Q3: Again, the amount of genetic material is not directly relevant beyond how it impacts the assembly and binning. If all 10 strains were completely assembled and placed into a single bin, the completeness would be 100% and the contamination would be 900%. For 7 strains (regardless of how closely related they are), completeness = 100% and contamination = 600%.
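
The arithmetic in Q2 and Q3 can be reproduced with a simplified sketch. It is an illustration only: it treats every marker gene independently, whereas CheckM itself groups markers into collocated sets, and the marker names and helper functions are hypothetical.

```python
from collections import Counter


def completeness_contamination(marker_hits, expected_markers):
    """Return (completeness %, contamination %) for one bin.

    marker_hits: list of marker gene names found in the bin, one entry per copy.
    expected_markers: set of genes expected to be single-copy.
    Simplification: each marker is weighted equally and independently.
    """
    counts = Counter(h for h in marker_hits if h in expected_markers)
    n = len(expected_markers)
    present = sum(1 for m in expected_markers if counts[m] >= 1)
    extra_copies = sum(counts[m] - 1 for m in expected_markers if counts[m] > 1)
    return 100.0 * present / n, 100.0 * extra_copies / n


MARKERS = {f"marker_{i}" for i in range(100)}  # hypothetical marker set


def merged_bin(n_genomes):
    """Bin holding n complete genomes: every marker appears n times."""
    return [m for m in MARKERS for _ in range(n_genomes)]


two = completeness_contamination(merged_bin(2), MARKERS)    # 100% / 100%
seven = completeness_contamination(merged_bin(7), MARKERS)  # 100% / 600%
ten = completeness_contamination(merged_bin(10), MARKERS)   # 100% / 900%

# Q2: half of the expected single-copy genes found twice gives 50% contamination.
half_dup = list(MARKERS) + sorted(MARKERS)[:50]
half = completeness_contamination(half_dup, MARKERS)
```

In this simplified model, N complete genomes in one bin always give 100% completeness and (N - 1) x 100% contamination, which is why values above 100% are possible.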

In general, closely related strains are a challenge. This is not really a limitation of CheckM, but of assembly and binning methods that often have trouble in such cases.

fjell-dev commented 7 years ago

Thanks again for your replies. They are indeed very clear and helpful. If I need to explain CheckM to a person without a bioinformatics background (for example, why we use it rather than other tools), is it correct to say that CheckM builds a profile of marker genes specific to the sample's population (for example, a sample with 1000 different strains), places them in a corresponding tree, and at the same time compares them to established reference marker genes? Is that what makes CheckM more accurate in predicting contamination? At least that is what I understand from reading the article. And would those specific marker genes somehow overlap with the "house-keeping" genes of the population? Thank you.

donovan-h-parks commented 7 years ago

No. I don't think that is correct. CheckM does not dynamically create new sets of marker genes based on the genomes you provide it. It takes your genomes and determines where they are in a reference phylogenetic tree. Based on the placement of a genome in the tree, it then determines the most appropriate set of marker genes to use to evaluate the completeness and contamination of this genome. This is repeated, independently, for each genome.

So, if you provide CheckM with a set of E. coli genomes it can determine this and will use an E. coli-specific set of marker genes. If you provide it with a set of highly novel strains (say, from a new phylum) it will evaluate these using a set of Bacteria-specific marker genes.
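
The selection logic described above can be sketched as follows. This is a hedged illustration, not CheckM's implementation: the lineage lists, gene names, and the `MARKER_SETS` table are hypothetical stand-ins for what CheckM derives from its reference genome tree.

```python
# Hypothetical precomputed marker sets, keyed by lineage. In CheckM these come
# from reference genomes, never from the genomes being evaluated.
MARKER_SETS = {
    "Bacteria": {"geneA", "geneB"},
    "Escherichia coli": {"geneA", "geneB", "geneC", "geneD"},
}


def select_marker_set(lineage, marker_sets=MARKER_SETS):
    """Pick the most specific lineage that has a marker set.

    lineage: taxa from broadest to most specific, as inferred from where the
    genome was placed in the reference tree (hypothetical representation).
    """
    for taxon in reversed(lineage):
        if taxon in marker_sets:
            return taxon, marker_sets[taxon]
    raise ValueError("no marker set available for this lineage")


# A genome placed among E. coli references gets the E. coli-specific set.
ecoli_choice = select_marker_set(["Bacteria", "Proteobacteria", "Escherichia coli"])
# A genome from a novel phylum falls back to the Bacteria-wide set.
novel_choice = select_marker_set(["Bacteria", "Novel phylum"])
```

Each genome is handled independently, so two bins from the same sample can end up being evaluated with different marker sets.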

fjell-dev commented 7 years ago

OK, now I understand. I think the parts that confused me were "Here we describe CheckM, an automated method for estimating the completeness and contamination of a genome using marker genes **that are specific to a genome’s inferred lineage within a reference genome tree**" and "Within CheckM, a gene identified as **single copy in ≥97% of genomes** is considered to be a marker gene.", from which I understood that CheckM would create a set of specific marker genes that appear as a single copy in 97% of the genomes in the dataset and use that set to determine the contamination and strain heterogeneity. Thanks a lot for clarifying that for me.
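
For readers who hit the same confusion, the quoted ≥97% criterion can be sketched as below. Going by the replies in this thread, it is applied to reference genomes within a lineage, not to the genomes a user submits. The function name and the input structure are hypothetical.

```python
def single_copy_markers(reference_gene_counts, threshold=0.97):
    """Genes present exactly once in at least `threshold` of the reference genomes.

    reference_gene_counts: list of dicts (gene name -> copy number), one dict
    per reference genome of a lineage (hypothetical input format).
    """
    n = len(reference_gene_counts)
    genes = set().union(*reference_gene_counts)
    markers = set()
    for gene in genes:
        single = sum(1 for counts in reference_gene_counts if counts.get(gene, 0) == 1)
        if single / n >= threshold:
            markers.add(gene)
    return markers


# 100 hypothetical reference genomes: geneX is single-copy in 98 of them,
# geneY in only 96 (duplicated in the remaining four).
refs = [
    {"geneX": 1 if i < 98 else 0, "geneY": 1 if i < 96 else 2}
    for i in range(100)
]
markers = single_copy_markers(refs)  # geneX qualifies, geneY does not
```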