Closed alexcritschristoph closed 6 years ago
Hey Alex.
I agree both definitions are sound and this can be a source of confusion. In your example, CheckM will report 100% contamination. The CheckM documentation and output tables try to make it clear that the percentage is relative to the number of expected SCGs, which is itself a further source of confusion: a genome with 30 SCGs identified once and 10 SCGs identified twice, when 100 SCGs are expected in a complete genome, will be reported as having 10% contamination, not 10/40 = 25% or 10/30 = 33%. All three of these are reasonable definitions of contamination, and each requires one to understand the definition being applied.
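To make the arithmetic concrete, here is a toy calculation of the three definitions using the numbers above (illustrative only, not CheckM code):

```python
# Hypothetical marker counts from the example above.
expected = 100       # SCGs expected in a complete genome
single_copy = 30     # SCGs identified exactly once
duplicated = 10      # SCGs identified exactly twice

observed = single_copy + duplicated   # 40 distinct SCGs found

# CheckM-style: duplicates relative to the number of *expected* SCGs
contam_expected = duplicated / expected      # 10/100 = 0.10 -> 10%

# Alternative 1: duplicates relative to distinct SCGs observed
contam_observed = duplicated / observed      # 10/40  = 0.25 -> 25%

# Alternative 2: duplicates relative to single-copy SCGs
contam_single = duplicated / single_copy     # 10/30 ~= 0.33 -> 33%
```

All three numbers describe the same bin; only the denominator changes.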
CheckM contamination results should be viewed in the context of the estimated completeness. I find this natural since a 50% complete genome with 10% contamination means you have about half of the genome you are after, but 10% of some other genome (or mix of genomes). This is a lot of contamination given that you only have 50% of the target genome.
I think this is how biologists typically think about the estimates, while bioinformaticians and computer scientists often think in terms of the first definition you gave. That said, all three definitions are acceptable, so one really needs to understand how to interpret the data.
Cheers, Donovan
Hi Donovan,
Thank you for the fast response! I really appreciate you both clearing up that issue and explaining further - I only recently fully understood that contamination is a percentage of expected SCGs.
It is an interesting perspective. I see the value in it (and changing things doesn't seem very possible because of all of the recent standardization efforts based on current guidelines), but I wonder if then it would also be possible to report the alternative metric(s) on their own.
The two I'm most interested in are "The probability that a given contig in this assembly is divergent contamination" (inferred by # of SCGs duplicates w/ PID < 90% / total # of observed SCGs) and "The probability that a given contig is strain-based contamination" (inferred by # of SCGs duplicates w/ PID > 90% / total # of observed SCGs).
I think these are very biologically meaningful and actionable metrics with clear implications for results - basically they are the very naive probability that you're wrong about whether any given gene truly belongs in that assembly, and the strain versus non-strain distinction gets at what we mean by "wrong".
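Naively, the two metrics could be computed from per-marker duplicate identities like this (a sketch; the function name, inputs, and the 90% cutoff are my assumptions, not anything CheckM outputs):

```python
def naive_contamination_split(duplicate_pids, n_observed_scgs, pid_cutoff=90.0):
    """Split duplicated SCGs into 'strain' vs 'divergent' contamination by
    percent identity between copies. Naive and illustrative only."""
    strain = sum(1 for pid in duplicate_pids if pid >= pid_cutoff)
    divergent = len(duplicate_pids) - strain
    return (divergent / n_observed_scgs,   # P(contig is divergent contamination)
            strain / n_observed_scgs)      # P(contig is strain contamination)

# e.g. 4 duplicated SCGs among 40 observed, two above and two below 90% PID
print(naive_contamination_split([98.5, 95.0, 72.3, 85.1], 40))
# -> (0.05, 0.05)
```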
However, they don't fit into the recently established guidelines for MAGs so I understand that they can't replace existing metrics. Not even making a formal request for such a feature, just wanted to chat about it.
Thanks for your time, Alex
Hey Alex,
I think some caution is warranted in reading too much into the 90% PID threshold. I hope the 'strain heterogeneity' report by CheckM is a useful reference point, but it should probably be taken with a lot of caution. While 90% AAI indicates similar proteins, it is debatable if it indicates the same species since this will depend on lineage-specific rates of evolution for each marker gene. Maybe more to the point, an 80% threshold might be more appropriate in many cases! I haven't done any work to try and establish what a suitable threshold for each gene might be if one is really keen to distinguish 'strain contamination' from 'more divergent contamination'.
As a reference, the 90% threshold was motivated by the following work and in particular Figure 3: http://enve-omics.gatech.edu/sites/default/files/2014-Rodriguez_R-Konstantinidis_Microbe_Magazine.pdf
However, this figure indicates the average over all proteins in a pair of genomes and the distribution around this average is surprisingly large.
Cheers, Donovan
Thanks, this is a great reference. I think the 90% threshold definitely has some utility, but understand it's not an absolute.
This is all from the perspective of me worrying that reviewers or colleagues will read a paper about a MAG, see a 4% contamination metric, assume that it naively means 0.04 * 5 Mbp = 200 Kbp of "contaminating" DNA, and think "200 Kbp of this genome could be from any other microbe in that environment! That could explain all of the interesting genes they report".
You and I know that this misinterprets the CheckM metric on several fronts, but unfortunately I'm not sure this is always communicated well to readers... it somewhat relies on the authors of individual papers explaining the meaning of the metric in their supplemental material to reduce confusion on the issue.
Thanks for your thoughts on the issues. (also, you can close it)
Hey Alex. This issue of the interesting genes being the contamination is a concern. After all, contaminating genes can often seem interesting precisely because they are outside the expected phylogenetic context. I agree the AAI threshold is very helpful in this regard and is the rationale for having some measure of strain heterogeneity.
Not to be too self promoting, but you might be interested in my tool RefineM (https://github.com/dparks1134/RefineM) and in particular the plots that can be created with the "outliers" command. You can see an example of how we used this in the Supp. Material of our manuscript at http://science.sciencemag.org/content/350/6259/434, where we are looking at methanogenesis in some novel MAGs. Figs. S3 and S4 plot the location of the contigs containing methane metabolism-associated genes to show that these contigs are "well centered" in the genomes with respect to GC content, tetranucleotide signatures, and coverage.
Hi Donovan, Just a quick related question that I didn't want to open a new issue for. If an SCG is split between two contigs, called as a gene on both, and both halves hit the HMM, would CheckM count this as a duplicate? Or does it check for those cases by examining which part of the alignment each half hit (and recognizing that they do not actually overlap)?
Thanks! Alex
It will sanity check for this sort of case when it occurs between two adjacent genes on a contig. This happens somewhat frequently due to frameshifts and/or gene-calling errors. CheckM doesn't explicitly check for this when it occurs at the ends of two different contigs.
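That kind of sanity check (two hits covering largely non-overlapping regions of the same HMM) might look roughly like this; the coordinates and threshold are illustrative assumptions, not CheckM internals:

```python
def hits_look_like_split_gene(hit_a, hit_b, max_overlap=10):
    """Return True if two HMM hits cover largely non-overlapping regions of
    the same model, i.e. they are plausibly two halves of one gene rather
    than a true duplicate. Illustrative sketch only, not CheckM's code."""
    (a_start, a_end), (b_start, b_end) = hit_a, hit_b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    return overlap <= max_overlap

# Two halves of one marker: model positions 1-160 and 155-310
print(hits_look_like_split_gene((1, 160), (155, 310)))   # True (overlap = 5)
# Two full-length copies: a genuine duplicate
print(hits_look_like_split_gene((1, 300), (5, 310)))     # False (overlap = 295)
```

Applying the same logic across contig ends would require pairing partial genes at contig boundaries, which is where, as noted, no explicit check is done.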
Hi Donovan, Thanks for the explanation. Just one more (I swear! thanks for bearing with me) - could two genes (or marker sets) from the same contig contribute to contamination and completeness if one is a duplicate? For example, if gene A on one end of a contig is a duplicate of some SCG elsewhere in the bin, but gene B on the other end is not a duplicate, would gene B contribute to the completeness while gene A contributes to the contamination?
Not sure I follow you completely. Any gene identified multiple times will contribute to the contamination estimate regardless of where the copies are located. The only exception is adjacent genes on a contig that span different portions of the marker gene; in that case I assume there was a frameshift, a gene-calling error, or a valid gene duplication, since contamination is highly unlikely to produce this pattern. Otherwise, the same gene appearing twice on the same contig will contribute to contamination. This might be valid gene duplication, but that is unclear without more context (i.e., several closely related genomes all having this gene in multiple copies). In all cases, the gene will also count towards the completeness estimate since it was identified at least once.
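So under a single naive marker set, the counting would work like this (an illustrative sketch; CheckM actually uses collocated marker sets, not a flat list):

```python
from collections import Counter

def naive_completeness_contamination(scg_hits, expected_scgs):
    """Naive single-marker-set estimates: a gene seen >= 1 time counts
    toward completeness; each extra copy counts toward contamination.
    Illustrative only - CheckM's real calculation uses marker sets."""
    counts = Counter(scg_hits)
    completeness = sum(1 for c in counts.values() if c >= 1) / expected_scgs
    contamination = sum(c - 1 for c in counts.values()) / expected_scgs
    return completeness, contamination

# Gene A duplicated, gene B single-copy, out of 10 expected markers:
# A contributes to both estimates, B only to completeness.
print(naive_completeness_contamination(["A", "A", "B"], 10))  # (0.2, 0.1)
```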
I think you followed what I was getting at. Do you encounter a lot of "chimeric" contigs in the metagenomic systems you work with? If contigs aren't chimeric, I don't think it makes sense for genes on one contig to contribute to both contamination and completeness - either all of the genes on the contig are part of the intended genome, or they are not.
It's not at all obvious to me how to tell whether any given contig contributes to contamination or to completeness, and it feels like a very suboptimal situation that you could pass in one complete, circular contig and still obtain a "contamination" estimate - what would the contaminant even be?
The quality of the contigs will depend on the assembler, the scaffolding method, and the complexity of the community. We have observed chimeric contigs, but they are hard to identify in general, so I don't really know how common they are. There is a rich literature on this, but most studies don't consider assemblies of metagenomic data, and assemblers change quickly, so it is hard to get a good sense of common issues.
If you can argue that certain duplicate genes are in fact real gene duplication, you can exclude these from the estimates provided by CheckM (see the options for the 'qa' command).
Hi Donovan,
Currently CheckM reports "contamination" as a percentage. However, I think people intuitively expect a contamination percentage to reflect (1) "the percentage of the genome (contigs or genes) that is possible contamination", whereas I believe the current percentage reflects (2) "the percentage of SCGs that are duplicated". Note that these two metrics differ: if you combined two genomes and measured the "contamination" of the combination, the first would return 50% (50% of the genes in the combination are from another genome), while the second would return 100% (100% of the SCGs are duplicated).
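As a toy calculation (all numbers hypothetical), merging two complete genomes gives:

```python
# Merge two complete genomes of n_genes each; every SCG ends up in two copies.
n_genes = 3000   # genes per genome (assumed)
n_scgs = 100     # SCGs expected in a complete genome (assumed)

# Definition (1): fraction of the merged bin that comes from "another" genome
frac_contaminant_genes = n_genes / (2 * n_genes)   # 0.5 -> 50%

# Definition (2): fraction of expected SCGs identified more than once
frac_duplicated_scgs = n_scgs / n_scgs             # 1.0 -> 100%
```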
Wouldn't it be more intuitive and interpretable for CheckM to report (1) and not (2)? I honestly believe a lot of people read these values in the literature and erroneously assume that they mean (1). Curious about your thoughts on this.
Thanks for your time, Alex