evaluating charcoal vs. checkm

taylorreiter commented 4 years ago

Both charcoal and CheckM provide contamination estimates. Running charcoal on 2000 Almeida MAGs, we see:

[Image: almeida_mags_1_charcoal_v_checkm]

which has an R^2 of 0.47 (47%). I'm not super concerned at the moment that we have different contamination estimates, esp. since the CheckM paper states:

Bias in genome quality estimates. Quality estimates based on individual marker genes or collocated marker sets exhibit a bias resulting in completeness being overestimated and contamination being underestimated (Figs. 1, 4). This bias is the result of marker genes residing on foreign DNA that are otherwise absent in a genome being mistakenly interpreted as an indication of increased completeness as opposed to contamination.

However, I would like to understand why there are so many genomes (465 of 2000) where charcoal estimates 0% contamination while CheckM estimates >0% contamination. Is this a database limitation (#81)? This also relates to #36.

Some stats for the genomes where CheckM is >0% and charcoal is 0%:

average missed_n = 0.7483871
average missed_bp = 2270.269
average f_ident = 0.7300365
average f_major = 0.978535

(this was all at the genus level)

On the plus side, there are 252 of 2000 genomes where both CheckM and charcoal estimate that there is no contamination.
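
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how the R^2 and the zero/non-zero counts could be computed from per-genome estimates. It is not the script used for the numbers above. The row layout and the keys `charcoal`, `checkm`, and `f_major` are assumptions made for illustration, and the toy rows are made up rather than taken from the Almeida MAGs. R^2 is taken here as the squared Pearson correlation, which matches the R^2 of a simple linear fit.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def compare_contamination(rows):
    """Summarize agreement between two contamination estimates (percent).

    `rows` is a list of dicts with hypothetical keys "charcoal" and "checkm".
    """
    charcoal = [r["charcoal"] for r in rows]
    checkm = [r["checkm"] for r in rows]
    # Genomes that both tools call clean.
    both_zero = [r for r in rows if r["charcoal"] == 0 and r["checkm"] == 0]
    # Genomes that charcoal calls clean but CheckM does not.
    discordant = [r for r in rows if r["charcoal"] == 0 and r["checkm"] > 0]
    return {
        "r_squared": pearson_r(charcoal, checkm) ** 2,
        "n_both_zero": len(both_zero),
        "n_discordant": len(discordant),
        "discordant": discordant,
    }


def column_mean(rows, key):
    """Average of one numeric field over a list of row dicts."""
    return sum(r[key] for r in rows) / len(rows)


# Made-up rows, only to show the call pattern.
toy = [
    {"genome": "a", "charcoal": 0.0, "checkm": 0.0, "f_major": 1.0},
    {"genome": "b", "charcoal": 0.0, "checkm": 1.5, "f_major": 0.98},
    {"genome": "c", "charcoal": 2.0, "checkm": 3.0, "f_major": 0.90},
    {"genome": "d", "charcoal": 0.0, "checkm": 0.5, "f_major": 0.96},
]
summary = compare_contamination(toy)
mean_f_major = column_mean(summary["discordant"], "f_major")
```

The same `column_mean` call could be repeated for `missed_n`, `missed_bp`, and `f_ident` if those fields are present in the rows.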

ctb commented 4 years ago

Can we identify the genes or contigs that CheckM says are contaminated?

taylorreiter commented 4 years ago

I'm not sure...that would be super helpful. Will dig!

taylorreiter commented 4 years ago

It looks like we can with the qa subcommand: https://github.com/Ecogenomics/CheckM/issues/68#issuecomment-226849895

QA command documentation: https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands#qa

This is also helpful/good to keep in mind: https://github.com/Ecogenomics/CheckM/issues/127#issuecomment-350887596

CheckM contamination results should be viewed in the context of the estimated completeness. I find this natural since a 50% complete genome with 10% contamination means you have about half of the genome you are after, but 10% of some other genome (or mix of genomes). This is a lot of contamination given that you only have 50% of the target genome.
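
One rough way to keep this in mind when comparing the two tools is to look at contamination relative to completeness. The sketch below is only an illustration of the arithmetic in the quote. The ratio and the function name are assumptions for this thread, not a metric that CheckM or charcoal reports.

```python
def contamination_relative_to_completeness(completeness_pct, contamination_pct):
    """Contamination expressed as a fraction of the recovered target genome.

    Illustrative ratio only (hypothetical helper, not a CheckM output).
    """
    if completeness_pct <= 0:
        raise ValueError("completeness must be > 0 to form a ratio")
    return contamination_pct / completeness_pct


# The case from the quote: 50% complete with 10% contamination.
ratio_half = contamination_relative_to_completeness(50.0, 10.0)
# The same contamination estimate for a fully complete genome.
ratio_full = contamination_relative_to_completeness(100.0, 10.0)
```

With these inputs the half-complete genome ends up with twice the relative contamination of the complete one, which is the point the quote is making.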

dib-lab / charcoal

evaluating charcoal vs. checkm #86