Open taylorreiter opened 4 years ago
Can we identify the genes or contigs that CheckM says are contaminated?
I'm not sure...that would be super helpful. Will dig!
It looks like we can with the qa subcommand:
https://github.com/Ecogenomics/CheckM/issues/68#issuecomment-226849895
QA command documentation: https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands#qa
This is also helpful/good to keep in mind: https://github.com/Ecogenomics/CheckM/issues/127#issuecomment-350887596
CheckM contamination results should be viewed in the context of the estimated completeness. I find this natural since a 50% complete genome with 10% contamination means you have about half of the genome you are after, but 10% of some other genome (or mix of genomes). This is a lot of contamination given that you only have 50% of the target genome.
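To make the arithmetic in that comment concrete, here is a minimal sketch that expresses contamination per unit of recovered genome. The ratio and the function name are assumptions for illustration only; this is not a metric that CheckM reports.

```python
def contamination_relative_to_completeness(completeness_pct, contamination_pct):
    """Contamination divided by completeness (illustrative ratio, not a CheckM output)."""
    if completeness_pct <= 0:
        raise ValueError("completeness must be positive")
    return contamination_pct / completeness_pct


# The case from the quoted comment: 50% complete, 10% contaminated.
half_genome = contamination_relative_to_completeness(50.0, 10.0)

# The same 10% contamination on a hypothetical 100% complete genome.
full_genome = contamination_relative_to_completeness(100.0, 10.0)

print(half_genome, full_genome)
```

With the same 10% contamination, the ratio is twice as large for the half-complete genome, which is the point the comment is making.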
Both charcoal and CheckM provide contamination estimates. Running charcoal on 2000 Almeida MAGs and plotting its estimates against CheckM's gives an R^2 of 0.47 (47%).
I'm not super concerned at the moment that we have different contamination estimates, esp. since in the CheckM paper they state
However, I would like to understand why there are so many genomes (465 of 2000) where charcoal estimates 0% contamination while CheckM estimates >0% contamination. Is this a database limitation (#81)? This also relates to #36.
Some stats for the genomes where CheckM is >0% and charcoal is 0%:
average missed_n = 0.7483871
average missed_bp = 2270.269
average f_ident = 0.7300365
average f_major = 0.978535
(this was all at the genus level)
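For anyone wanting to reproduce subset statistics like the ones above, a rough sketch follows. The records, the genome names, and the `checkm_contam` / `charcoal_contam` keys are hypothetical and the values are made up; only the four averaged column names come from this thread.

```python
from statistics import mean

# Hypothetical per-genome records; every value here is invented for illustration.
records = [
    {"genome": "g1", "checkm_contam": 1.2, "charcoal_contam": 0.0,
     "missed_n": 1, "missed_bp": 3000, "f_ident": 0.70, "f_major": 0.98},
    {"genome": "g2", "checkm_contam": 0.4, "charcoal_contam": 0.0,
     "missed_n": 0, "missed_bp": 1000, "f_ident": 0.80, "f_major": 0.96},
    {"genome": "g3", "checkm_contam": 2.0, "charcoal_contam": 1.5,
     "missed_n": 3, "missed_bp": 9000, "f_ident": 0.60, "f_major": 0.90},
    {"genome": "g4", "checkm_contam": 0.0, "charcoal_contam": 0.0,
     "missed_n": 0, "missed_bp": 0, "f_ident": 0.90, "f_major": 1.00},
]


def discordant(rows):
    """Keep genomes where CheckM reports >0% and charcoal reports exactly 0%."""
    return [r for r in rows
            if r["checkm_contam"] > 0 and r["charcoal_contam"] == 0]


def column_means(rows, columns):
    """Average each named column over the given rows."""
    return {c: mean(r[c] for r in rows) for c in columns}


subset = discordant(records)
averages = column_means(subset, ["missed_n", "missed_bp", "f_ident", "f_major"])
print(len(subset), averages)
```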
On the plus side, there are 252 of 2000 genomes where both CheckM and charcoal estimate there is no contamination.
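The agreement counts and the R^2 mentioned in this thread could be tallied along these lines. This is a sketch under assumptions: the thread does not say how R^2 was computed, so the squared Pearson correlation is used here (it equals the R^2 of a simple linear fit with an intercept), and the paired values are made up.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one sequence is constant")
    return (sxy * sxy) / (sxx * syy)


def tally_agreement(checkm, charcoal):
    """Count genomes by whether each tool reports zero or nonzero contamination."""
    counts = {"both_zero": 0, "checkm_only": 0,
              "charcoal_only": 0, "both_nonzero": 0}
    for c, h in zip(checkm, charcoal):
        if c == 0 and h == 0:
            counts["both_zero"] += 1
        elif c > 0 and h == 0:
            counts["checkm_only"] += 1
        elif c == 0 and h > 0:
            counts["charcoal_only"] += 1
        else:
            counts["both_nonzero"] += 1
    return counts


# Made-up contamination percentages for five genomes (illustration only).
checkm_est = [0.0, 1.0, 2.0, 0.0, 3.0]
charcoal_est = [0.0, 0.0, 1.5, 0.5, 4.0]

counts = tally_agreement(checkm_est, charcoal_est)
r2 = r_squared(checkm_est, charcoal_est)
print(counts, round(r2, 3))
```

In this thread's terms, `checkm_only` corresponds to the 465 genomes and `both_zero` to the 252 genomes.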