Ecogenomics / CheckM

Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes
https://ecogenomics.github.io/CheckM/
GNU General Public License v3.0

Contamination #165

Closed: quliping closed this issue 5 years ago

quliping commented 6 years ago

This is the first time I've used the software, so I don't know what the numbers in the "Contamination" column mean. For example, for bin 6, does that mean the contamination is 2154.89%? Why is it more than 100%? Or are these figures not percentages? [image: qq 20180809112827]

donovan-h-parks commented 6 years ago

Hello. The number is a percentage. A value >100% means the single-copy marker genes were each identified multiple times. On the first row, you can see that 56 of the marker genes were identified >=5 times. As such, the estimated contamination is >100%. It would appear this bin is actually a mix of 5 or more genomes. In fact, the ~2000% contamination suggests there are around 20 organisms' worth of marker genes in this bin (these might be 20 distinct genomes or an odd mixture of several dozen partial genomes).
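To make the arithmetic concrete, here is a minimal sketch of how extra copies of single-copy markers can push the estimate past 100%. It is a simplified reading of the marker-set approach described in the CheckM manuscript, not CheckM's actual code; the function name, marker names, and toy counts are all hypothetical.

```python
def estimate_contamination(marker_sets, copy_counts):
    """Percent contamination from extra copies of single-copy markers.

    marker_sets: list of sets of marker names (co-located marker sets).
    copy_counts: dict mapping marker name -> number of times it was found.
    Simplified sketch: each copy beyond the first counts as contamination,
    normalized within each marker set and then averaged across sets.
    """
    per_set = []
    for markers in marker_sets:
        extra = sum(max(copy_counts.get(m, 0) - 1, 0) for m in markers)
        per_set.append(extra / len(markers))
    return 100.0 * sum(per_set) / len(per_set)


# Hypothetical toy bin: two marker sets, every marker found 5 times.
marker_sets = [{"m1", "m2"}, {"m3"}]
copy_counts = {"m1": 5, "m2": 5, "m3": 5}
print(estimate_contamination(marker_sets, copy_counts))  # 400.0
```

Under this sketch, a bin holding five full copies of every marker scores 400%, and roughly 21 copies of every marker would be needed to reach about 2000%, which is in line with the "around 20 organisms" reading above.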

Devadas07 commented 5 years ago

What conclusion can I draw from these results? [image: Capture]

donovan-h-parks commented 5 years ago

I'd conclude all these genomes are close to complete with little to no contamination.

Devadas07 commented 5 years ago

Thanks for your quick response, dparks. One of my seniors asked me to remove two genomes, synechoccus_sp.JA_3_3Ab and synechoccus_sp.JA_2_3Ba_2_13, from my genome list for further studies; he says these two are contaminated. May I also ask what the difference is between P_Cyanobacteria (UID2143) and K_bacteria (UID1453)? [image: Capture2]

donovan-h-parks commented 5 years ago

UID2143 and UID1453 indicate the node in the CheckM reference tree which was used to establish the set of marker genes used to evaluate the quality of the genome. The k__Bacteria prefix indicates this node is above any named phylum. The p__Cyanobacteria prefix indicates the node is within the Cyanobacteria but above any named class. The exact number of marker genes in the marker sets is given in the # markers column. These genes are organized into co-located marker sets (see the CheckM manuscript for details). I'm not sure of the exact nature of your project, but these results suggest the synechoccus_sp.JA_3_3Ab and synechoccus_sp.JA_2_3Ba_2_13 genomes are 100% complete with little to no contamination.
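If it helps to read the lineage labels programmatically, here is a small sketch that splits one into its rank, taxon, and node ID. It assumes the label has the form `p__Cyanobacteria (UID2143)`, that is, a one-letter rank prefix, two underscores, the taxon name, and the UID in parentheses; the function name and the rank table are my own and are not part of CheckM.

```python
import re

# Assumed label format: "<rank letter>__<taxon> (UID<number>)"
LINEAGE_RE = re.compile(r"^(?P<rank>[a-z])__(?P<taxon>\S+)\s+\((?P<uid>UID\d+)\)$")

# Conventional one-letter rank prefixes (k = kingdom/domain level).
RANK_NAMES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_marker_lineage(label):
    """Return (rank name, taxon, UID) for a marker lineage label."""
    match = LINEAGE_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized marker lineage label: {label!r}")
    rank = RANK_NAMES.get(match.group("rank"), match.group("rank"))
    return rank, match.group("taxon"), match.group("uid")


print(parse_marker_lineage("p__Cyanobacteria (UID2143)"))
print(parse_marker_lineage("k__Bacteria (UID1453)"))
```

A genome scored against the phylum-level node is being compared with a more specific marker set than one scored against the kingdom-level node.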

Devadas07 commented 5 years ago

Thanks for your response, dparks. Does "little contamination" mean I should take this contamination into consideration or not? What are the upper and lower limits for contamination? And can you suggest how I should remove this contamination?

donovan-h-parks commented 5 years ago

Not sure I can provide much insight here. It depends on both your application and the source of the genomes. If these are isolate genomes, low levels (say <2%) of reported contamination are likely just inaccuracies in the CheckM estimates. If these are metagenome-assembled genomes, the situation is less clear. You can look at using MAGpurify, RefineM, or ACDC to remove contamination.
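As a rough illustration of applying a cutoff, here is a sketch that splits a tab-separated CheckM results table into kept and flagged genomes. It assumes the table has `Bin Id`, `Completeness`, and `Contamination` columns; the 2% contamination default echoes the isolate rule of thumb above, while the 90% completeness default, the function name, and the example rows are hypothetical and should be adapted to your application.

```python
import csv
import io


def flag_bins(tsv_text, max_contamination=2.0, min_completeness=90.0):
    """Split bins into (kept, flagged) lists using simple percentage cutoffs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept, flagged = [], []
    for row in reader:
        ok = (
            float(row["Contamination"]) <= max_contamination
            and float(row["Completeness"]) >= min_completeness
        )
        (kept if ok else flagged).append(row["Bin Id"])
    return kept, flagged


# Hypothetical rows; real input would be the tab-separated CheckM table.
rows = [
    ["Bin Id", "Completeness", "Contamination"],
    ["genome_A", "100.00", "0.27"],
    ["bin_6", "100.00", "2154.89"],
]
table = "\n".join("\t".join(r) for r in rows)

kept, flagged = flag_bins(table)
print("kept:", kept)
print("flagged:", flagged)
```

Flagged genomes are candidates for a closer look or for cleanup with a tool such as those named above, not automatic discards.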

Devadas07 commented 5 years ago

Thank you so much for this valuable information, sir. I am running CheckM to find the % contamination before further analysis of cyanobacterial genomes. If any contamination is found, I want to discard that genome from further analysis. I took these genomes from the NCBI FTP site. I am working on comparative studies of cyanobacterial species for pan-genome analysis.

DongYuan5177 commented 1 year ago

Hello! What conclusion can I draw from my results? [image: checkM]