Ecogenomics / CheckM

Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes
https://ecogenomics.github.io/CheckM/
GNU General Public License v3.0

Discrepancies between marker gene sets #161

Closed snayfach closed 6 years ago

snayfach commented 6 years ago

Hey Donovan - I have a few quick questions for you:

Why are there different marker gene sets for a given rank and clade:

p__Firmicutes (UID240)   1318   179   104
p__Firmicutes (UID240)   1318   178   104
p__Firmicutes (UID241)    930   213   118
p__Firmicutes (UID242)    830   244   129
p__Firmicutes (UID1022)   100   295   158
p__Firmicutes (UID1022)   100   292   155

Also, why do the # of markers differ within a gene set:

p__Firmicutes (UID1022) 100 295 158
p__Firmicutes (UID1022) 100 292 155

Finally, if I run checkm using markers generated with checkm taxon_set, the numbers differ yet again from those produced by the lineage-specific workflow:

-----------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id         Marker lineage   # genomes   # markers   # marker sets   0     1    2   3   4   5+   Completeness   Contamination   Strain heterogeneity
-----------------------------------------------------------------------------------------------------------------------------------------------------------
  ERS608563_33   Firmicutes (1)      **1349**        **172**            99        21   150   1   0   0   0       84.68            1.01               0.00
-----------------------------------------------------------------------------------------------------------------------------------------------------------

Thanks for your clarifications.

Best, Stephen

donovan-h-parks commented 6 years ago

Hey Stephen,

The UID numbers indicate the node in the tree for which a marker set was established. Multiple nodes are labelled Firmicutes because they fall above any defined class but are still within this phylum. The marker set reported for a given UID can also differ between genomes, since marker sets are further refined for expected legitimate gene duplication and loss, and this depends on the exact placement of a genome in the reference tree.
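To make the two effects easier to see, here is a minimal Python sketch that groups the rows from the question by UID. It assumes the three numeric columns are # genomes, # markers, and # marker sets, as in the lineage workflow table above; `group_by_node` is a hypothetical helper, not part of CheckM.

```python
from collections import defaultdict

# Rows copied from the question. Assumed column order:
# lineage label, UID, # genomes, # markers, # marker sets.
ROWS = """\
p__Firmicutes (UID240)  1318    179 104
p__Firmicutes (UID240)  1318    178 104
p__Firmicutes (UID241)  930 213 118
p__Firmicutes (UID242)  830 244 129
p__Firmicutes (UID1022) 100 295 158
p__Firmicutes (UID1022) 100 292 155
"""


def group_by_node(text):
    """Map each UID to the distinct (# markers, # marker sets) pairs seen for it."""
    by_uid = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        _taxon, uid, _genomes, markers, marker_sets = line.split()
        by_uid[uid.strip("()")].add((int(markers), int(marker_sets)))
    return dict(by_uid)


nodes = group_by_node(ROWS)
for uid, variants in nodes.items():
    # More than one pair under a UID means per-genome refinement changed the set.
    print(uid, sorted(variants))
```

The six rows collapse to four distinct nodes that all carry the Firmicutes label. UID240 and UID1022 each appear with two slightly different counts, which is the per-genome refinement at work.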

The taxon set command defines marker sets using the NCBI taxonomy. The NCBI taxonomy is not always congruent with the CheckM reference tree so marker sets can differ. The taxon set command also doesn't do any refinement for expected gene duplication and loss.

Cheers, Donovan

snayfach commented 6 years ago

Thanks, this all makes sense. But what do you mean when you say "the taxon set command doesn't do any refinement for expected gene duplication and loss"?

donovan-h-parks commented 6 years ago

When using "lineage_wf", the specific position of a genome in the CheckM reference tree is used to infer likely gene loss or duplication. Such genes are then removed from the set of marker genes for each genome independently. This refinement is not performed by "taxonomy_wf", since the position of a genome in the reference tree is never calculated.
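As a rough illustration of the difference (a toy sketch only, not CheckM's actual implementation), the refinement can be thought of as a per-genome set difference. The marker names, bin names, and the `refine_marker_set` helper below are all hypothetical.

```python
def refine_marker_set(node_markers, flagged_for_genome):
    """Drop markers predicted to be lost or duplicated for one genome."""
    return set(node_markers) - set(flagged_for_genome)


# Hypothetical marker genes defined for one node of the reference tree.
node_markers = {"marker_a", "marker_b", "marker_c", "marker_d"}

# Hypothetical predictions based on where each genome is placed in the tree.
flagged = {
    "bin_A": {"marker_c"},  # placement suggests marker_c is lost or duplicated
    "bin_B": set(),         # nothing flagged for this placement
}

# Lineage-style: each genome gets its own refined copy of the node's markers.
lineage_style = {b: refine_marker_set(node_markers, f) for b, f in flagged.items()}

# Taxonomy-style: no placement is computed, so every genome keeps the full set.
taxonomy_style = {b: set(node_markers) for b in flagged}

for b in flagged:
    print(b, len(lineage_style[b]), len(taxonomy_style[b]))
```

Here bin_A ends up with 3 markers under the lineage-style refinement while bin_B keeps all 4, and both keep 4 in the taxonomy-style case. That mirrors why two genomes assigned to the same UID can report slightly different marker counts.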

snayfach commented 6 years ago

Got it - thanks for the response