snayfach closed this issue 6 years ago
Hey Stephen,
The UID numbers indicate the node in the tree for which a marker set was established. Multiple nodes are labelled Firmicutes as they fall above any defined class, but are within this phylum. Genomes assigned to the same UID may still not have identical marker sets, since marker sets are also refined for expected legitimate gene duplication and loss, and this depends on the exact placement of a genome in the reference tree.
The taxon set command defines marker sets using the NCBI taxonomy. The NCBI taxonomy is not always congruent with the CheckM reference tree so marker sets can differ. The taxon set command also doesn't do any refinement for expected gene duplication and loss.
Cheers, Donovan
Thanks, this all makes sense. But what do you mean when you say "the taxon set command doesn't do any refinement for expected gene duplication and loss"?
When using "lineage_wf", the specific position of a genome in the CheckM reference tree is used to infer likely gene loss or duplication. Such genes are then removed from the set of marker genes for each genome independently. This refinement is not performed by "taxonomy_wf", since the position of a genome in the reference tree is never calculated.
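To make the difference concrete, here is a toy sketch in Python. It is not CheckM's code: the gene names, the UID, the placements, and both function names are hypothetical, and it simplifies things by starting both workflows from the same base set (in practice the taxon set comes from the NCBI taxonomy and may differ from the tree-based set to begin with). It only illustrates why two genomes under the same UID can end up with different numbers of markers.

```python
# Toy illustration only: all names and data below are hypothetical,
# not CheckM internals or real marker genes.

# Marker set established for one node (UID) in the reference tree.
NODE_MARKERS = {
    "UID_example": {"geneA", "geneB", "geneC", "geneD"},
}

# Genes expected to be legitimately lost or duplicated, keyed by where
# a genome is placed below that node (hypothetical placements).
EXPECTED_LOSS_OR_DUP = {
    "placement_1": {"geneB"},
    "placement_2": {"geneC", "geneD"},
}


def lineage_style_markers(uid, placement):
    """Refine the node's marker set using the genome's placement."""
    base = NODE_MARKERS[uid]
    # Drop markers that are expected to be lost or duplicated here.
    return base - EXPECTED_LOSS_OR_DUP.get(placement, set())


def taxon_style_markers(uid):
    """No placement is calculated, so the set is used without refinement."""
    return set(NODE_MARKERS[uid])


if __name__ == "__main__":
    for placement in ("placement_1", "placement_2"):
        markers = lineage_style_markers("UID_example", placement)
        print(placement, len(markers), sorted(markers))
    print("unrefined", len(taxon_style_markers("UID_example")))
```

In this sketch the two placements keep different subsets of the same node's markers, while the unrefined set keeps all of them, which mirrors the differing marker counts asked about in the original question.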
Got it - thanks for the response
Hey Donovan - I have a few quick questions for you:
Why are there different marker gene sets for a given rank and clade:
Also, why do the # of markers differ within a gene set:
Finally, if I run checkm using markers generated using checkm taxon_set, the #s differ yet again from those produced by the lineage-specific workflow:
Thanks for your clarifications.
Best, Stephen