Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

"closest_placement_reference" is missing #464

Closed. mars188 closed this issue 1 year ago

mars188 commented 1 year ago

Hello,

I have 40 bacterial MAGs that I used to create a tree with the "gtdbtk classify_wf" option. Here is the command I used:

gtdbtk classify_wf --genome_dir selected_genomes --extension fa --out_dir gtdb --cpus 24

GTDB-Tk placed most of my MAGs with their respective reference genomes; however, closest_placement_reference is missing for a few of my MAGs. How can I find them? Or do I need to run it differently?

Many thanks,

pchaumeil commented 1 year ago

Hi, the field closest_placement_reference is filled when the user genome is placed on a terminal branch leading to a representative genome (this representative genome is then considered the closest reference genome based on the tree topology). If pplacer places the genome 'high' enough in the tree (above multiple GTDB reference genomes), GTDB-Tk does not report a closest_placement_reference value. Hope this helps.
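To see which genomes are affected, something like the following Python sketch may help. It scans a tab-separated summary table and lists the genomes whose closest_placement_reference is empty. The column name user_genome, the "N/A" placeholder for empty fields, and the file name gtdbtk.bac120.summary.tsv mentioned in the comment are assumptions about the classify_wf output, so please check them against your own files. The table embedded in the code is made up for illustration.

```python
import csv
import io

# Assumption: empty fields are written either as "N/A" or as an empty string.
MISSING_VALUES = {"", "N/A"}


def genomes_without_placement_reference(handle):
    """Return the IDs of genomes with no closest_placement_reference.

    `handle` is any open text file object containing a tab-separated
    summary table with a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = []
    for row in reader:
        value = (row.get("closest_placement_reference") or "").strip()
        if value in MISSING_VALUES:
            # Assumption: the genome ID column is called "user_genome".
            missing.append(row["user_genome"])
    return missing


# Made-up, abbreviated table standing in for a real summary file.
example_tsv = (
    "user_genome\tclassification\tclosest_placement_reference\n"
    "MAG_01\tg__Enterobacter;s__Enterobacter hormaechei_A\tREF_GENOME_1\n"
    "MAG_02\tg__Enterobacter;s__\tN/A\n"
    "MAG_03\tg__Enterobacter;s__\t\n"
)

print(genomes_without_placement_reference(io.StringIO(example_tsv)))

# With real output, the call might look like this (path is an assumption):
# with open("gtdb/gtdbtk.bac120.summary.tsv", newline="") as fh:
#     print(genomes_without_placement_reference(fh))
```

Note that an empty value here only means the genome was placed above multiple reference genomes, as described above. It does not by itself indicate a failed run.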

zwets commented 1 year ago

@pchaumeil what interpretation (if any) could be given to this, and to incongruent placement in general? Does it suggest that the local tree topology may need revision?

I manage an interesting genome collection: 3,300+ isolates from various (cultured bacteria) sequencing projects in East Africa, which (of course!) I ran through GTDB-Tk. Now I'm struggling with what to make of the fact that 58 of my 64 Enterobacter_A hormaechei have incongruent placement.

About half are pplaced at Enterobacter quasihormaechei, the other half have no pplacer placement. Of the latter, 6 aren't hormaechei according to KmerFinder, they are cloacae. Likewise for my six E cloacae_M: four have no pplacer placement - and are called E asburiae by KmerFinder.

To complete the picture: I also have a cluster of 3 novel Enterobacter sp. (by de_novo_wf), as well as alleged E asburiae and E soli that fall outside GTDB ANI radii.

Are these discrepancies due to us sequencing strains that haven't (much) been seen before, or is this just the general (phylo)genomic messiness of the Enterobacteriaceae?

zwets commented 1 year ago

Maybe this ☝️ should move to the forum?

pchaumeil commented 1 year ago

I think you mean Enterobacter hormaechei_A, not Enterobacter_A hormaechei? E. hormaechei, E. hormaechei_A, and E. quasihormaechei are tightly clustered in the reference tree, and in these circumstances you can get disagreement between phylogenetic (tree) and similarity-based (ANI, k-mer) methods. Trees are the most reliable approach as they are based on an evolutionary model, but you can get artifacts if you create a de novo tree from thousands of closely related strains. Sorry that I don't have a more definitive answer.
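If it helps to get an overview of where the tree and the final call disagree, here is a rough Python sketch that compares the species in two taxonomy columns of a summary table. The column names classification and pplacer_taxonomy, and the user_genome ID column, are assumptions about the summary file layout. The rows are invented and use abbreviated taxonomy strings, so treat this as an illustration and not as how GTDB-Tk itself decides anything.

```python
import csv
import io


def species_of(taxonomy):
    """Return the species part of a GTDB-style taxonomy string.

    Ranks are separated by ";" and the species rank is prefixed "s__".
    Returns "" if the species rank is absent or left empty.
    """
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        if rank.startswith("s__"):
            return rank[3:]
    return ""


def tree_vs_final_disagreements(handle):
    """List (genome, final species, pplacer species) where the two differ."""
    reader = csv.DictReader(handle, delimiter="\t")
    disagreements = []
    for row in reader:
        # Assumed column names: "classification" and "pplacer_taxonomy".
        final = species_of(row.get("classification") or "")
        placed = species_of(row.get("pplacer_taxonomy") or "")
        if final != placed:
            disagreements.append((row["user_genome"], final, placed))
    return disagreements


# Invented rows with abbreviated taxonomy strings, for illustration only.
example_tsv = (
    "user_genome\tclassification\tpplacer_taxonomy\n"
    "iso_1\tg__Enterobacter;s__Enterobacter hormaechei_A"
    "\tg__Enterobacter;s__Enterobacter quasihormaechei\n"
    "iso_2\tg__Enterobacter;s__Enterobacter hormaechei_A"
    "\tg__Enterobacter;s__\n"
    "iso_3\tg__Enterobacter;s__Enterobacter cloacae_M"
    "\tg__Enterobacter;s__Enterobacter cloacae_M\n"
)

for genome, final, placed in tree_vs_final_disagreements(io.StringIO(example_tsv)):
    print(genome, "| final:", final, "| pplacer:", placed or "(no species)")
```

An empty pplacer species in this listing corresponds to the "no pplacer placement" case discussed above, so it is reported as a disagreement alongside genuine species conflicts.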