Hi,
The field closest_placement_reference is filled when the user genome is placed on a terminal branch leading to a representative genome (this representative genome is then considered the closest reference genome based on the tree topology).
If pplacer places the genome 'high' enough in the tree (above multiple GTDB reference genomes), GTDB-Tk does not report a closest_placement_reference value.
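To see which genomes ended up without a value, one option is to scan the classify_wf summary table. The sketch below is only an illustration: it assumes a tab-separated summary with the columns user_genome, classification and closest_placement_reference, and that an unreported value appears as "N/A" or an empty cell. The function name and the example rows are made up, so check the column names and the missing-value marker against your GTDB-Tk version.

```python
import csv
import io

# Assumption: GTDB-Tk writes "N/A" (or leaves the cell empty) when no
# closest placement reference is reported. Adjust if your version differs.
MISSING_VALUES = {"", "N/A"}


def genomes_without_placement_reference(tsv_handle):
    """Return (user_genome, classification) for rows lacking a reference.

    Hypothetical helper; column names are assumed, not taken from the thread.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    missing = []
    for row in reader:
        reference = (row.get("closest_placement_reference") or "").strip()
        if reference in MISSING_VALUES:
            missing.append((row["user_genome"], row.get("classification", "")))
    return missing


# Made-up table for illustration only (not real GTDB-Tk output).
EXAMPLE = (
    "user_genome\tclassification\tclosest_placement_reference\n"
    "mag_01\tg__GenusX;s__GenusX sp1\tREF_A\n"
    "mag_02\tg__GenusX;s__\tN/A\n"
    "mag_03\tg__GenusY;s__GenusY sp2\tREF_B\n"
)

for genome, classification in genomes_without_placement_reference(
    io.StringIO(EXAMPLE)
):
    print(genome, classification, sep="\t")
```

For real output, pass an open file handle instead of the in-memory example, for instance the bacterial summary TSV inside the directory given to --out_dir (the exact file name depends on the GTDB-Tk version). Genomes listed this way were placed above several reference genomes, so their classification column is the place to look for how far down the taxonomy they could be resolved.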
Hope this helps
@pchaumeil what interpretation (if any) could be given to this, and to incongruent placement in general? Does it suggest that the local tree topology may need revision?
I manage an interesting genome collection: 3,300+ isolates from various (cultured bacteria) sequencing projects in East Africa, which (of course!) I ran through GTDB-Tk. Now I'm struggling with what to make of the fact that 58 of my 64 Enterobacter_A hormaechei have incongruent placement.
About half are pplaced at Enterobacter quasihormaechei; the other half have no pplacer placement. Of the latter, 6 aren't hormaechei according to KmerFinder but cloacae. Likewise for my six E cloacae_M: four have no pplacer placement and are called E asburiae by KmerFinder.
To complete the picture: I also have a cluster of 3 novel Enterobacter sp. (by de_novo_wf), as well as alleged E asburiae and E soli that fall outside GTDB ANI radii.
Are these discrepancies due to us sequencing strains that haven't (much) been seen before, or is this just the general (phylo)genomic messiness of the Enterobacteriaceae?
Maybe this :point_up: should move to the forum?
I think you mean Enterobacter hormaechei_A, not Enterobacter_A hormaechei? E. hormaechei, E. hormaechei_A, and E. quasihormaechei are tightly clustered in the reference tree, and in these circumstances you can get disagreement between phylogenetic (tree) and similarity-based (ANI, k-mer) methods. Trees are the most reliable approach as they are based on an evolutionary model, but you can get artifacts if you create a de novo tree from thousands of closely related strains. Sorry that I don't have a more definitive answer.
Hello,
I have 40 bacterial MAGs that I used to create a tree with the "gtdbtk classify_wf" workflow. Here is the command I used:
gtdbtk classify_wf --genome_dir selected_genomes --extension fa --out_dir gtdb --cpus 24
GTDB-Tk placed most of my MAGs next to their respective reference genomes; however, closest_placement_reference is missing for a few of my MAGs. How can I find them? Or do I need to run it differently?
Many thanks,