AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Cleaning Trees with TreeShrink #54

joshuakirsch closed 2 years ago

joshuakirsch commented 2 years ago

Hi,

Sorry for sending so many messages recently. The trees produced by this program are great, but there are obvious outliers that detract from the overall picture the trees give. I used this program, TreeShrink, to detect and remove the outlier leaves. My question is: should I rebuild the tree from scratch, leaving out the known outlier genomes, or is the tree file output by TreeShrink usable as is?

Thanks!

AstrobioMike commented 2 years ago

Hey there, Joshua!

Don't be sorry at all! :)

I think it depends on the level we are operating at and whether we are trying to say something highly specific. For example, when I make a broad-level tree, something spanning many phyla or a whole domain, where we typically have hundreds to thousands of genomes, the placement of any individual taxon usually isn't that important (other than any new ones we might be adding to the tree to see where they land, but that's a different story). In cases like that, there are indeed always a few that are obviously out of place or just have really long branches, almost certainly because the source genome's assembly is poor quality and/or contaminated. I have no hesitation in literally just trimming them out in a tree visualization program. So in the same vein, in my opinion it's totally fine to just take the tree file output from TreeShrink (which I didn't know about, by the way; it seems like a great way to do this systematically, thanks!).
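For spotting the "really long branches" case before trimming by eye, here is a rough sketch in Python. To be clear, this is not TreeShrink's method; it is a much cruder stand-in that flags tips whose terminal branch is many times longer than the median terminal branch. The function name `long_branch_tips`, the `factor` threshold, and the toy tree are all made up for illustration, and the regex assumes a plain Newick string with unquoted tip labels and branch lengths on every tip.

```python
import re
from statistics import median

# A tip is a label that directly follows "(" or "," and carries ":length".
# Internal-node labels follow ")" and are therefore not matched.
LEAF_RE = re.compile(r"[(,]\s*([^():,;\s]+):([0-9.eE+-]+)")

def long_branch_tips(newick, factor=10.0):
    """Return tip labels whose terminal branch exceeds factor x the median."""
    tips = [(name, float(length)) for name, length in LEAF_RE.findall(newick)]
    cutoff = factor * median(length for _, length in tips)
    return [name for name, length in tips if length > cutoff]

# Toy tree (hypothetical): tip D sits on a suspiciously long branch.
tree = "((A:0.1,B:0.12):0.05,(C:0.09,D:2.5):0.04,E:0.11);"
print(long_branch_tips(tree))
```

The flagged labels are only candidates to look at; whether to drop them is still a judgment call, and a dedicated tool like TreeShrink is the more systematic option.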

If it's a smaller tree, covering less diversity and only a few dozen sequences (there's no real numeric cutoff here), then we are getting closer to the realm where any given sequence may have a larger impact on the tree topology as a whole, and I'd be more inclined to remove the outliers from the input list and then remake the tree. (Incidentally, making it easier to do this, and to iterate on phylogenomic trees as easily as single-gene trees, is part of what made me put GToTree together in the first place, ha.)
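The "remove them from the input list and remake the tree" route can be sketched in a few lines of Python. This assumes the input list is plain text with one entry per line and that the outlier labels match those entries exactly, which may not hold if tip labels were modified along the way. The function name `drop_outliers` and the accessions below are hypothetical placeholders.

```python
def drop_outliers(entries, outliers):
    """Return input entries not flagged as outliers, keeping original order."""
    flagged = {o.strip() for o in outliers}
    cleaned = (e.strip() for e in entries)
    # Skip blank lines and anything whose label is in the outlier set.
    return [e for e in cleaned if e and e not in flagged]

# Hypothetical accessions standing in for the lines of an input list.
accessions = ["GCF_000000001.1", "GCF_000000002.1", "GCF_000000003.1"]
kept = drop_outliers(accessions, ["GCF_000000002.1"])
print(kept)
```

The kept entries can then be written back out, one per line, and used as the new input list when rebuilding the tree.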

I'm pretty sure you were talking about a scenario like the first one I described, a large tree, where I think it's totally fine. But I just wanted to respond with my thoughts on both ends of the spectrum :)