FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0

prune low support branches and/or remove polytomies #140

Closed stubbsrl closed 3 years ago

stubbsrl commented 3 years ago

Would it be possible to add a feature to pxrmt that would allow me to prune nodes (and descendants) of low support rather than leaf names? That is, remove all branches and descendants on branches with less than, say, 50 BS.

Another way I was thinking about this: if I input a tree that already had low-support branches collapsed (by newick utils, for instance), could I then use pxrmt to prune all polytomies?

Either option would help. The idea would be to clean up some gene trees. The downstream analyses I want these gene trees for 1) don't accept polytomies and 2) don't take support values into account.

Thanks, Rebecca

josephwb commented 3 years ago

I like this idea, and have thought about this from time to time. I think I even have some proto-code laying around somewhere...

We need to be absolutely explicit about what is expected when polytomies reside deep in the tree (i.e., near the root) but have well-resolved descendant clades; surely not all of those should be pruned?

I will think about this a little. Thanks for bringing it up.

stubbsrl commented 3 years ago

Thanks for the quick response!

One of the things I would like to do with these cleaned gene trees is tally up triplets. Right now I'm getting wonky tallies that seem to go beyond ILS and gene flow, and my guess is that this is because I'm counting relationships that aren't actually statistically supported.

In regards to your point about how to deal with a polytomy in the backbone: I'm working with thousands and thousands of gene trees, so if there is a deeply nested polytomy and all tips are removed, I would just discard the tree. This works for me because 1) I have an excess of trees and 2) I'm working at a shallow evolutionary scale with only a handful of species. But if someone is working at a deeper evolutionary scale, and/or with fewer genes, then removing an entire tree is less than ideal.

I wrote a little layperson python script that finds polytomies and then selects one tip from the polytomy at random and deletes the rest (the problem is my script only works at the tips, not at deeper nodes). So, perhaps leaving one tip in a polytomy is a plausible option.
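A minimal sketch of that idea, assuming trees are represented as nested lists of children with strings as tips (illustrative only, not the actual script, which worked on newick files):

```python
import random

def sample_polytomies(node, rng=random):
    """Recursively reduce any polytomy (>2 children) to a single
    randomly chosen child lineage; binary nodes are left untouched."""
    if isinstance(node, str):          # a tip
        return node
    children = [sample_polytomies(c, rng) for c in node]
    if len(children) > 2:              # polytomy: keep one child at random
        return rng.choice(children)
    return children

# ((A,B),(C,D,E)): the (C,D,E) polytomy collapses to one random tip
tree = [["A", "B"], ["C", "D", "E"]]
print(sample_polytomies(tree))
```

Note this keeps one child lineage per polytomy, so a surviving child may itself be a whole clade rather than a single tip.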

Rebecca

josephwb commented 3 years ago

Could you supply a file containing a few newicks with support values, as well as expected output (ideally expected newicks, but verbal works too)?

In the 'keep one in' approach, is there a concern that a biased representative tip might be selected? Let's say you have 3 gene trees with the same topology as the one below, where F (or G) is as equally related to ((A,B),(C,D)) as is E. If a representative tip is chosen, say, alphabetically, then E will make it into more downstream analyses than F (or G). Is that a problem for the questions/methods you are pursuing?

Just trying to figure out the expected results.

                                                _____________ A
                               ________________|
                              |                |_________ B
      ________________________|
     |                        |                         ___________________ C
     |                        |________________________|
  ___|                                                 |________________________ D
 |   |
 |   |                 ________________________ E
 |   |                |
 |   |________________|___________ F
 |                    |
_|                    |________ G
 | 
 |___________________ H
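
To make that concern concrete, a toy simulation (hypothetical Python, not part of phyx; tip names taken from the polytomy in the tree above):

```python
import random

tips = ["E", "F", "G"]  # child lineages of the polytomy in the example tree

# Deterministic (e.g., alphabetical) selection always keeps E:
assert min(tips) == "E"

# Random selection spreads representation roughly evenly:
rng = random.Random(42)
counts = {t: 0 for t in tips}
for _ in range(3000):
    counts[rng.choice(tips)] += 1
# each tip is kept roughly a third of the time
```
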
stubbsrl commented 3 years ago

Yes, I started to write about the potential issues with the select-one approach, but then my post got way too long. In the triplets approach the empirical trees are compared against simulated trees which would have the same missing taxa (see Cai et al. 2020 Syst Biol), so that shouldn't be a problem. If the selected taxa could be truly random I think it would still be okay in network analyses (e.g., SNaQ). So, alphabetical wouldn't work. I was imagining something like: where polyt_list is a list of the tips in the polytomy, i = random.randrange(len(polyt_list)), and then replace the polytomy with polyt_list[i].

I have attached three sets of four real trees: the first is the original, the second has low-support branches collapsed into polytomies, and the third is a potential end result. The end result is the simplest solution, where all species in a polytomy are deleted but one. There are other, more complicated solutions; in the second tree, for instance, instead of leaving one lone species, the clade with four fully supported tips could have been selected from the polytomy.

R sample_trees.zip

josephwb commented 3 years ago

@stubbsrl fyi there is an available phyx program pxcolt which collapses internal edges with support below some threshold. Currently it only supports support values in [0,1], i.e., not integer values, but that is easy to fix. This seems like a decent starting point toward your goal.

EDIT: integer support values are fine, but the threshold itself has to be supplied as a proportion; i.e., if you want to set a threshold of 70, do -l 0.7. Maybe this is too annoying or not intuitive? I certainly forgot it!

stubbsrl commented 3 years ago

Oh cool thanks! I'll check it out today.

Rebecca

josephwb commented 3 years ago

Hi @stubbsrl. I just want to make sure you are not waiting on me for this. I hope to get to this soon, but am up to my eyeballs in other things at the moment.

stubbsrl commented 3 years ago

Hey @josephwb,

No worries - I gathered this wasn't happening right at this moment. Thanks for the follow up!

Rebecca

josephwb commented 3 years ago

Hi @stubbsrl. The example trees you sent over were all rooted. Is that guaranteed? I want to get this working with both rooted and unrooted trees, but am concentrating on the former for a first pass.

stubbsrl commented 3 years ago

Yes - for most analyses I need the trees rooted so they get rooted by the outgroup pretty early on.

josephwb commented 3 years ago

Ok, I made good progress on this today. Still dealing with edge cases; sometimes getting unanticipated results with nested polytomies.

To be clear, with this "sample polytomies until you are left with a binary tree" approach there is the possibility of losing large swathes of a tree. The way things work now (not yet pushed) is that 2 random lineages of every polytomy are selected to remain (the rest, purged). If you have, say, a well-resolved subtree, but it is descended from a polytomy, there is a chance that its stem edge will not be randomly selected, and thus it hits the cutting room floor. I have seen this in test cases.

Yeah, so this is not "intelligent" sampling (i.e., keeping as many well-resolved clades as possible), but it is "fair" in that all child lineages of a polytomy are given equal weight. It might be possible to do something "intelligent", but 1) that would take a ton more logic to work through, and 2) it wouldn't really be "random".
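The "keep 2 random child lineages per polytomy" rule can be sketched like so (pure Python on nested lists; illustrative only, not the actual phyx C++ implementation):

```python
import random

def binarize(node, rng=random):
    """Reduce every polytomy to exactly 2 randomly chosen child
    lineages, recursing into the survivors. A well-resolved subtree
    descended from a polytomy is discarded whenever its stem edge
    is not among the 2 survivors."""
    if isinstance(node, str):              # tip
        return node
    children = list(node)
    if len(children) > 2:                  # polytomy: keep 2 at random
        children = rng.sample(children, 2)
    return [binarize(c, rng) for c in children]

# a root polytomy of 4 lineages, one of which is a well-resolved clade:
tree = [[["A", "B"], ["C", "D"]], "E", "F", "G"]
# there is a real chance the whole ((A,B),(C,D)) clade is purged
result = binarize(tree)
```
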

Anyway, some stuff to think about.

stubbsrl commented 3 years ago

Yes, this loss of an entire tree due to low backbone resolution is a possibility. I have thought about this and to not lose too many trees I would use a fairly low cut-off for collapsing nodes. From there, I guess having trees to spare would also be helpful. It does seem a lot of data sets are moving in this direction.

I hadn't thought about "intelligent" sampling. Sounds cool, but I wonder if it would lead to a strong bias at shallow nodes. For instance, in the sample trees (which are just renamed from my real trees) C and D always form a nice clade, which is sometimes sister to E or F or (E,F). Whenever either E or F plus (C,D) are collapsed into a polytomy, using the keep well-resolved clade logic would always favor C + D and I would lose E and F from far more trees. Is that correct? If so, I think that bias may be problematic.

But, if we are talking about a backbone polytomy subtended by well-resolved clades then intelligent sampling would be the best solution. Thus, I have made this idea even more complicated (i.e., intelligent sampling for deep nodes, random sampling for shallow nodes), so I think random sampling is the way to go.

josephwb commented 3 years ago

hey again @stubbsrl I pushed the polytomy-sampler code. It didn't quite fit with our philosophy of "each program does one thing" to stick it in with pxcolt (the support-based edge collapser), so I moved it into its own program, pxpoly. (Plus, people may want to sample polytomies on trees with no support values.)

To try this out, you will need to do the following:

    git pull
    make clean
    autoreconf -fi
    ./configure
    make

(Your configure options may differ, and you may not want to make everything).

You can run it on the test tree I have been using:

    ./pxpoly -t TEST/polytomy.tre

You should see different results if you run it multiple times.

Because it is its own separate program, to carry out the pipeline of:

  1. collapse low-support edges
  2. sample resulting polytomies to generate a binary tree

you can pipe things. Say you have a file phy.tre. Do:

    ./pxcolt -t phy.tre -l 0.7 | pxpoly

and of course you can save the result by using the -o option.

I anticipate running into issues with applying to real data, but please be patient and report any problems here.

Erm, I think that is it for now,

stubbsrl commented 3 years ago

Oh this is very exciting! I will test it out on some trees and let you know.

Thanks for getting to this so quick and spending so much time on it!

stubbsrl commented 3 years ago

Well there are some odd things happening.

First, a question: is the random taxon that is kept in a polytomy re-drawn for every tree?

I set a seed for the runs so the errors are reproducible (nice feature btw).

There are two different types of problems I have seen. For one, with seed 12345 the program produces two trees with the same five taxa and then exits with the error:

    terminate called after throwing an instance of 'std::out_of_range'
      what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
    Aborted (core dumped)

In the other, with seed 54321 the program appears to get stuck and starts (erroneously) reducing every tree first to the same five taxa, then to three, and then to two. If I include more trees, it errors out as above within the next few trees.

For both, the trouble begins at tree 4, which is the first tree with a collapsed backbone. My guess is that this is the issue. The polytomy in tree 1 is successfully reduced with both seeds.

The problem trees are attached.

pxpoly_samples.zip

josephwb commented 3 years ago

Ok, thanks for this; this is very helpful.

Yes, the taxa sampled are random for each tree. As each tree is processed separately, there is no knowledge of what overlap there may be in taxa involved in polytomies across trees. If you do have that information, you can use pxrmt to specifically remove those terminals.

Looking at tree 4 now I see the result should be a 3-lineage tree (OG, and 2 from the ingroup, which could be individual tips or clades themselves) because of the location of the polytomy. I will check why it is erroring.

Hrm those last 3 trees do not contain any polytomies, so they should exit unscathed. I assume you have already collapsed poorly-supported edges (with maybe a threshold of 50%)?

OH CRAP I KNOW WHAT IS HAPPENING I WILL FIX IT RIGHT AWAY!

josephwb commented 3 years ago

hey @stubbsrl I just pushed a fix. Can you please pull and compile again to check? It runs correctly on those trees above (or, at least it does not die a horrible death).

stubbsrl commented 3 years ago

Great! It’s after 6pm here and I’ve left my house but I’ll get to this ASAP (which may not be until tomorrow I’m sorry!) and let you know.


stubbsrl commented 3 years ago

The program just ran flawlessly on ~1600 trees with 20 tips, all with varying amounts of troubled polytomies. It was glorious.

I ran the program on a set of trees with 160 tips and the order of operations proved to be important, but it otherwise still worked. Maybe worth adding to the -h? 1) pxcolt to collapse low-support edges, 2) pxrr to root with all outgroups, 3) pxpoly to reduce polytomies. If I root first and then collapse, pxpoly complains that the trees aren't rooted (which makes sense if collapsing changed something with the root), and if I then try to root again in response to that error, it core dumps.
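That ordering, as a sketch (file names and outgroup labels are placeholders, and options should be checked against each program's -h):

```shell
# 1) collapse edges with support below 0.5 (given as a proportion, not a percent)
pxcolt -t genetrees.tre -l 0.5 -o collapsed.tre
# 2) root on the outgroup(s)
pxrr -t collapsed.tre -g OG1,OG2 -o rooted.tre
# 3) reduce polytomies to produce binary trees
pxpoly -t rooted.tre -o binary.tre
```
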

Thanks for all your work!

josephwb commented 3 years ago

So glad it worked. I was afraid there would be remaining edge cases where things failed (there undoubtedly are). Rooted trees with a polytomy at the root are tricky because they look like unrooted trees.

The "order of operations" bit makes sense, i.e., it is expected behaviour. If you publish with this please include the ordering in case someone wants to replicate it. Since you noted you were dealing exclusively with rooted trees I only focused on them. I can go back now and consider the unrooted case, although I believe you would still require the 3 steps.

BTW, adding:

"It was glorious."

to the README (^_-)≡☆