danlwarren / RWTY

R We There Yet?
30 stars 14 forks source link

visualising clades in which multiple chains disagree #49

Open roblanf opened 9 years ago

roblanf commented 9 years ago

Here's an idea we just came up with over beers.

One thing that would be useful to know is, given any summary tree, which splits on that tree are most problematic. Obviously the posterior support is one very useful measure of this. But another really useful measure, if you have analysed >1 replicate chain, would be the standard deviation of split frequencies.

So, the obvious idea here is to take a summary tree (the MAP tree would be easy to extract in RWTY), and label each split in that tree with the standard deviation of split frequencies for that split. I can imagine two functions here:

  1. A function that takes any tree, and maps SDSFs onto that tree, and exports it in useful formats (e.g. phylip format with branch labels for the SDSFs).
  2. A function that wraps the function in (1) but does this for the MAP tree.

We might also consider attempting to plot the labelled trees for the output, either with numbers on the branches (e.g. we could have numbers with x/y, where x is the posterior support, and y is the SDSFs), and/or colours or branch thickness representing the SDSFs on that branch.

Back to beers.

NickCrouch commented 9 years ago

I have some code that could address this question, requires a small amount of work. Once I finish it, I would be happy to share if you would like to integrate into the package. Give me a couple days!

danlwarren commented 9 years ago

Cool! Rob and I have already talked through a way to do this that is quite easy given the existing RWTY code, so don't put too much effort into it unless you're just keen to play around with it. Basically the relevant stats are already being calculated in the compare.n function, and all you have to do is get SDSF for the list of clades in your supplied tree and paint nodes accordingly.

NickCrouch commented 9 years ago

If there is already an easy method, I don't think I can improve any. Would the method you have in mind identify specific taxa also?

danlwarren commented 9 years ago

As far as I can recall, we were only talking about internal nodes. Were you thinking of something to identify individual tips that were highly uncertain as well?

NickCrouch commented 9 years ago

This is what I had in mind previously (and have code for) - identify which taxa move around the tree the most, trying to get an idea of the diversity of sister taxa/clades. I wrote this with the intention of working with posterior tree sets, but it could be used to see whether, over the course of a run, taxa "stabilize" to a final location, or continue to move around the tree. Moving is not necessarily a bad thing if there is an an expectation that the taxa is difficult to place, but it could also be indicative of a potential problem with the alignment used

roblanf commented 9 years ago

That's an interesting idea. There's a reasonable literature on detecting so called 'rogue' taxa, that might be useful for identifying tips.

Here are a couple of recent papers that mention detection algorithms:

http://bioinformatics.oxfordjournals.org/content/30/9/1312.short http://sysbio.oxfordjournals.org/content/62/1/162.short

We should probably split this into two separate issues - one for identifying rogue taxa, and one for plotting SDSFs on internal branches.

On 10 August 2015 at 07:54, NickCrouch notifications@github.com wrote:

This is what I had in mind previously - identify which taxa move around the tree the most, trying to get an idea of the diversity of sister taxa/clades. I wrote this with the intention of working with posterior tree sets, but it could be used to see whether, over the course of a run, taxa "stabilize" to a final location, or continue to move around the tree. Moving is not necessarily a bad thing if there is an an expectation that the taxa is difficult to place, but it could also be indicative of a potential problem with the alignment used

— Reply to this email directly or view it on GitHub https://github.com/danlwarren/RWTY/issues/49#issuecomment-129250531.

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

NickCrouch commented 9 years ago

The literature on identifying rogue taxa is interesting, but I think that is a subtly different issue. The detection algorithms described in those papers are looking to detect tips that, if removed, will increase the confidence in the consensus trees produced. What I had in mind is much simpler, and just asks whether the taxa you have chosen (for the whole tree) have reached some level of stationarity in terms of their placement in the tree.

I am unsure still of the utility of this measure, but will briefly describe what I did. The method works by subdividing the .t file into sections, and then calculating the number of unique sister taxa for each species within each of the sections. The idea being that a run approaching stationarity will see a decrease in the number of unique taxa per section. I tested this by sampling 1000 phylogenies at equal intervals from a single run from an analysis I did in MrBayes, and plotting the mean number of unique species per section by generation: test_plot The sampling is extremely coarse, this analysis was run for 100 million generations, but it gives some impression for these data.

roblanf commented 9 years ago

Oh OK, I think that's kinda cool. My issue with the measure you have is that taking averages will mask a lot of the interesting signal, and leads to some problems in interpreting the data. E.g. if I include 20 species with identical sequences, my average metric will be really high even if the rest of the tree is completely resolved.

But the fix is very simple - instead of just plotting means, we could summarise the data (e.g. boxplots) or just plot all of it (e.g. jittered partially transparent points). So in this case we get a data point for each taxon.

In addition, I don't think (though correct me if I'm wrong) that there's any particular need to do this multiple times over the whole tree file. Once folks have chosen a burnin, I think it might be fine to just do the measurement once over the post-burnin trees. We could easily produce a nice plot, perhaps with taxa over some threshold labelled (or just give them a data frame of the taxa in rank order of the number of unique sister taxa too).

Finally, a practical question. I can see how you count sister taxa when a taxon is part of a cherry, e.g. with these two trees:

(a, (b, c))
(b, (c, a))

I can see that I count 2 sister taxa for taxon C. But how do I count for e.g. taxon a? A has a sister clade in the first tree, but a sister tip in the second. To illustrate it better, how do i count unique sisters for taxon a in this set of trees:

(a, (b, c))
(a, (b, (c, d)))
(a, (d, (b, c)))
(b, (c, a))

In this case - taxon d jumped in to the sister clade of a. But I don't know if that should be counted as a new unique sister taxon (or clade) because really it might just be that d is going crazy.

What did you do in your code?

NickCrouch commented 9 years ago

Completely agree with you first comment about plotting means, the code is there to do boxplots I just thought with my data the mean was clearer just to highlight what is going on in general.

You can just run it once on all the trees, but you won't see any change over time (if that is informative). The run times would be effectively the same for both running on all at once and in segments, but running both ways would obviously be longer. I think both approaches give the possibility of identifying specific taxa, in general i'm open to suggestions here.

In the example you gave for your last question, 'a' would have 3 distinct sister taxa - the middle two rows would count as 1. If a species is sister to a clade of 2 or more species, as long as the species composition remains the same, then the species within that clade can have any topology without counting as a unique pairing. Does that answer your question?

roblanf commented 9 years ago

Yeah, that answers the question. I think it's kind of a neat approach.

From my perspective, I'm not totally sure about plotting things out over the course of a run (I just can't figure out quite what it would tell you), but I can definitely see the benefit of looking at all the taxa in the tree relative to this metric.

Either way, we already have code to do sliding window and cumulative plots based on various metrics, so I don't think it should be a big deal to make the plots. It's just a case of whether we include them in the package.

I'm thinking that we could do something similar to the posterior probability plots, and just plot out the 20 taxa that are most variable, or something along those lines.

Do you have a function that will calculate the metric on a particular set of trees? If you do, and if you're willing to share it here, that would be a huge help.

On 21 August 2015 at 11:14, NickCrouch notifications@github.com wrote:

Completely agree with you first comment about plotting means, the code is there to do boxplots I just thought with my data the mean was clearer just to highlight what is going on in general.

You can just run it once on all the trees, but you won't see any change over time (if that is informative). The run times would be effectively the same for both running on all at once and in segments, but running both ways would obviously be longer. I think both approaches give the possibility of identifying specific taxa, in general i'm open to suggestions here.

In the example you gave for your last question, 'a' would have 3 distinct sister taxa - the middle two rows would count as 1. If a species is sister to a clade of 2 or more species, as long as the species composition remains the same, then the species within that clade can have any topology without counting as a unique pairing. Does that answer your question?

— Reply to this email directly or view it on GitHub https://github.com/danlwarren/RWTY/issues/49#issuecomment-133234738.

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

NickCrouch commented 9 years ago

If you want to perform a single analysis, then the function takes the arguments spp and multi.phy. multi.phy is an object of class multiPhylo, and can be from anywhere you want, a .t file, a sub-sampled set of trees etc.. spp is a character vector of length 1, i.e. a single tip. I then use sapply to pass how ever many tips I want to the function:

result <- sapply(all.species, unique_sister_taxa, multi.phy)

This does also mean I end up having to do this after:

result <- as.data.frame(t(result))
class(result[,1]) <- "numeric"
class(result[,2]) <- "numeric"

Which is because how I use the *pply functions I guess

NickCrouch commented 9 years ago

A potential output could be something like this

unique_sister_output

The number of unique taxa is plotted, and all taxa which have a value greater than the 95% confidence limit are listed, along with their value.