danlwarren / RWTY

R We There Yet?
30 stars 14 forks source link

makeplot.asdsf wrong values #39

Closed geneva closed 9 years ago

geneva commented 9 years ago

This only affects the develop branch but the ASDSF values calculated by this function are wonky. ASDSF values calculated elsewhere are unaffected. Working on a fix now.

geneva commented 9 years ago

It turns out this is more of an issue than I initially thought, and I need some advice on how to proceed...

The makeplot.asdsf function is calculating ASDSF correctly but the values returned are heavily dependent on the number of trees in each sliding window. I have been testing RWTY on a set of chains that took a long time to converge. A plot of ASDSF from the values reported by MrBayes looks like this:  image

This analysis ran for ~70 million generations, logging every 10,000 trees. MrBayes calculated ASDSF every 10,000 generations and we observe a slow approach to the arbitrary (but widely used) cutoff value of ASDSF < 0.01. In contrast, ASDSF calculated using makeplot.asdsf results in plots that look very different.

2 windows, min.freq = 0.1 (3500 trees per window)

image

10 windows, min.freq = 0.1 (700 trees per window)

image

50 windows, min.freq = 0.1 (140 trees per window)

image

100 windows, min.freq = 0.1 (70 trees per window)

image

500 windows, min.freq = 0.1 (14 trees per window)

image

Possible cause?

I suspect this is due to the far smaller number of trees used to calculate split frequencies, resulting in imprecise estimate of the frequencies of splits and leading to many splits being disregarded because they do not appear in at least 10% of the trees in any chain for a given window. My suspicion regarding the inaccurate split frequency estimation is supported by the fact that if I adjust min.freq to 0 the estimation of ASDSF more closely approximates the MrBayes result.

10 windows, min.freq = 0 (700 trees per window)

image

The compare.n function (which generates the clustering diagram and ASDSF values in the compare plot) does not suffer from this issue because it calculates ASDSF from all post burnin trees (in most cases thousands of trees).

Since MrBayes is calculating ASDSF from trees that are not all logged I don't think there is a way to calculate ASDSF that is equivalent. So how to we proceed? I can think of a few options (in ascending order of effort required):

  1. Drop the sliding window plot of ASDSF
  2. Continue to include a sliding window plot of ASDSF but include a warning that these values will not match the MrBayes output
  3. Add the ability to read MrBayes .mcmc output files to the load.trees function. These log files include the MrBayes calculation of ASDSF (the last column of any .mcmc file_ that uses unlogged trees.
roblanf commented 9 years ago

Hey @geneva,

I'm a bit confused here. A priori I wouldn't expect sliding window plots of ASDSF to look like the MrBayes plots. From memory, MrBayes calculates ASDSF from the last 75% of trees in the chain (roughly assuming that the first 25% is always burnin). So, we'd expect to see something similar in a cumulative ASDSF plot (after removing burnin), but not identical.

More broadly, I kind of agree that sliding window ASDSF is not very meaningful. I think your understanding of the problem looks right, and for that reason I'd vote to ditch the sliding window ASDSF plots.

geneva commented 9 years ago

I had always thought MrBayes was calculating ASDSF in a sliding windows and without discarding any trees as burnin. The fact that MB is using cumulative frequencies with some fixed percent of generations discarded as burnin explains the wide discrepancy between the sliding widow plots and MB output.

I am in agreement, lets discard the sliding window ASDSF plots. I will convert makeplot.asdsf to cumulative plot and we can see if that adds information/clarity relative to the already implemented variance plots.

roblanf commented 9 years ago

I (sadly, literally) woke up in the night last night and realised that there is a big difference between the plots I implemented, and the ASDSF plots. The plots I implemented are about a single chain. ASDSF are comparisons of >1 chain. So, for that reason we should keep both. However, if it's clearer we could always make the plots I made into line plots of e.g. the median or mean variance at each point.

I assume (though I haven't checked) that one can do the calculations for ASDSF across any number of chains. Is that what you've got working @geneva?

Another option might be to have a second compare.n plot with pairwise ASDSF plots, this might be useful for spotting cases where a single chain explores different bits of treespace from the rest.

danlwarren commented 9 years ago

Aha! That makes perfect sense then. Rob, I actually prefer the way you're making the variance plots. If it's not too much trouble, I think it would actually be nicer to do the ASDSF multi-chain plots the same way.

geneva commented 9 years ago

@roblanf sorry, I missed your comment yesterday. To answer your question, yes, the sliding window function makeplot.asdsf on the develop branch that I used to generate the sliding ASDSF examples above calculates ASDSF on any number of chains (with a minimum of two). I am still futzing with turning that into a cumulative plot, the hold-up is avoiding the need to repeatedly calculate split frequencies, which slows everything way down – maybe when split frequencies are calculated using faster code this will no longer be an issue.

@danlwarren - just so I understand, you are saying you'd like to see the ASDSF plot aesthetically match the variance plots? If so, no problem. The rough version I am working on creates plots like those in my 2nd comment above.

The other function I have added ASDSF to is compare.n on the master fork. This function generates the same two plots as before but now reports pairwise ASDSF in the matrix of compare plots (ASDSF is calculated on split frequencies for all post burnin trees). The second figure –  the chain clustering diagram –  uses pairwise ASDSF to generate the dendrogram of chain similiarity.

danlwarren commented 9 years ago

That was what I meant, yes. I haven't looked at how you're calculating your stuff, though, so maybe there isn't enough information to get the full distribution of split frequency SDs?

IIRC Rob and I dealt with the issue of repeatedly calculating split frequencies by calculating the cumulative frequency from the sliding window frequencies (i.e., the cumulative frequency at any step is the average of the sliding window frequency for that step and all previous steps). That way you're only doing one pass per window, and you're calculating frequencies over a much smaller number of trees. I think the same would work with the SDs, right? Basically:

  1. Get sliding window table of frequencies for each chain
  2. For sliding window SD plot, just go column by column and calculate SD per clade
  3. For cumulative SD plot, calculate posterior for a clade at each step by taking the average of that step and the columns to the left of that step, then calculate the SD of those from each chain

Would that work?

Obviously the most efficient thing to do would be to retain the tables from the posterior plots, but with the way we've got things set up now that's not trivial and would make the code much uglier.

geneva commented 9 years ago

Ok, got it. It shouldn't be much trouble to build the plots in the style of the variance plots.

The steps you lay out match what I have been working on. The recalculating split frequencies problem only arises if I try to match the MrBayes approach to ASDSF (cumulative plots retaining the last 75% of trees). In that scenario, the set of trees to be retained changes from window to window. I thought about discarding entire windows as burnin: for windows 1-3 simply calculate cumulative splot frequencies - discarding nothing, for windows 4-7 calculate cumulative frequencies discarding window 1, for 8-11 discard windows 1-2, etc...

I'll try to finish a version that matches the style of the variance plots and does not perform any burnin by the end of the night tonight or first thing tomorrow. From there we can decide if it is worth trying to incorporate burnin into this function.

danlwarren commented 9 years ago

Sounds awesome!

geneva commented 9 years ago

The new cumulative ASDSF plot is much closer to what I was expecting from a ASDSF plot. Let me know what you guys think.

Example: image

And here are the same chains from the MrBayes ASDSF output: image

danlwarren commented 9 years ago

Oh that looks awesome!