Closed geneva closed 9 years ago
It turns out this is more of an issue than I initially thought, and I need some advice on how to proceed...
The makeplot.asdsf
function is calculating ASDSF correctly but the values returned are heavily dependent on the number of trees in each sliding window. I have been testing RWTY on a set of chains that took a long time to converge. A plot of ASDSF from the values reported by MrBayes looks like this:

This analysis ran for ~70 million generations, logging every 10,000 trees. MrBayes calculated ASDSF every 10,000 generations and we observe a slow approach to the arbitrary (but widely used) cutoff value of ASDSF < 0.01. In contrast, ASDSF calculated using makeplot.asdsf
results in plots that look very different.
I suspect this is due to the far smaller number of trees used to calculate split frequencies, resulting in imprecise estimate of the frequencies of splits and leading to many splits being disregarded because they do not appear in at least 10% of the trees in any chain for a given window. My suspicion regarding the inaccurate split frequency estimation is supported by the fact that if I adjust min.freq
to 0 the estimation of ASDSF more closely approximates the MrBayes result.
The compare.n
function (which generates the clustering diagram and ASDSF values in the compare plot) does not suffer from this issue because it calculates ASDSF from all post burnin trees (in most cases thousands of trees).
Since MrBayes is calculating ASDSF from trees that are not all logged I don't think there is a way to calculate ASDSF that is equivalent. So how to we proceed? I can think of a few options (in ascending order of effort required):
load.trees
function. These log files include the MrBayes calculation of ASDSF (the last column of any .mcmc file_ that uses unlogged trees.Hey @geneva,
I'm a bit confused here. A priori I wouldn't expect sliding window plots of ASDSF to look like the MrBayes plots. From memory, MrBayes calculates ASDSF from the last 75% of trees in the chain (roughly assuming that the first 25% is always burnin). So, we'd expect to see something similar in a cumulative ASDSF plot (after removing burnin), but not identical.
More broadly, I kind of agree that sliding window ASDSF is not very meaningful. I think your understanding of the problem looks right, and for that reason I'd vote to ditch the sliding window ASDSF plots.
I had always thought MrBayes was calculating ASDSF in a sliding windows and without discarding any trees as burnin. The fact that MB is using cumulative frequencies with some fixed percent of generations discarded as burnin explains the wide discrepancy between the sliding widow plots and MB output.
I am in agreement, lets discard the sliding window ASDSF plots. I will convert makeplot.asdsf to cumulative plot and we can see if that adds information/clarity relative to the already implemented variance plots.
I (sadly, literally) woke up in the night last night and realised that there is a big difference between the plots I implemented, and the ASDSF plots. The plots I implemented are about a single chain. ASDSF are comparisons of >1 chain. So, for that reason we should keep both. However, if it's clearer we could always make the plots I made into line plots of e.g. the median or mean variance at each point.
I assume (though I haven't checked) that one can do the calculations for ASDSF across any number of chains. Is that what you've got working @geneva?
Another option might be to have a second compare.n plot with pairwise ASDSF plots, this might be useful for spotting cases where a single chain explores different bits of treespace from the rest.
Aha! That makes perfect sense then. Rob, I actually prefer the way you're making the variance plots. If it's not too much trouble, I think it would actually be nicer to do the ASDSF multi-chain plots the same way.
@roblanf sorry, I missed your comment yesterday. To answer your question, yes, the sliding window function makeplot.asdsf
on the develop branch that I used to generate the sliding ASDSF examples above calculates ASDSF on any number of chains (with a minimum of two). I am still futzing with turning that into a cumulative plot, the hold-up is avoiding the need to repeatedly calculate split frequencies, which slows everything way down – maybe when split frequencies are calculated using faster code this will no longer be an issue.
@danlwarren - just so I understand, you are saying you'd like to see the ASDSF plot aesthetically match the variance plots? If so, no problem. The rough version I am working on creates plots like those in my 2nd comment above.
The other function I have added ASDSF to is compare.n
on the master fork. This function generates the same two plots as before but now reports pairwise ASDSF in the matrix of compare plots (ASDSF is calculated on split frequencies for all post burnin trees). The second figure – the chain clustering diagram – uses pairwise ASDSF to generate the dendrogram of chain similiarity.
That was what I meant, yes. I haven't looked at how you're calculating your stuff, though, so maybe there isn't enough information to get the full distribution of split frequency SDs?
IIRC Rob and I dealt with the issue of repeatedly calculating split frequencies by calculating the cumulative frequency from the sliding window frequencies (i.e., the cumulative frequency at any step is the average of the sliding window frequency for that step and all previous steps). That way you're only doing one pass per window, and you're calculating frequencies over a much smaller number of trees. I think the same would work with the SDs, right? Basically:
Would that work?
Obviously the most efficient thing to do would be to retain the tables from the posterior plots, but with the way we've got things set up now that's not trivial and would make the code much uglier.
Ok, got it. It shouldn't be much trouble to build the plots in the style of the variance plots.
The steps you lay out match what I have been working on. The recalculating split frequencies problem only arises if I try to match the MrBayes approach to ASDSF (cumulative plots retaining the last 75% of trees). In that scenario, the set of trees to be retained changes from window to window. I thought about discarding entire windows as burnin: for windows 1-3 simply calculate cumulative splot frequencies - discarding nothing, for windows 4-7 calculate cumulative frequencies discarding window 1, for 8-11 discard windows 1-2, etc...
I'll try to finish a version that matches the style of the variance plots and does not perform any burnin by the end of the night tonight or first thing tomorrow. From there we can decide if it is worth trying to incorporate burnin into this function.
Sounds awesome!
The new cumulative ASDSF plot is much closer to what I was expecting from a ASDSF plot. Let me know what you guys think.
Example:
And here are the same chains from the MrBayes ASDSF output:
Oh that looks awesome!
This only affects the develop branch but the ASDSF values calculated by this function are wonky. ASDSF values calculated elsewhere are unaffected. Working on a fix now.