danlwarren / RWTY

R We There Yet?
30 stars 14 forks source link

Incorporating alternatives to Robinson-Foulds / path distances #169

Open ms609 opened 4 years ago

ms609 commented 4 years ago

Hi, I've recently been interrogating the RF distance and established that it is not an excellent measure of tree distance, leading me to propose some alternatives (paper, R implementation). Would you be open incorporating these alternative measures as alternatives to the RF in your functions to calculate tree space? I'd be happy to work on some pull requests if this would help. Cheers, Martin

roblanf commented 4 years ago

Hi Martin,

Always happy to incorporate more tree distances. At some point Michelle Kendall had contributed all of the tree distances in her package. Maybe @danlwarren knows where that went, I can't find it now!

We're working on version 2 of the package, and in that version all tree distances are computed when loading the trees. This happens in the tree.dist.matrix.R file.

One thing to note is that speed is of the essence here. Users won't wait a long time to just get a few plots of their MCMC output. In version 2 we subsample the input trees to a list of 1000 evenly spaced trees, meaning that there are 499,000 distances to calculate. So the code will need to be well optimised, and it's even better if you have properly optimised algorithms for calculating distance matrices, rather than just individual distances.

If you're interested I can send you some test files, and you can post some timings here.

Of note, I know that RF is imperfect, but I'm not totally convinced that it's shortcomings are an issue for what we're doing here. One of the main benefits of RF is that it's very very fast, because there are efficient algorithms for computing distance matrices. There are two key uses of tree distances in the package - one is to calculate ESS values for trees (see here: https://academic.oup.com/gbe/article/8/8/2319/2198217), and the other is to plot trees in tree space (or make videos like this: https://twitter.com/RobLanfear/status/1201438206366892032). If you can provide any more details on how/why your metrics improve over RF distances (or Path Difference distances) for these uses, that would be really useful too.

Cheers,

Rob

ms609 commented 4 years ago

Great! Yes, do send on some test files, it'll be interesting to compare running times -- I've done my best to optimise the running time of 'TreeDist' functions.

I totally hear what you say about the need for efficiency. But RF and path are among the worst-performing metrics in "spatial" tests in Smith (2020), such as recovery of clusters and proximity of 'true' tree RF and particularly the path distance. This possibly justifies a longer running time -- I'll be interested to see the effect of the different metrics on a couple of test cases.

roblanf commented 4 years ago

Oh, that sounds really cool. I've noticed actually that in the treespace visualisations the RF distances is kinda odd. I had just put it down to squashing something high-dimensional into 2D, such that there will always be a lot of extra stuff going on in the dimensions you can't see. But if there are better spatial representations, I'm certainly all for it.

I'll email you some test files. Also say hi to Durham for me - I did my undergrad there in 1998-2001. Not sure anyone in the biology dept. would remember me. I, um, didn't go to lectures all that much. Loved it though!

ms609 commented 3 years ago

Hi Rob, just following this up – did you have the chance to look over the pull request (#170)? Do let me know if there's anything else that's needed before it's ready to merge.

I've been looking into treespace visualizations over the past few months. Squashing into 2D is part of the problem, but the RF has a number of 'features' (such as rapid saturation) that introduce problems in the underlying structure of tree space, even before mapping. Information theoretic distances produce 'better' tree spaces and can often be mapped without too much stress using one fewer dimension; quartet distances can map even more nicely, though they take longer to compute. I'll hopefully have a paper on this make it through review soon...

Cheers,

Martin

swofford commented 3 years ago

I also think the matching distance of Bogdanowicz (2008) and (independently) Lin et al. (2011) may be a good choice, and is relatively fast to compute (but of course not as fast as RF).

Dave

• Bogdanowicz D. 2008. Comparing phylogenetic trees using a minimum weight perfect matching. Information Technology, 
• Bogdanowicz D., Giaro K. 2012. Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Trans Comput Biol Bioinform. 9:150–160.
• Lin Y., Rajan V., Moret B.M.E. 2011. A metric for phylogenetic trees based on matching. Bioinformatics Research and Applications.:197–208.

On Jul 1, 2021, at 7:04 AM, Martin R. Smith @.***> wrote:

Hi Rob, just following this up – did you have the chance to look over the pull request (#170)? Do let me know if there's anything else that's needed before it's ready to merge.

I've been looking into treespace visualizations over the past few months. Squashing into 2D is part of the problem, but the RF has a number of 'features' (such as rapid saturation) that count against it when used in tree space. Information theoretic distances produce better tree spaces and can often be mapped without too much stress using one fewer dimension; quartet distances can map even more nicely, though they take longer to compute.

Cheers,

Martin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

swofford commented 3 years ago

Please disregard my previous comment. I just had a look at Smith 2020 (which I had missed until now) and see that the matching distance was one of the distances evaluated in that paper. Great paper, Martin!

Dave

On Jul 1, 2021, at 7:04 AM, Martin R. Smith @.***> wrote:

Hi Rob, just following this up – did you have the chance to look over the pull request (#170)? Do let me know if there's anything else that's needed before it's ready to merge.

I've been looking into treespace visualizations over the past few months. Squashing into 2D is part of the problem, but the RF has a number of 'features' (such as rapid saturation) that count against it when used in tree space. Information theoretic distances produce better tree spaces and can often be mapped without too much stress using one fewer dimension; quartet distances can map even more nicely, though they take longer to compute.

Cheers,

Martin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

roblanf commented 3 years ago

Hi All,

Thanks for the ping on this. Dan and I haven't managed to find the time to work on RWTY2 since everything went insane. But rest assured we'll get to it at some point.

Rob

On Fri, 2 Jul 2021 at 10:35, David Swofford @.***> wrote:

Please disregard my previous comment. I just had a look at Smith 2020 (which I had missed until now) and see that the matching distance was one of the distances evaluated in that paper. Great paper, Martin!

Dave

On Jul 1, 2021, at 7:04 AM, Martin R. Smith @.***> wrote:

Hi Rob, just following this up – did you have the chance to look over the pull request (#170)? Do let me know if there's anything else that's needed before it's ready to merge.

I've been looking into treespace visualizations over the past few months. Squashing into 2D is part of the problem, but the RF has a number of 'features' (such as rapid saturation) that count against it when used in tree space. Information theoretic distances produce better tree spaces and can often be mapped without too much stress using one fewer dimension; quartet distances can map even more nicely, though they take longer to compute.

Cheers,

Martin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/danlwarren/RWTY/issues/169#issuecomment-872632418, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG2SEYWRHIOA443U42ZRFTTVUCWZANCNFSM4PI3GR7A .

-- Rob Lanfear Division of Ecology and Evolution, Research School of Biology, The Australian National University, Canberra

www.robertlanfear.com