In some cases it is useful to take branch lengths into account when comparing trees, and not only topology.
I have extended the `gotree compare trees` command with a `--weighted` flag that computes the weighted Robinson-Foulds and Kuhner-Felsenstein distances.
Metric definitions
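The weighted Robinson-Foulds distance, defined in (Robinson & Foulds, 1979), is the sum of the absolute differences of branch lengths for bipartitions shared by the two trees, plus the branch lengths of the bipartitions unique to each tree:
$$ RF_{weighted} = \sum_{e \in A\cap B} |d_{(e,A)} - d_{(e,B)}| + \sum_{e \in A\setminus B} d_{(e,A)} + \sum_{e \in B\setminus A} d_{(e,B)} $$
where $A$ and $B$ are the sets of bipartitions of the first and second trees, and $d_{(e,A)}$ is the branch length of bipartition $e$ in the first tree ($A$).
The Kuhner-Felsenstein distance is very similar, but instead of absolute differences it sums the squared differences of lengths for common bipartitions and the squared lengths of unique bipartitions, then takes the square root of the total:
$$ KF = \sqrt{ \sum_{e \in A\cap B} (d_{(e,A)} - d_{(e,B)})^2 + \sum_{e \in A\setminus B} d_{(e,A)}^2 + \sum_{e \in B\setminus A} d_{(e,B)}^2 } $$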
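To make the two definitions concrete, here is a minimal Go sketch that computes both distances from two maps of bipartition to branch length. The `weightedDistances` function and the string keys are hypothetical and are not part of gotree; they only stand in for whatever bipartition encoding is actually used.

```go
package main

import (
	"fmt"
	"math"
)

// weightedDistances returns the weighted Robinson-Foulds (wrf) and
// Kuhner-Felsenstein (kf) distances between two trees. Each tree is given as
// a map from a bipartition key to the length of the corresponding branch.
// Hypothetical helper for illustration only, not gotree's API.
func weightedDistances(a, b map[string]float64) (wrf, kf float64) {
	sumSq := 0.0
	for e, la := range a {
		if lb, ok := b[e]; ok {
			// Bipartition shared by both trees: use the length difference.
			d := la - lb
			wrf += math.Abs(d)
			sumSq += d * d
		} else {
			// Bipartition unique to the first tree: use its full length.
			wrf += la
			sumSq += la * la
		}
	}
	for e, lb := range b {
		if _, ok := a[e]; !ok {
			// Bipartition unique to the second tree.
			wrf += lb
			sumSq += lb * lb
		}
	}
	return wrf, math.Sqrt(sumSq)
}

func main() {
	a := map[string]float64{"x": 3, "y": 1}
	b := map[string]float64{"x": 1, "z": 2}
	wrf, kf := weightedDistances(a, b)
	// Shared "x": |3-1| = 2; unique "y": 1; unique "z": 2.
	fmt.Println(wrf, kf) // prints: 5 3
}
```

In this toy input the weighted Robinson-Foulds distance is 2 + 1 + 2 = 5 and the Kuhner-Felsenstein distance is sqrt(4 + 1 + 4) = 3.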
Code changes
I added a new function called `CompareWeighted`, heavily inspired by the `Compare` function in `tree/algo.go`. It returns a `WeightedBipartitionStats` struct that holds slices for: branch lengths unique to the reference tree, branch lengths unique to the compared tree, and differences of branch lengths for common bipartitions.
This new version computes the edge index for the reference tree, and then for each compared tree it:
- computes the edge index of the compared tree;
- for each edge in the compared edge index, checks whether it is present in the reference index:
  - if it is, adds the difference between the compared and reference edge lengths to the `WeightedBipartitionStats` struct;
  - if not, adds the compared branch length to the `WeightedBipartitionStats` struct;
- for each edge in the reference edge index, checks whether it is present in the compared index and, if it is not, adds the reference branch length to the `WeightedBipartitionStats` struct.
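The steps above can be sketched roughly as follows. This is a simplified illustration, not the code from the pull request: the `weightedStats` struct, its field names, and `collectWeightedStats` are hypothetical, and plain maps keyed by a bipartition string stand in for gotree's edge index.

```go
package main

import "fmt"

// weightedStats is a hypothetical stand-in for the WeightedBipartitionStats
// struct; the field names are assumptions made for this sketch.
type weightedStats struct {
	refOnly  []float64 // lengths of branches found only in the reference tree
	compOnly []float64 // lengths of branches found only in the compared tree
	diffs    []float64 // compared minus reference length for shared branches
}

// collectWeightedStats walks both edge indexes (here simple maps from a
// bipartition key to a branch length) and fills the three slices.
func collectWeightedStats(ref, comp map[string]float64) weightedStats {
	var s weightedStats
	// Pass over the compared tree: shared edges give a difference,
	// edges absent from the reference are unique to the compared tree.
	for e, lc := range comp {
		if lr, ok := ref[e]; ok {
			s.diffs = append(s.diffs, lc-lr)
		} else {
			s.compOnly = append(s.compOnly, lc)
		}
	}
	// Pass over the reference tree: keep only edges missing from the compared tree.
	for e, lr := range ref {
		if _, ok := comp[e]; !ok {
			s.refOnly = append(s.refOnly, lr)
		}
	}
	return s
}

func main() {
	ref := map[string]float64{"AB|CDE": 0.5, "ABC|DE": 0.25}
	comp := map[string]float64{"AB|CDE": 0.75, "ABD|CE": 0.125}
	s := collectWeightedStats(ref, comp)
	fmt.Println(len(s.diffs), len(s.compOnly), len(s.refOnly)) // prints: 1 1 1
}
```

The lengths of the three slices give the number of common bipartitions and the number of bipartitions unique to each tree.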
N.B.: I considered modifying the `Compare` function instead of writing a new one, since the numbers of common and unique edges can be derived from the lengths of the slices in the `WeightedBipartitionStats` struct. However, keeping track of all the branch lengths wastes memory when they are not going to be used, especially for large trees.
Usage
We can compute these metrics with the `--weighted` flag:
To check only whether trees are identical in both topology and branch lengths, we can combine the `--binary` flag with `--weighted`. In the following example we compare two trees that have the same topology but different branch lengths, so they are reported as not identical:
To check that the metrics are correct, I used the examples from the Felsenstein lab web page:
We can use the following trees:
and compute all the pairwise KF distances using gotree:
If we print table.csv we get:
which is the same as the one on the web page.