evolbioinfo / gotree

Gotree is a set of command line tools and an API to manipulate phylogenetic trees. It is implemented in Go language.
GNU General Public License v2.0
118 stars 15 forks source link

Adding branch length aware metrics to `gotree compare trees` #20

Closed lucblassel closed 1 year ago

lucblassel commented 1 year ago

In some cases it is useful to take into account branch lengths when comparing trees and not only topology. I have extended the gotree compare trees to add a --weighted flag that will compute the weighted Robinson-Foulds and the Khuner-Felsenstein distance.

Metric definitions

The Weighted Robinson-Foulds distance is defined in (Robinson & Foulds, 1979) is defined as the absolute difference of branch lengths for matched bipartitions between two trees, plus the branch lengths of unique bipartitions:

$$ RF{weighted} = \sum{e \in A\cap B} |d{(e,A)} - d{(e,B)}| + \sum{e \in A\setminus B}d{(e,A)} + \sum{e \in B\setminus A}d{(e,B)} $$

Where $A$ and $B$ are the sets of bipartitions of the first and second trees, and $d_{(e,A)}$ the branch length of bipartition $e$ in the first tree ($A$).

The Khuner-Felsenstein distance is very similar, but instead of using the absolute difference it sums the square difference of lengths for common bipartitions with the squared lengths of unique bipartitions and takes the root of all that:

$$ KF = \sqrt{ \sum{e \in A\cap B} (d{(e,A)} - d{(e,B)})^2 + \sum{e \in A\setminus B}d{(e,A)}^2 + \sum{e \in B\setminus A}d_{(e,B)}^2 } $$

Code changes

I added a new function called CompareWeighted that is heavily inspired from the Compare function in tree/algo.go. It returns a WeightedBipartitionStats struct that holds slices for: branch lengths unique to the reference tree, branch lengths unique to the compared tree, and difference of branch lengths for common bipartitions. This new version computes the edge index for the reference tree and then for each compared tree it:

  1. computes the edge index of the compared tree
  2. for each edge in the compared edge index, check if it is present in the reference index:
    • If it is present, add the difference between compared and reference edge length to the WeightedBipartitionStats struct
    • if not, add the compared branch length to the WeightedBipartitionStats struct.
  3. for each edge in the reference edge index, check if it is present in the compared index, and if it is not, add the reference branch length to the WeightedBipartitionStats struct.

N.B: I thought of just modifying the Compare function instead of writing this new one, since we can derive the number of common and unique edges by looking at the length of the slices in the WeightedBipartitionStats struct, however it is a waste of memory to keep track of all the branch lengths if we are not going to use them, especially for large trees.

Usage

we can compute these metrics with the --weighted flag:

gotree compare trees --weighted -i <(gotree generate yuletree --seed 10) -c <(gotree generate yuletree --seed 12 -n 1)
tree weighted_RF KF
0 1.310593E+00 5.056856E-01

If we just want to check if trees are identical both in terms of topology and branch lengths we can use the --binary flag with the --weighted one, e.g. in the following example we compare two trees that have the same topology but different branch lengths, so they will be considered as not identical:

gotree compare trees --weighted --binary \
 -i  <(echo "(A:0.1,(B:0.1,(H:0.1,(D:0.1,(J:0.1,(((G:0.1,E:0.1):0.1,(F:0.1,I:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);")  \
 -c  <(echo "(A:0.2,(B:0.2,(H:0.2,(D:0.2,(J:0.2,(((G:0.2,E:0.2):0.2,(F:0.2,I:0.2):0.2):0.2,C:0.2):0.2):0.2):0.2):0.2):0.2);")

To check that the metrics are correct I used the examples from the Felsenstein lab webage:

We can use the following trees:

echo "(A:0.1,(B:0.1,(H:0.1,(D:0.1,(J:0.1,(((G:0.1,E:0.1):0.1,(F:0.1,I:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(D:0.1,((J:0.1,H:0.1):0.1,(((G:0.1,E:0.1):0.1,(F:0.1,I:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(D:0.1,(H:0.1,(J:0.1,(((G:0.1,E:0.1):0.1,(F:0.1,I:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,(G:0.1,((F:0.1,I:0.1):0.1,((J:0.1,(H:0.1,D:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,(G:0.1,((F:0.1,I:0.1):0.1,(((J:0.1,H:0.1):0.1,D:0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,((F:0.1,I:0.1):0.1,(G:0.1,((J:0.1,(H:0.1,D:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,((F:0.1,I:0.1):0.1,(G:0.1,(((J:0.1,H:0.1):0.1,D:0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,((G:0.1,(F:0.1,I:0.1):0.1):0.1,((J:0.1,(H:0.1,D:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,((G:0.1,(F:0.1,I:0.1):0.1):0.1,(((J:0.1,H:0.1):0.1,D:0.1):0.1,C:0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,(G:0.1,((F:0.1,I:0.1):0.1,((J:0.1,(H:0.1,D:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(D:0.1,(H:0.1,(J:0.1,(((G:0.1,E:0.1):0.1,(F:0.1,I:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1):0.1);
(A:0.1,(B:0.1,(E:0.1,((G:0.1,(F:0.1,I:0.1):0.1):0.1,((J:0.1,(H:0.1,D:0.1):0.1):0.1,C:0.1):0.1):0.1):0.1):0.1);"> trees.nwk

and compute all the pairwise KF distances using gotree:

var=1
while IFS= read -r line; do
  echo "$line" | ../gotree compare trees --weighted -c trees.nwk | cut -d $'\t' -f3 > $var.txt
  var=$((var + 1))
done < trees.nwk

# Make a pretty table
paste {1,2,3,4,5,6,7,8,9,10,11,12}.txt > table.tsv
# Remove temp files
rm {1,2,3,4,5,6,7,8,9,10,11,12}.txt

If we print table.csv we get:

KF KF KF KF KF KF KF KF KF KF KF KF
0.000000E+00 2.000000E-01 1.414214E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 1.414214E-01 3.162278E-01
2.000000E-01 0.000000E+00 1.414214E-01 3.162278E-01 2.828427E-01 3.162278E-01 2.828427E-01 3.162278E-01 2.828427E-01 3.162278E-01 1.414214E-01 3.162278E-01
1.414214E-01 1.414214E-01 0.000000E+00 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 0.000000E+00 3.162278E-01
3.162278E-01 3.162278E-01 3.162278E-01 0.000000E+00 1.414214E-01 1.414214E-01 2.000000E-01 1.414214E-01 2.000000E-01 0.000000E+00 3.162278E-01 1.414214E-01
3.162278E-01 2.828427E-01 3.162278E-01 1.414214E-01 0.000000E+00 2.000000E-01 1.414214E-01 2.000000E-01 1.414214E-01 1.414214E-01 3.162278E-01 2.000000E-01
3.162278E-01 3.162278E-01 3.162278E-01 1.414214E-01 2.000000E-01 0.000000E+00 1.414214E-01 1.414214E-01 2.000000E-01 1.414214E-01 3.162278E-01 1.414214E-01
3.162278E-01 2.828427E-01 3.162278E-01 2.000000E-01 1.414214E-01 1.414214E-01 0.000000E+00 2.000000E-01 1.414214E-01 2.000000E-01 3.162278E-01 2.000000E-01
3.162278E-01 3.162278E-01 3.162278E-01 1.414214E-01 2.000000E-01 1.414214E-01 2.000000E-01 0.000000E+00 1.414214E-01 1.414214E-01 3.162278E-01 0.000000E+00
3.162278E-01 2.828427E-01 3.162278E-01 2.000000E-01 1.414214E-01 2.000000E-01 1.414214E-01 1.414214E-01 0.000000E+00 2.000000E-01 3.162278E-01 1.414214E-01
3.162278E-01 3.162278E-01 3.162278E-01 0.000000E+00 1.414214E-01 1.414214E-01 2.000000E-01 1.414214E-01 2.000000E-01 0.000000E+00 3.162278E-01 1.414214E-01
1.414214E-01 1.414214E-01 0.000000E+00 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 3.162278E-01 0.000000E+00 3.162278E-01
3.162278E-01 3.162278E-01 3.162278E-01 1.414214E-01 2.000000E-01 1.414214E-01 2.000000E-01 0.000000E+00 1.414214E-01 1.414214E-01 3.162278E-01 0.000000E+00

which is the same as the one on the web page

fredericlemoine commented 1 year ago

Thanks @lucblassel for this detailed PR!