Evolution is often well represented by bifurcating trees. Different processes can create "trees within trees" structures, such as gene trees evolving within a species tree. How those two tree structures map to each other can be inferred using phylogenetic reconciliations.
This is a simple tutorial on how to use ALE, a program that reconciles gene trees with a species tree. We are going to learn how to run a reconciliation and how to interpret the output of ALE.
You need two things to run a reconciliation: a gene tree and a species tree. The species tree can be a dated tree, but this is not a requirement (see the section on the differences between ALE dated and ALE undated). The gene tree can be a distribution of gene trees, such as the one inferred by the ultrafast bootstrap of IQ-TREE (use the -wbt option to write the individual bootstrap trees). Using a gene tree distribution instead of a single tree is encouraged, because it informs ALE about the uncertainty in the topology of the gene tree and allows it to make better predictions.
In this example, we are going to use simulated data that I generated with Zombi. I generated a small species tree with 10 leaves and simulated 100 gene families that were present at the root. The families evolved through Duplication, Transfer and Loss events. Zombi outputs the final gene trees, which is what we will use in this tutorial (there is no need to use a distribution of trees in this case: because the data are simulated, we are certain about the topology of each gene tree).
The first thing to do is to obtain the .ale file for every gene family (files in the Trees folder). The .ale file contains the CCPs (conditional clade probabilities), which ALE uses later to estimate the likelihood of the different reconciliations. The files are very easy to obtain; the command is:
ALEobserve 1_prunedtree.nwk
This will generate the file:
1_prunedtree.nwk.ale
The .ale files can be found in the folder ALEs. The ALE file format is simply an efficient way to store information about the gene tree distribution that takes up less disk space than, for example, a complete set of bootstrap or MCMC trees in newick format.
Once we have this, we run the reconciliation by using the command:
ALEml_undated SpeciesTree.nwk 10_prunedtree.nwk.ale
This will produce two files: the uml_rec file and the uTs file. All the files have already been computed and the reader can inspect them in the different folders of this repository.
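When there are many families, the two commands above can be scripted. The following is a minimal sketch, assuming that the folder contains only gene trees stored as `*.nwk` files and that `ALEobserve` and `ALEml_undated` are available on the PATH. The helper names `build_ale_commands` and `run_all` are illustrative and are not part of ALE.

```python
import subprocess
from pathlib import Path


def build_ale_commands(tree_dir, species_tree):
    """Return the ALEobserve and ALEml_undated commands for every gene tree.

    Assumption: tree_dir holds only gene trees with a .nwk extension.
    """
    commands = []
    for tree in sorted(Path(tree_dir).glob("*.nwk")):
        # As shown above, ALEobserve writes <input>.ale
        ale_file = f"{tree}.ale"
        commands.append(["ALEobserve", str(tree)])
        commands.append(["ALEml_undated", str(species_tree), ale_file])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it easy to inspect the commands first, for example with `print(build_ale_commands("Trees", "SpeciesTree.nwk"))`, before calling `run_all`.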
NOTE: The genes in the gene tree must have a mapping to the species tree. The usual way is using this format: SPECIES_GENEID
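Before running ALE, it can be worth checking that every gene name maps to a leaf of the species tree. The sketch below assumes that the species name is everything before the first underscore, so species names themselves must not contain underscores. The function names `species_of` and `unmapped_genes` are hypothetical helpers, not part of ALE.

```python
def species_of(gene_name, separator="_"):
    """Return the species part of a SPECIES_GENEID leaf name.

    Assumption: the species name is the text before the first separator.
    """
    species, found, gene_id = gene_name.partition(separator)
    if not found or not species or not gene_id:
        raise ValueError(f"{gene_name!r} does not look like SPECIES{separator}GENEID")
    return species


def unmapped_genes(gene_names, species_names):
    """Return the genes whose species is not a leaf of the species tree."""
    known = set(species_names)
    return [gene for gene in gene_names if species_of(gene) not in known]


# Made-up names, for illustration only
print(species_of("n1_45"))
print(unmapped_genes(["n1_3", "n9_2"], ["n1", "n2"]))
```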
The script ale_splitter.py can be used to write the different parts of a uml_rec file to separate files, which simplifies parsing them later with Python or R:
python ale_splitter.py -i S_S_COG3397.ufboot.ale.uml_rec -sftr
If the user is dealing with a large number of reconciliations, there is a better way to extract that information. This script creates two big tables with all the relevant information per family:
python ale_parser.py -i FolderWithReconciliations -sft
ALE takes the input Species Tree and renames the inner nodes of the species tree. An easy way to visualize the tree is copying it and pasting into SeaView. For instance, the tree we are using in this example looks like this.
In this tree, 12 is the name of the branch leading to the common ancestor of n18 and n17, 17 is the parent branch of branches 16 and 11, and so on. Those branch names are the names used in the uml_rec and uTs files.
ALE infers 100 reconciliations by default. It creates two files:
Files uTs: They contain information about the Lateral Gene Transfers. The columns indicate the donor branch, the recipient branch and the weight of the transfer, i.e. the number of times that the transfer has been found divided by the number of reconciliations.
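Files uml_rec: They contain the species tree with the branch names assigned by ALE, the reconciled gene trees, and a table with the average number of events per branch. The columns of that table are described below.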
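The uTs file described above can be read with a few lines of Python. This is a minimal sketch, assuming that each data line holds three whitespace-separated fields (donor, recipient, weight) and that lines starting with `#` are headers. The example content and the 0.3 cutoff are made up for illustration.

```python
import io


def read_transfers(handle):
    """Parse (donor, recipient, weight) tuples from an open uTs file."""
    transfers = []
    for line in handle:
        line = line.strip()
        # Assumption: header or comment lines start with '#'
        if not line or line.startswith("#"):
            continue
        donor, recipient, weight = line.split()[:3]
        transfers.append((donor, recipient, float(weight)))
    return transfers


# Made-up file content, used here instead of a real uTs file
example = io.StringIO("#from\tto\tfreq.\n12\tn3\t0.45\n17\t11\t0.02\n")
transfers = read_transfers(example)

# Keep only the transfers found in at least 30% of the reconciliations
# (an arbitrary threshold chosen for this example)
well_supported = [t for t in transfers if t[2] >= 0.3]
print(well_supported)
```

With a real file, `read_transfers(open(path))` would replace the `io.StringIO` object.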
What does it mean that a gene family has 0.5 transfers? The values in the table are obtained by counting the total number of events in the 100 reconciled trees and dividing this number by 100. This means that if half of the reconciliations have a single transfer event (and the other half have none), we will see that the family has 0.5 transfers. The correct way to interpret this number is as the probability of a transfer taking place in this family.
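The averaging can be illustrated with made-up counts. In this sketch, `summarize` is a hypothetical helper that returns both the average number of transfers and the fraction of reconciliations containing at least one transfer. The two numbers coincide only when no reconciliation contains more than one transfer, which is worth keeping in mind when reading an average as a probability.

```python
def summarize(transfers_per_reconciliation):
    """Return (average number of transfers, fraction of reconciliations with a transfer)."""
    n = len(transfers_per_reconciliation)
    average = sum(transfers_per_reconciliation) / n
    fraction_with_transfer = sum(1 for t in transfers_per_reconciliation if t > 0) / n
    return average, fraction_with_transfer


# Half of the 100 reconciliations contain exactly one transfer
one_transfer_in_half = [1] * 50 + [0] * 50
print(summarize(one_transfer_in_half))  # (0.5, 0.5)

# A quarter of the reconciliations contain two transfers: same average, lower fraction
two_transfers_in_quarter = [2] * 25 + [0] * 75
print(summarize(two_transfers_in_quarter))  # (0.5, 0.25)
```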
Branch - Code of the branch according to the Species Tree
BranchType - Whether the branch is a terminal branch or an inner branch
Duplications - Average number of Duplications events in the branch
Transfers - Average number of Transfer events in the branch
Losses - Average number of Loss events in the branch
Originations - Fraction of times that the Gene Family starts in this specific branch
Copies - Average number of copies in the branch
Singletons - Average number of genes that are seen as evolving vertically, i.e. the gene can be found at the beginning of the branch and at the end
Extinctionprob - The probability that a gene present in this branch will eventually go extinct
Presence - Fraction of reconciliations (between 0 and 1) in which the gene family is present in this branch
LL - The likelihood of the gene family originating at this specific branch
In Coleman et al. 2021 (A rooted phylogeny of bacteria resolves early evolution), we proposed two metrics to evaluate the amount of transfer relative to the amount of vertical evolution. The first one is verticality, a branch-wise metric defined as the total number of singletons inferred in a branch divided by the sum of the singletons, originations and transfers in that branch.
This measure can be computed with, for example, pandas. The user can find a notebook in this repository (called AnalyzingResults.ipynb) that shows how to do it.
import pandas as pd

# Sum the event counts of all families for each branch
df = pd.read_csv("TableEvents.tsv", sep="\t")
dfb = df.groupby("Branch", as_index=False).sum(numeric_only=True)
dfb["Verticality"] = dfb["singletons"] / (dfb["singletons"] + dfb["Originations"] + dfb["Transfers"])
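The second metric is called transfer propensity, and it is a family-wise metric. It measures, as the name indicates, how prone a family is to being transferred: the total number of transfers inferred in the family divided by the sum of its singletons and transfers.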
import pandas as pd

# Sum the event counts of all branches for each family
df = pd.read_csv("TableEvents.tsv", sep="\t")
dff = df.groupby("Family").sum(numeric_only=True)
dff["TransferPropensity"] = dff["Transfers"] / (dff["singletons"] + dff["Transfers"])
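To check the two metrics by hand, it can help to compute them on a tiny table. The sketch below uses a made-up table with two families and two branches. The column names follow the snippets above, and assuming that the real TableEvents.tsv has exactly these columns is my guess rather than something guaranteed by ALE.

```python
import pandas as pd

# Made-up event table: one row per family and branch
toy = pd.DataFrame({
    "Family": [1, 1, 2, 2],
    "Branch": ["n1", "12", "n1", "12"],
    "Transfers": [0.5, 0.0, 0.0, 1.0],
    "Originations": [0.0, 1.0, 0.0, 1.0],
    "singletons": [0.5, 1.0, 1.0, 0.0],
})
events = ["Transfers", "Originations", "singletons"]

# Verticality: singletons / (singletons + originations + transfers), per branch
per_branch = toy.groupby("Branch")[events].sum()
verticality = per_branch["singletons"] / per_branch[events].sum(axis=1)

# Transfer propensity: transfers / (singletons + transfers), per family
per_family = toy.groupby("Family")[events].sum()
transfer_propensity = per_family["Transfers"] / (per_family["singletons"] + per_family["Transfers"])

print(verticality)
print(transfer_propensity)
```

For branch n1 the sums are 1.5 singletons, 0 originations and 0.5 transfers, so the verticality is 1.5 / 2.0 = 0.75. A branch or family in which all the summed counts are zero would give a division of zero by zero, which pandas reports as NaN.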
There are at least two very important things to remember:
The main innovation of ALE is that it does not reconcile a single gene tree with the species tree. Instead, it uses a distribution of gene trees, such as the one obtained by bootstrapping or the posterior sample of a Bayesian method. In most phylogenetic studies those trees are summarized as a single tree with values on the internal nodes that represent the degree of confidence in each node. ALE instead takes the whole distribution of trees and reconciles the different splits found in them, weighting each split by its frequency. By doing this, ALE manages to account for the uncertainty in the gene trees.
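The idea of weighting splits by their frequency can be illustrated with a toy example. This sketch is a simplification and not what ALE actually implements: it treats trees as rooted nested tuples and only counts how often each clade appears in the sample, whereas ALE stores conditional clade probabilities. The function names and the four trees are made up.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, list of clades) for a tree written as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade defined by this internal node
    return leaves, found


def clade_frequencies(trees):
    """Return the fraction of trees in which each clade is observed."""
    counts = Counter()
    for tree in trees:
        _, tree_clades = clades(tree)
        counts.update(set(tree_clades))
    return {clade: n / len(trees) for clade, n in counts.items()}


# A made-up sample of four gene trees with leaves A, B, C and D
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
frequencies = clade_frequencies(sample)
for clade, frequency in sorted(frequencies.items(), key=lambda item: -item[1]):
    print(sorted(clade), frequency)
```

In this sample the clade (A, B) appears in three of the four trees, so it receives more weight than the clade (A, C), which appears only once.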
There are many useful things such as:
Etc
ALE dated is the original model: in this model, transfers are only inferred between contemporaneous lineages. This is guaranteed by using a dated tree as input, i.e. a tree in which the branch lengths are proportional to time. This approach has several problems (see Davin et al. 2022 for a more detailed discussion), but the main one is that obtaining dated trees is difficult. ALE dated is also slow. For those reasons, ALE undated (Szollosi 2014) was developed. It does not require a dated tree, but as a consequence some of the inferred transfers can be time-inconsistent (i.e. it is impossible to make a dated tree that meets all the constraints imposed by the transfers).
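Time consistency can be checked with a small graph test. The sketch below is not part of ALE. It assumes that each branch is named after the node at its lower end, and that a transfer from a donor branch to a recipient branch implies that the parent node of the donor is older than the recipient node. Together with the "parent older than child" constraints of the tree, a set of transfers is then time-consistent only if these "older than" relations contain no cycle. The toy tree and transfers are made up, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def is_time_consistent(parent, transfers):
    """Check whether a dated tree could satisfy all the transfers.

    parent: dict mapping each branch to its parent branch (the root is absent).
    transfers: iterable of (donor, recipient) branch names.
    """
    sorter = TopologicalSorter()
    for child, par in parent.items():
        sorter.add(child, par)  # the parent node is older than the child node
    for donor, recipient in transfers:
        older = parent.get(donor)
        if older is not None:
            # Assumed constraint: the donor's parent node is older than the recipient node
            sorter.add(recipient, older)
    try:
        sorter.prepare()  # raises CycleError if the constraints contradict each other
    except CycleError:
        return False
    return True


# Toy species tree: root -> (A, B), A -> (a1, a2), B -> (b1, b2)
parent = {"A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B", "b2": "B"}

print(is_time_consistent(parent, [("a1", "B")]))
print(is_time_consistent(parent, [("a1", "B"), ("b1", "A")]))
```

The first transfer alone only requires node A to be older than node B. Adding the second transfer also requires node B to be older than node A, so no dated tree can satisfy both.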