EcoJulia / Phylo.jl

Simple phylogenetic trees in Julia to work with Diversity.jl - https://github.com/EcoJulia/Diversity.jl
BSD 2-Clause "Simplified" License
33 stars 13 forks source link

Newbie help with trait modelling and tree traversal #71

Open simone66b opened 2 years ago

simone66b commented 2 years ago

Please forgive my ignorance. I am a longtime R programmer/user but am new to Julia. What I really want to do is to recursively traverse the tree so I can simulate stochastic differential equations on the tree, ideally using DifferentialEquations.jl. I am trying to get my head around Phylo.jl as I don't want to re-invent the wheel and anyway Phylo.jl looks more featureful than what I could ever build on my own. Consider the following toy code:

using Phylo
tree = parsenewick("((a:1.0, (b:1.0, c:1.0):1.0):1.0, d:1.0);") 

First, how do I find the root node? getroots() doesn't work. Second, how can I work with an iterator over the branches? I tried this:

collect(getbranches(tree))

6-element Vector{Pair{Int64, Branch{String}}}: [node "Node 3"]-->[1.0 length branch 5]-->[node "Node 2"]

[node "Node 2"]-->[1.0 length branch 4]-->[node "a"]

[node "Node 3"]-->[1.0 length branch 6]-->[node "d"]

[node "Node 1"]-->[1.0 length branch 2]-->[node "b"]

[node "Node 2"]-->[1.0 length branch 3]-->[node "Node 1"]

[node "Node 1"]-->[1.0 length branch 1]-->[node "c"]

I am unsure how to work with this any further. Does the --> symbol mean anything programmatically? Or is it literally just an arrow pointing from node to node/leaf? How do I get access to the information in the elements of the Vector? To put this into more context, here is a toy version of my code using DifferentialEquations.jl which uses my own Node type, which does what I want to do. It doesn't scale up very well because I haven't written a Newick tree parser. I just built the Nodes by hand. (The main reason to use Phylo.jl):

using DifferentialEquations, Plots
mutable struct Node
    name::Int64
    parent::Union{Int64, Missing}
    left::Union{Int64, String}
    right::Union{Int64, String}
    llength::Float64
    rlength::Float64
    lphenotype::Any
    rphenotype::Any
    rheight::Union{Float64, Missing}
    lheight::Union{Float64, Missing}
end

Node1 = Node(1, missing, 2, 3, 4.5, 4.5, [], [], missing, missing)
Node2 = Node(2, 1, "a", 4, 2.0, 2.0, [], [], missing, missing)
Node3 = Node(3, 1, "b", "c", 2.0, 2.0, [], [], missing, missing)
Node4 = Node(4, 2, "d", "e", 2.0, 2.0, [], [], missing, missing)
tree = [Node1, Node2, Node3, Node4]

function mysim(tree)
    nodeInit = findall(x -> ismissing(x.parent) == 1, tree)[1] # find the root
    ## define the OU simulation
    function OU(x0, tspan)
        function drift(x, p, t)
            mu, sigma, alpha = p;
            alpha * (mu - x)
        end

        function diff(x,p, t)
            mu, sigma, alpha = p;
            sigma
        end

        dt = 0.005
        p = (1, 0.3, 1.5);
        prob = SDEProblem(drift, diff, x0, tspan,p)
        sol =  solve(prob, EM(), dt=dt)
    end

        function Heights!(tree, nodeInit)
    ## get the node heights so we know when to start the simulations
            node = tree[nodeInit]
            if ismissing(node.parent)
                node.lheight = node.llength
               node.rheight = node.rlength
            else node.lheight=node.llength + tree[node.parent].lheight
                node.rheight = node.rlength + tree[node.parent].rheight
            end
            if isa(node.left, Int64) Heights!(tree, node.left)
            end
            if isa(node.right, Int64) Heights!(tree, node.right)
            end
        end

    function Recurse!(tree, nodeInit)
        ## recurse along the tree, simulating the OU process for each branch and storing it in lphenotype and rphenotype
        node = tree[nodeInit]

        if ismissing(node.parent) ## the root node, to get started
            node.lphenotype = OU(0.0, (0.0, node.lheight))
            node.rphenotype = OU(0.0, (0.0, node.rheight))

        elseif tree[node.parent].left == node.name
            node.lphenotype = OU(tree[node.parent].lphenotype[end],
                                        (tree[node.parent].lheight, node.lheight))

            node.rphenotype = OU(tree[node.parent].lphenotype[end],
                                        (tree[node.parent].rheight, node.rheight))
        elseif tree[node.parent].right == node.name
            node.lphenotype = OU(tree[node.parent].rphenotype[end],
                                        (tree[node.parent].lheight, node.lheight))
            node.rphenotype = OU(tree[node.parent].rphenotype[end],
                                        (tree[node.parent].rheight, node.rheight))
        end
        if isa(node.left, Int64) Recurse!(tree, node.left) 
        end
        if isa(node.right, Int64) Recurse!(tree, node.right)
        end
    end
    Heights!(tree, nodeInit) # calculate the heights
    Recurse!(tree, nodeInit) # do the recursive simulations
    tree
end

test = mysim(tree)

plot(xlim=[0.0, 9.0], ylim= [0.0, 1.5], legend=nothing)
for i in 1:length(test)
    plot!(test[i].lphenotype.t, test[i].lphenotype.u, legend= nothing)
    plot!(test[i].rphenotype.t, test[i].rphenotype.u, legend = nothing)
end
current()
savefig("test.png")

I'm sorry for the quality of the code, I just translated the way I envisaged it in my head and haven't thought about different ways to do it that might be cleaner and more efficient. Also, I notice the Tutorial/Quick Start document is missing from the website. I would be happy to work on that if you like! I am obviously still learning but I should get into the position where I am able to write such a document.

mkborregaard commented 2 years ago

Hi, I can try to answer some of the smaller questions, though I haven't gone in on your long code chunk.

First, how do I find the root node?

julia> using Phylo

julia> tree = parsenewick("((a:1.0, (b:1.0, c:1.0):1.0):1.0, d:1.0);")
RootedTree with 4 tips, 7 nodes and 6 branches.
Leaf names are c, b, a and d

julia> getroot(tree)
LinkNode Node 7, a root node with 2 outbound connections (branches 5, 6)

How do I get an iterator over the branches?

Do you mean the nodes?

julia> iter = nodeiter(tree)
Phylo.NodeIterator{RootedTree}(RootedTree with 4 tips, 7 nodes and 6 branches.
Leaf names are c, b, a and d
, nothing)

julia> first(iter)
LinkNode a, a tip of the tree with an incoming connection (branch 4).

Or if you do want to collect them into a vector (you usually shouldn't have to)

julia> vec = collect(iter)
7-element Vector{LinkNode{OneRoot, String, Dict{String, Any}, LinkBranch{OneRoot, String, Dict{String, Any}, Float64}}}:
...

julia> vec[3]
LinkNode c, a tip of the tree with an incoming connection (branch 1).

julia> typeof(vec[3])
LinkNode{OneRoot, String, Dict{String, Any}, LinkBranch{OneRoot, String, Dict{String, Any}, Float64}}

julia> vec[3].name
"c"

Does this help a little?

simone66b commented 2 years ago

That helps a bit, but getroot() doesn't work for me:

julia> using Phylo, Plots
julia> tree = parsenewick("((a:1.0, (b:1.0, c:1.0):1.0):1.0, d:1.0);")
NamedTree with 4 tips, 7 nodes and 6 branches.
Leaf names are a, b, c and d

julia> getroot(tree)
ERROR: UndefVarError: getroot not defined
Stacktrace:
 [1] top-level scope
   @ none:1

julia> 
simone66b commented 2 years ago

In fact, most of the advertised functions on [http://docs.ecojulia.org/Phylo.jl/stable/] don't work.

simone66b commented 2 years ago

I'm using Julia Version 1.7.2 (2022-02-06) (@v1.7) pkg> st Phylo Status ~/.julia/environments/v1.7/Project.toml [aea672f4] Phylo v0.3.3 Is my version too old? I just installed it using ] add Phylo.

mkborregaard commented 2 years ago

Yes the current version is 0.4.21. Can you post the full status? Or try to install it into a clean environment?

simone66b commented 2 years ago

(@v1.7) pkg> status Status ~/.julia/environments/v1.7/Project.toml [1520ce14] AbstractTrees v0.2.1 [3637df68] Bio v1.0.1 [8f4d0f93] Conda v1.7.0 [aae7a2af] DiffEqFlux v0.7.0 [0c46a032] DifferentialEquations v6.9.0 [31c24e10] Distributions v0.23.8 [d3d5718d] Diversity v0.4.6 [b2ad6718] EmacsVterm v0.2.0 [cd3eb016] HTTP v0.8.19 [916415d5] Images v0.24.1 [682c06a0] JSON v0.21.3 [b964fa9f] LaTeXStrings v1.3.0 [442fdcdd] Measures v0.3.1 [e1d29d7a] Missings v0.4.5 [f57f5aa1] PNGFiles v0.3.14 [65888b18] ParameterizedFunctions v5.6.0 [aea672f4] Phylo v0.3.3 [c0d5b6db] PhyloPlots v0.3.1 [4c47b132] PhyloTrees v0.11.1 [a03496cd] PlotlyBase v0.3.1 [91a5bcdd] Plots v0.28.4 [d330b81b] PyPlot v2.10.0 [74087812] Random123 v1.4.2 [f2b01f46] Roots v0.8.4 [90137ffa] StaticArrays v0.12.5 [2913bbd2] StatsBase v0.32.2 [f3b207a7] StatsPlots v0.14.5 [789caeaf] StochasticDiffEq v6.16.1

simone66b commented 2 years ago

OK I managed to create a new project and generate the environment, add Phylo and add DifferentialEquations. I now have the latest version of Phylo:

(testJulia) pkg> st Phylo Project testJulia v0.1.0 Status ~/testJulia/Project.toml [aea672f4] Phylo v0.4.21

So far, so good. How do I start a julia process so that it knows about the new project?

mkborregaard commented 2 years ago

This is really useful reading for working with environments https://pkgdocs.julialang.org/v1/environments/ In brief, if you use VSCode it will generally use the environment associated with your work folder - so if you create an environment in the work folder using ]activate . , opening that folder from VScode should automatically activate that environment in a julia session.

This works essentially like a souped-up version of RStudio's "projects" feature.

If you are working with scripts you can specify it like julia --project=testJulia myscript.jl

mkborregaard commented 2 years ago

I guess just an added comment is that the reason you got this version issue is that you have some old package in your default environment that is keeping Phylo back because of dependency issues. For this reason, it is generally recommended to have an empty, or very lean, default environment, and only installing packages into project environments. This greatly reduces the risk of dependency issues.

I'm not myself completely disciplined, but I try - this is my default environment


(@v1.7) pkg> st
      Status `~/.julia/environments/v1.7/Project.toml`
  [6e4b80f9] BenchmarkTools v1.2.2
  [336ed68f] CSV v0.9.11
  [a93c6f00] DataFrames v1.1.1
  [5fb14364] OhMyREPL v0.5.10
  [91a5bcdd] Plots v1.25.7
  [c3e4b0f8] Pluto v0.17.7
  [295af30f] Revise v3.3.1
  [f3b207a7] StatsPlots v0.14.30 
mkborregaard commented 2 years ago

But maybe - let us keep using this issue for your question on trait modelling and tree traversal and take any discussion of environments etc under the #ecology channel on https://julialang.zulipchat.com/register/ ?

simone66b commented 2 years ago

Fair enough. I'll read up about environments. Thanks!

mkborregaard commented 2 years ago

Great. In terms of your core question, could you specify what you mean by

recursively traverse the tree so I can simulate stochastic differential equations on the tree`

? I'm not quite sure I get it.

simone66b commented 2 years ago

Simply that the only? way to simulate trait evolution on a tree is to start at the root and simulate along the branch to the next node. Then use the final value of the trait at that node as the starting value for the simulation along the next branch downstream. To step from the root along the tree it makes sense (to me) to use recursion because you can apply the same trait evolution function (in my example above, the Ornstein-Uhlenbeck process) to each subtree with the new "root" as the root of the subtree and using the starting value for the trait the end value along the branch leading to the new "root." The base case is when you reach a leaf in which case you simply simulate along the branch leading to the leaf and then return from the recursive function. This is the approach that I use in the Recurse! function, above. It works fine but I'm not sure about its use of memory resources or speed. It's certainly much faster than using the same approach in R. If there is another better approach to traversing the tree and simulating trait evolution along the way, I would be glad to know about it. My challenge now is to use the Phylo Tree/Branch/Node data structures (and DifferentialEquations.jl) as a basis for my code above, rather than my simpleminded Node type. Then I will have code that will allow for the evolution of almost any continuous stochastic process along any tree, not just Brownian Motion as in BrownianTrait (I think. I haven't used BrownianTrait. Is there a toy working example for BrownianTrait?)

richardreeve commented 2 years ago

I'm not quite sure what you need here, but it seems like you may be reconstructing code that already exists. There is a traversal() function that can give you depth-first or breadth-first access to nodes (apologies, TraversalOrder itself is not documented, but its definition is here). You can also simulate traits on the tree directly - the docs are here.

Maybe if you can explain what they don't do that you need, we can work out where to go from there?

mkborregaard commented 2 years ago

@mschauer used Phylo to model OU processes a few years back

mschauer commented 2 years ago

Yeah, I have some code for that somewhere, but essentially you need to do tree traversal and call StochDiffEq on each segment. Do you want to forward simulate or also do inference on the parameters given data?

simone66b commented 2 years ago

Ideally both simulation and inference. I have code that works now. Essentially I do a recursive preorder tree traversal and use the previous branch's last simulated value as the starting value for the current branch. (I store the simulations in the branch rather than the node.) Not thought heavily about inference yet but my previous work in R suggests that it is difficult if all you have are the data at the leaves. There's just not much information about the evolutionary dynamics.