ericgoolsby / Rphylopars

Phylogenetic Comparative Tools for Missing Data and Within-Species Variation
28 stars 11 forks source link

understanding various OU options #49

Open jakeberv opened 2 years ago

jakeberv commented 2 years ago

Hi Eric-- I was wondering if you could help me understand a couple of options in Rphylopars --

mainly, what is the difference between these two models?

lets say trait_data is a multivariate trait dataset

model1 <- phylopars(trait_data , tree, model = "OU", phylo_correlated = T)

model2 <- phylopars(trait_data , tree, model = "mvOU", phylo_correlated = T)

Are these models equivalent under the hood because phylo_correlated is set to T?

It seems like they are not, but I had thought they were...

Thanks, Jacob Berv

ericgoolsby commented 2 years ago

Great question! First, just to clarify the options of a BM model. The option phylo_correlated = TRUE tells the evolutionary rate parameter ('phylogenetic' trait covariance) to allow traits to be correlated as they evolve on the tree. In general you probably always want this set to TRUE (otherwise, you might as well just run one trait at a time). One other thing -- if you have multiple observations per species, then pheno_correlated = TRUE kicks in automatically, which assumes that traits are also correlated within species as well. Sometimes this is desirable, but sometimes it might not be appropriate (or it might involve fitting too many parameters if you have several traits). To turn off within-species correlations but still account for the multiple observations, you can set pheno_correlated = FALSE and pheno_error = TRUE. Or, you can set both to FALSE and then Rphylopars will simply take the species-level means of each trait and ignore the within-species variation.

On to the OU model and its various options. The model = "OU" option (along with the other 'univariate' tree transformations "lambda", "kappa", "delta", "EB", and "star") imposes the assumption that all traits (and their covariances) evolve under the same tree transformation.

For example, suppose I fit an OU model to the following two-trait dataset:

set.seed(1)
tree <- ape::rcoal(n = 50)
trait_data <- data.frame(species = tree$tip.label,V1 = ape::rTraitCont(tree),V2 = ape::rTraitCont(tree))
mod.1 <- Rphylopars::phylopars(trait_data = trait_data,tree = tree,model = "OU")
mod.1

# Phylogenetic trait variance-covariance
#              V1           V2
# V1 0.0122552579 0.0002541291
# V2 0.0002541291 0.0127591275
# 
# OU alpha = 0.4830562
# Stationary covariance = 
#             V1          V2
# V1 0.012685127 0.000263043
# V2 0.000263043 0.013206671

We get an OU alpha of about 0.483. This is equivalent to transforming the tree, and fitting a BM model:

rescaled_tree <- geiger::rescale(tree,model = "OU",alpha = mod.1$model$alpha)
mod.2 <- Rphylopars::phylopars(trait_data = trait_data,tree = rescaled_tree,model = "BM")
mod.2

# Phylogenetic trait variance-covariance
#              V1           V2
# V1 0.0122552513 0.0002541286
# V2 0.0002541286 0.0127591182
# 
# Brownian motion model

The models even have identical log-likelihoods:

logLik(mod.1)
# 'log Lik.' 213.2259 (df=4)

logLik(mod.2)
# 'log Lik.' 213.2259 (df=3)

This is also identical to the mvOU model if we fix the alpha matrix to a diagonal matrix with identical entries equal to our alpha from mod.1. This is far less computationally efficient than model = "OU" and you probably wouldn't do this in practice, but it's helpful to see what's going on:

diag_alpha_mat <- diag(rep(mod.1$model$alpha,2))
diag_alpha_mat
#           [,1]      [,2]
# [1,] 0.4830562 0.0000000
# [2,] 0.0000000 0.4830562

In practice, if using model = "mvOU", you would probably let the alpha matrix optimize to its (restricted) maximum likelihood values. You can constrain the alpha matrix to be diagonal, or you can allow a full symmetric matrix. To constrain the matrix to be diagonal, you set full_alpha = FALSE.

mod.4 <- Rphylopars::phylopars(trait_data = trait_data,tree = tree,model = "mvOU",full_alpha = FALSE)
mod.4
# Phylogenetic trait variance-covariance
#            V1         V2
# V1 0.01387363 0.00041003
# V2 0.00041003 0.01173209
# 
# mvOU alpha = 
#          V1           V2
# V1 1.697939 0.0000000000
# V2 0.000000 0.0002641328
# 
# 
#Stationary covariance = 
#              V1           V2
# V1 0.0040854312 2.414493e-04
# V2 0.0002414493 2.220869e+01

For a full symmetric alpha matrix, set full_alpha = TRUE (the default):

mod.5 <- Rphylopars::phylopars(trait_data = trait_data,tree = tree,model = "mvOU")
mod.5
# Phylogenetic trait variance-covariance
#               V1            V2
# V1  0.0138680075 -0.0004826286
# V2 -0.0004826286  0.0118835696
# 
# mvOU alpha = 
#           V1         V2
# V1  1.789375 -0.5083880
# V2 -0.508388  0.1458612
# 
# 
# Stationary covariance = 
#           V1       V2
# V1 0.3384726 1.177684
# V2 1.1776838 4.145462

I based the mvOU model off of Julien Clavel's mvMORPH package implementation. The mvMORPH package can handle much more complex evolutionary models -- in particular, regime shifts that can account for rate shifts, shifts in selective optima, etc. There's a nice description in the corresponding Methods in Ecology & Evolution paper and Appendix 1, with worked examples in Appendix 2 (Clavel et al. 2015).

There is a ton of literature on the OU model and its various formulations. If interested, check out these cited papers from Clavel et al. 2015: Hansen 1997, 2012; Butler & King 2004, 2009; Bartoszek et al. 2012; Monteiro 2013; Gardiner 2004; Reitan, Schweder & Henderiks 2012.

jakeberv commented 2 years ago

Cool- thanks for that thorough explanation,very helpful

How might one take an mvOU fit from rphylopars and then simulate datasets under mvOU? I see one can do this in mvMORPH with a fitted object there -- I'm getting some convergence errors in mvMORPH that don't seem to occur with rphylopars... Doesn't look like the simtraits function can do mvOU.

Best,

J

On Sun, Feb 13, 2022, 3:31 PM Eric W. Goolsby @.***> wrote:

Great question! First, just to clarify the options of a BM model. The option phylo_correlated = TRUE tells the evolutionary rate parameter ('phylogenetic' trait covariance) to allow traits to be correlated as they evolve on the tree. In general you probably always want this set to TRUE (otherwise, you might as well just run one trait at a time). One other thing -- if you have multiple observations per species, then pheno_correlated = TRUE kicks in automatically, which assumes that traits are also correlated within species. Sometimes this is desirable, but sometimes it might not be appropriate (or it might involve fitting too many parameters if you have several traits). To turn off within-species correlations but still account for the multiple observations, you can set pheno_correlated = FALSE and pheno_error = TRUE.

On to the OU model and its various options. The model = "OU option (along with the other 'univariate' tree transformations "lambda", "kappa", "delta", "EB", and "star") imposes the assumption that all traits (and their covariances) evolve under the same tree transformation.

For example, suppose I fit an OU model to the following two-trait dataset:

set.seed(1) tree <- ape::rcoal(n = 50) trait_data <- data.frame(species = tree$tip.label,V1 = ape::rTraitCont(tree),V2 = ape::rTraitCont(tree)) mod.1 <- Rphylopars::phylopars(trait_data = trait_data,tree = tree,model = "OU") mod.1

Phylogenetic trait variance-covariance

V1 V2

V1 0.0122552579 0.0002541291

V2 0.0002541291 0.0127591275

#

OU alpha = 0.4830562

Stationary covariance =

V1 V2

V1 0.012685127 0.000263043

V2 0.000263043 0.013206671

We get an OU alpha of about 0.483. This is equivalent to transforming the tree, and fitting a BM model:

rescaled_tree <- geiger::rescale(tree,model = "OU",alpha = mod.1$model$alpha) mod.2 <- Rphylopars::phylopars(trait_data = trait_data,tree = rescaled_tree,model = "BM") mod.2

Phylogenetic trait variance-covariance

V1 V2

V1 0.0122552513 0.0002541286

V2 0.0002541286 0.0127591182

#

Brownian motion model

The models even have identical log-likelihoods:

logLik(mod.1)

'log Lik.' 213.2259 (df=4)

logLik(mod.2)

'log Lik.' 213.2259 (df=3)

This is also identical to the mvOU model if we fix the alpha matrix to a diagonal matrix with identical entries equal to our alpha from mod.1. This is far less computationally efficient than model = "OU" and you probably wouldn't do this in practice, but it's helpful to see what's going on:

diag_alpha_mat <- diag(rep(mod.1$model$alpha,2)) diag_alpha_mat

[,1] [,2]

[1,] 0.4830562 0.0000000

[2,] 0.0000000 0.4830562

In practice, if using model = "mvOU", you would probably let the alpha matrix optimize to its (restricted) maximum likelihood values. You can constrain the alpha matrix to be diagonal, or you can allow a full symmetric matrix. To constrain the matrix to be diagonal, you set full_alpha = FALSE.

mod.4 <- Rphylopars::phylopars(trait_data = trait_data,tree = tree,model = "mvOU",full_alpha = FALSE) mod.4

Phylogenetic trait variance-covariance

V1 V2

V1 0.01387363 0.00041003

V2 0.00041003 0.01173209

#

mvOU alpha =

V1 V2

V1 1.697939 0.0000000000

V2 0.000000 0.0002641328

# #

Stationary covariance =

V1 V2

V1 0.0040854312 2.414493e-04

V2 0.0002414493 2.220869e+01

For a full symmetric alpha matrix, set full_alpha = TRUE (the default):

mod.5 <- Rphylopars::phylopars(trait_data = trait_data,tree = tree,model = "mvOU") mod.5

Phylogenetic trait variance-covariance

V1 V2

V1 0.0138680075 -0.0004826286

V2 -0.0004826286 0.0118835696

#

mvOU alpha =

V1 V2

V1 1.789375 -0.5083880

V2 -0.508388 0.1458612

# #

Stationary covariance =

V1 V2

V1 0.3384726 1.177684

V2 1.1776838 4.145462

I based the mvOU model off of Julien Clavel's mvMORPH implementation. The mvMORPH package can handle much more complex evolutionary models -- in particular, regime shifts that can account for rate shifts, shifts in selective optima, etc. There's a nice description in the corresponding Methods in Ecology & Evolution paper https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.12420 and Appendix 1 https://besjournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2F2041-210X.12420&file=mee312420-sup-0001-AppendixS1.pdf, with worked examples in Appendix 2 https://besjournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2F2041-210X.12420&file=mee312420-sup-0002-AppendixS2.pdf (Clavel et al. 2015).

There is a ton of literature on the OU model and its various formulations. If interested, check out these cited papers from Clavel et al. 2015: Hansen 1997, 2012; Butler & King 2004, 2009; Bartoszek et al. 2012; Monteiro 2013; Gardiner 2004; Reitan, Schweder & Henderiks 2012.

— Reply to this email directly, view it on GitHub https://github.com/ericgoolsby/Rphylopars/issues/49#issuecomment-1038408722, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKMB23W7MFIEA4T5EPVAAMDU3AILHANCNFSM5NVQOEFA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

ericgoolsby commented 2 years ago

Unfortunately that's not implemented (yet), though I hope to do so one day! But you should be able to do this in mvMORPH (see below). One thing I should also mention, I would highly recommend installing the current GitHub version of Rphylopars rather than the CRAN version. I patched it a few days ago to fix a bug that caused simtraits to always simulate a root value of 1 (regardless of what the user set anc to). A kind user brought this to my attention, and the current version on GitHub resolves the issue.

As for mvMORPH, if you want to try to get it to work for your purposes, you might be able to resolve the numerical issues by simply setting method = "inverse", or if that doesn't work, method = "pseudoinverse". These are both slower than the default settings (perhaps prohibitively slow for very large datasets), but they are more numerically stable (with pseudoinverse being the most stable). Hopefully that does the trick, but if that still doesn't work, you might try playing around with the decomp and param arguments (see Details section of ?mvOU).

You can also use mvSIM without having a fitted mvMORPH object. The only challenge (ironically) is the simplest case, when you don't have multiple regimes (like with Rphylopars). In this case, you have to 'trick' the function because mvSIM requires a simmap tree, and simmap trees have to have at least 2 regimes. Fortunately it's straightforward. Simply create a simmap tree with 2 regimes (a and b), with all of the species belonging to the first regime and none of the species belonging to the second regime. Here's an example with 2 traits. The alpha values for the traits are 2 and 3; the sigma rate matrix has variances of 1 with covariance 0.7; the root state (theta) has states -2 and 5. Note for theta that you have to create a theta for each regime (a and b), but the b regime isn't actually used.

tree <- phytools::pbtree(n=50)
regimes <- matrix(c(rep(1,length(tree$tip.label)),rep(0,length(tree$tip.label))),nrow = length(tree$tip.label),ncol = 2,dimnames = list(tree$tip.label,c("a","b")))

regimes
#     a b
# t11 1 0
# t49 1 0
# t50 1 0
# t18 1 0
# t47 1 0
# ...

simtree <- make.simmap(tree = tree,x = regimes,model = "ER",nsim=1)

alpha <- diag(2:3)
sigma <- diag(2)*.3 + .7
theta <- matrix(data = c(-2,-2,5,5),nrow = 2,ncol = 2,dimnames = list(c("a","b"),c("trait1","trait2")))

alpha
#      [,1] [,2]
# [1,]    2    0
# [2,]    0    3

sigma
#      [,1] [,2]
# [1,]  1.0  0.7
# [2,]  0.7  1.0

theta
#   trait1 trait2
# a     -2      5
# b     -2      5

sims <- mvSIM(tree = simtree,model = "OUM",param = list(sigma = sigma,alpha = alpha,theta = theta),nsim = 1)

sims
#          [,1]     [,2]
# t11 -1.710623 5.501499
# t49 -2.590463 4.517010
# t50 -2.548926 4.632217
# t18 -1.059882 5.827987
# t47 -1.871089 4.704690
# t48 -1.816265 5.122190
# ...
jakeberv commented 2 years ago

Awesome — thank you so much for taking the time to explain all of this. It all makes sense —

Re mvMORPH - I have tried the inverse method and did not achieve convergence, and R crashed the last time I tried pseudoinverse (8 dimensional dataset) — so maybe a tough problem. RPhylopars seemed to crunch it fine with the default mvOU model and did not throw any warnings.

Best, Jake

On Feb 13, 2022, at 5:44 PM, Eric W. Goolsby @.***> wrote:

Unfortunately that's not implemented (yet), though I hope to do so one day! But you should be able to do this in mvMORPH (see below). One thing I should also mention, I would highly recommend installing the current GitHub version of Rphylopars rather than the CRAN version. I patched it a few days ago to fix a bug that caused simtraits to always simulate a root value of 1 (regardless of what the user set anc to). A kind user brought this to my attention, and the current version on GitHub resolves the issue.

As for mvMORPH, if you want to try to get it to work for your purposes, you might be able to resolve the numerical issues by simply setting method = "inverse", or if that doesn't work, method = "pseudoinverse". These are both slower than the default settings (perhaps prohibitively slow for very large datasets), but they are more numerically stable (with pseudoinverse being the most stable). Hopefully that does the trick, but if that still doesn't work, you might try playing around with the decomp and param arguments (see Details section of ?mvOU).

You can also use mvSIM without having a fitted mvMORPH object. The only challenge (ironically) is the simplest case, when you don't have multiple regimes (like with Rphylopars). In this case, you have to 'trick' the function because mvSIM requires a simmap tree, and simmap trees have to have at least 2 regimes. Fortunately it's straightforward. Simply create a simmap tree with 2 regimes (a and b), with all of the species belonging to the first regime and none of the species belonging to the second regime. Here's an example with 2 traits. The alpha values for the traits are 2 and 3; the sigma rate matrix has variances of 1 with covariance 0.7; the root state (theta) has states -2 and 5. Note for theta that you have to create a theta for each regime (a and b), but the b regime isn't actually used.

regimes <- matrix(c(rep(1,length(tree$tip.label)),rep(0,length(tree$tip.label))),nrow = length(tree$tip.label),ncol = 2,dimnames = list(tree$tip.label,c("a","b")))

regimes

a b

t11 1 0

t49 1 0

t50 1 0

t18 1 0

t47 1 0

...

simtree <- make.simmap(tree = tree,x = regimes,model = "ER",nsim=1)

alpha <- diag(2:3) sigma <- diag(2)*.3 + .7 theta <- matrix(data = c(-2,-2,5,5),nrow = 2,ncol = 2,dimnames = list(c("a","b"),c("trait1","trait2")))

alpha

[,1] [,2]

[1,] 2 0

[2,] 0 3

sigma

[,1] [,2]

[1,] 1.0 0.7

[2,] 0.7 1.0

theta

trait1 trait2

a -2 5

b -2 5

sims <- mvSIM(tree = simtree,model = "OUM",param = list(sigma = sigma,alpha = alpha,theta = theta),nsim = 1)

sims

[,1] [,2]

t11 -1.710623 5.501499

t49 -2.590463 4.517010

t50 -2.548926 4.632217

t18 -1.059882 5.827987

t47 -1.871089 4.704690

t48 -1.816265 5.122190

— Reply to this email directly, view it on GitHub https://github.com/ericgoolsby/Rphylopars/issues/49#issuecomment-1038461052, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKMB23XMEVR64INQRI5PUGLU3AX5JANCNFSM5NVQOEFA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.

ericgoolsby commented 2 years ago

Ah I see! Glad to hear Rphylopars can handle it. Eight dimensions is tough indeed. If you're interested in why, there are two main reasons:

1) There is no closed-form solution to most phylogenetic comparative methods (with the exception of simple Brownian motion models with exactly one observation per species). That means we have to estimate the trait covariance matrix and the alpha matrix numerically using maximum likelihood optimization. For m traits, that's covariance parameters (e.g. for m=8, 28 parameters), and then either 8 or 28 additional parameters for the alpha matrix (depending on whether it's diagonal or full). So either 36 or 56 parameters total (for comparison, a 4-trait dataset would only require 6 parameters for covariance and either 4 or 6 parameters for alpha -- much much easier to work with). The large number of parameters requires a very large number of likelihood evaluations to try out the many different combinations of parameters. On top of that, sometimes it might not converge on the maximum likelihood solution (i.e., it might get stuck at a local optimum). And if that weren't enough, the matrices also might become unstable and give you numerical issues. 2) Technical difficulties aside, it also just takes a long time to calculate the log-likelihood, because the covariance of all observations is of dimension (n_observations m)x(n_observations m). For instance, if you had 8 traits and 100 species with 3 observations each, that would be a 2400x2400 matrix (5.76 million cells!). So this is quite time-consuming to deal with (especially matrix inversion, which is required for each likelihood evaluation). mvMORPH uses a few tricks like sparsity and Cholesky decomposition to make it more efficient, whereas Rphylopars uses a tree traversal algorithm so instead of dealing with the (3008)x(3008) matrix, it only has to deal with an 8x8 matrix -- once for each individual in the study, and once for each node on the tree. Even with the tree traversal, it can still take a while if the number of traits is large enough.

jakeberv commented 10 months ago

Following up on this after a while. I am wondering if there is a way to allow for shifts in the multivariate process similar to how mvMORPH can model shifts with an underlying SIMMAP style phylogeny. It would be cool to be able to do this with Rphylopars to take advantage of the ability to use individual-level data, rather than species means. Jake