ScottClaessens / coevolve

coevolve R package for Bayesian dynamic coevolutionary models using Stan
GNU General Public License v2.0
7 stars 0 forks source link

coev_fit() infers the wrong answer from simulated data before standardisation #26

Closed ScottClaessens closed 2 months ago

ScottClaessens commented 3 months ago

With the coev_simulate_coevolution() function, we can simulate the coevolution of multiple variables and test whether coev_fit() is able to recover the true data generating process.

In a previous iteration of the package, the simulation function standardised the continuous variables after the simulation was finished. In a simulation study with this function, I found that the model did pretty well at getting the right answer. However, in order to compare to the "true" $\Delta\theta_{z}$ value, I modified the simulation function to leave the variables unstandardised upon finishing the simulation.

Strangely, when using this new "unstandardised" simulation function, the model now tends to infer the wrong direction of coevolution. Here's a reproducible example for an example where we know that X → Y but not vice versa:

library(coevolve)

# set seed
set.seed(1)

# simulate dataset where we know x -> y
sim <-
  coev_simulate_coevolution(
    n = 30,
    variables = c("x", "y"),
    selection_matrix = matrix(
      c(0.95, 0.00,
        0.90, 0.95),
      nrow = 2,
      ncol = 2,
      byrow = TRUE,
      dimnames = list(c("x", "y"), c("x", "y"))
      ),
    drift = c("x" = 0.05, "y" = 0.05),
    prob_split = 0.05
  )

# fit model without standardising variables
fit1 <-
  coev_fit(
    data = sim$data,
    variables = list(
      x = "normal",
      y = "normal"
    ),
    id = "species",
    tree = sim$tree,
    parallel_chains = 4,
    iter_sampling = 500,
    seed = 1
  )

# standardise variables
sim$data$x <- as.numeric(scale(sim$data$x))
sim$data$y <- as.numeric(scale(sim$data$y))

# fit model to standardised variables
fit2 <-
  coev_fit(
    data = sim$data,
    variables = list(
      x = "normal",
      y = "normal"
    ),
    id = "species",
    tree = sim$tree,
    parallel_chains = 4,
    iter_sampling = 500,
    seed = 1
  )

Here are the resulting $\Delta\theta_{z}$ values for both models. As you can see, only fit2 gives the right answer.

coev_plot_delta_theta(fit1)

image

coev_plot_delta_theta(fit2)

image

@erik-ringen Maybe you can help here? Why do you think the model is doing this? I've tried this with a few different seeds and the same pattern emerges.

erik-ringen commented 3 months ago

This is due to extreme strong prior-likelihood conflict, I think, along with the small sample size. If you run:

var(sim$data$y) / var(sim$data$x)

You'll see that the variance in y is something like 80x greater than that of x. So when we fit the model without standardizing, the prior effect of y on x is vastly greater than vice-versa. So from the model's POV it sees that the two traits co-vary and we've told it that y -> x is much much more likely/of greater magnitude than x -> y, so it settles on the wrong answer. I am thinking in terms of the prior predictive variance.

For example, in a linear regression with slope b, we can think of the prior predictive variance of a predictor variable x on the outcome E[Var(bx)] = Var(b) Var(x). So with a N(0,1) prior on b, the expected prior predictive variance is simply the variance of the predictor. From here, we can see that the ratio of prior predictive variance/sample variance reduces to var(x)/var(y). Similar to R^2, it gives a sense for how much variance, aprior, we expect to be captured by some predictor.

ScottClaessens commented 3 months ago

Thanks @erik-ringen. This makes sense. This raises a couple of questions regarding the package and our paper:

  1. Should we standardise continuous variables by default under the hood? This is tricky because we would normally prefer to leave outcome variables on their natural measurement scale. And we can't do something similar for e.g. positive reals modelled as lognormal. But it's quite likely that people will apply the package to real-world continuous variables with quite different variances, and we wouldn't want them getting the wrong answer. An alternative here is to instead scale the priors under the hood, like we already do for the negative binomial case.
  2. For the simulation in our paper, should we just standardise the variables as we had done before? This would mean that we won't be able to compare to the true value of $\Delta\theta_{z}$ that you had calculated.
erik-ringen commented 3 months ago

Thanks @erik-ringen. This makes sense. This raises a couple of questions regarding the package and our paper:

  1. Should we standardise continuous variables by default under the hood? This is tricky because we would normally prefer to leave outcome variables on their natural measurement scale. And we can't do something similar for e.g. positive reals modelled as lognormal. But it's quite likely that people will apply the package to real-world continuous variables with quite different variances, and we wouldn't want them getting the wrong answer. An alternative here is to instead scale the priors under the hood, like we already do for the negative binomial case.

Other options would be to run a check whether all the variables the user has mapped onto a normal distribution have unit variance, and then give a warning message with explanation telling them to standardize and/or adjust the priors appropriately. In general I think it might be a good practice to output a warning message whenever the default priors are used, suggesting users that they might not be ideal and to do prior predictive checks.

Also, for positive reals I usually scale them by just dividing by the mean/median value.

  1. For the simulation in our paper, should we just standardise the variables as we had done before? This would mean that we won't be able to compare to the true value of Δθz that you had calculated.

I think we can compare to the true value if we are careful. During the Δθz we would have to rescale a bit using the original means and sds.

ScottClaessens commented 3 months ago

We could maybe include an argument, such as coev_fit(..., scale = TRUE), which standardises variables for users. Perhaps this could be set to TRUE as default. If the user sets scale = FALSE, then the function could return the warning you suggested if the variables do not have unit variance on the continuous / log scale.

I'm wondering if we could standardise as usual for "normal" and "student_t" variables and standardise on the log scale for "lognormal" variables?

# simulate normal, student_t, and lognormal variables
n <- 1000
x <- rnorm(n, 0, 2)
y <- brms::rstudent_t(n, 3, 0, 2)
z <- rlnorm(n, 0, 2)
# standardisation approach under the hood if scale = TRUE
as.numeric(scale(x))
as.numeric(scale(y))
as.numeric(exp(scale(log(z)))

I'll go ahead with the standardised simulation for the paper, but keep a track of the means and SDs for the $\Delta\theta_{z}$ calculation.

ScottClaessens commented 2 months ago

I ran this simulation exercise again with a larger sample size (n = 200). The same pattern emerged even with the larger sample size: the model fitted to pre-standardised variables infers the wrong answer. So I'll work on implementing scale = TRUE.

library(coevolve)
vars <- c("x", "y")
set.seed(1)
sim <-
  coev_simulate_coevolution(
    n = 200,
    variables = vars,
    selection_matrix = matrix(
      c(0.90, 0.00,
        0.85, 0.90),
      nrow = 2,
      byrow = TRUE,
      dimnames = list(vars, vars)
    ),
    drift = c("x" = 0.05, "y" = 0.05),
    prob_split = 0.05
  )
# fit model
fit1 <-
  coev_fit(
    data = sim$data,
    variables = list(
      x = "normal",
      y = "normal"
    ),
    id = "species",
    tree = sim$tree,
    parallel_chains = 4,
    seed = 1
  )
coev_plot_delta_theta(fit1)

image

# standardise
sim$data$x <- as.numeric(scale(sim$data$x))
sim$data$y <- as.numeric(scale(sim$data$y))
# fit model
fit2 <-
  coev_fit(
    data = sim$data,
    variables = list(
      x = "normal",
      y = "normal"
    ),
    id = "species",
    tree = sim$tree,
    parallel_chains = 4,
    seed = 1
  )
coev_plot_delta_theta(fit2)

image

ScottClaessens commented 2 months ago

I have implemented the scale argument for the coev_fit() function in f243ded. The argument determines whether continuous and positive real variables are standardised under the hood before model fitting. The argument is set to TRUE by default. If the user sets scale = FALSE, the function returns a warning that the user should be careful to set sensible priors rather than rely on default priors.