This is due to extremely strong prior-likelihood conflict, I think, along with the small sample size. If you run:
```r
var(sim$data$y) / var(sim$data$x)
```
You'll see that the variance of y is roughly 80 times greater than that of x. So when we fit the model without standardizing, the prior implies a vastly larger effect of y on x than vice versa. From the model's point of view, the two traits co-vary and we've told it that y -> x is far more likely / of far greater magnitude than x -> y, so it settles on the wrong answer. I am thinking here in terms of the prior predictive variance.
For example, in a linear regression with slope $b$, we can think of the prior predictive variance contributed by a predictor $x$ to the outcome as $\mathrm{E}[\mathrm{Var}(bx \mid b)] = \mathrm{Var}(b)\,\mathrm{Var}(x)$ (for a zero-mean prior on $b$). So with a $N(0,1)$ prior on $b$, the expected prior predictive variance is simply the variance of the predictor, and the ratio of prior predictive variance to the sample variance of the outcome reduces to $\mathrm{Var}(x)/\mathrm{Var}(y)$. Similar to $R^2$, it gives a sense of how much variance, a priori, we expect to be captured by some predictor.
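To make that concrete, here is a small R sketch (just an illustration, not package code) of the prior predictive variance ratio for two variables on very different scales:

```r
set.seed(1)
x <- rnorm(50, 0, 1)  # predictor with unit variance
y <- rnorm(50, 0, 9)  # outcome with ~80x the variance of x

# with a N(0, 1) prior on the slope b, E[Var(b * x | b)] = Var(b) * Var(x) = Var(x),
# so the prior predictive variance relative to the outcome's sample variance is:
var(x) / var(y)  # tiny: a priori, x is expected to capture almost none of y's variance

# flipping the roles, the prior expects y to account for far more variance than x even has:
var(y) / var(x)  # roughly 80
```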
Thanks @erik-ringen. This makes sense. This raises a couple of questions regarding the package and our paper:
- Should we standardise continuous variables by default under the hood? This is tricky because we would normally prefer to leave outcome variables on their natural measurement scale, and we can't do exactly the same for, e.g., positive reals modelled as lognormal. But it's quite likely that people will apply the package to real-world continuous variables with very different variances, and we wouldn't want them getting the wrong answer. An alternative would be to scale the priors under the hood instead, as we already do for the negative binomial case.
Another option would be to check whether all the variables the user has mapped onto a normal distribution have unit variance, and, if not, give a warning explaining that they should standardize and/or adjust the priors appropriately. In general, I think it would be good practice to output a warning whenever the default priors are used, suggesting to users that these might not be ideal and that they should run prior predictive checks.
Also, for positive reals I usually just scale them by dividing by the mean or median value.
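Roughly, such a check might look something like this (a hypothetical helper, not anything in the package; I'm assuming `data` and `variables` arguments shaped like those of `coev_fit()`):

```r
# hypothetical pre-fit check (not in the package): warn when variables mapped to
# normal / student_t distributions are far from unit variance, and scale positive
# reals by their mean
check_scales <- function(data, variables, tol = 0.1) {
  for (v in names(variables)) {
    dist <- variables[[v]]
    if (dist %in% c("normal", "student_t")) {
      s <- stats::sd(data[[v]], na.rm = TRUE)
      if (abs(s - 1) > tol) {
        warning(
          "Variable '", v, "' has SD = ", round(s, 2), ". Consider standardising ",
          "it or adjusting the default priors (and run prior predictive checks).",
          call. = FALSE
        )
      }
    } else if (dist == "lognormal") {
      # one option for positive reals: divide by the mean (or median)
      data[[v]] <- data[[v]] / mean(data[[v]], na.rm = TRUE)
    }
  }
  data
}
```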
- For the simulation in our paper, should we just standardise the variables as we had done before? This would mean that we wouldn't be able to compare to the true value of $\Delta\theta_{z}$ that you had calculated.
I think we can compare to the true value if we are careful: during the $\Delta\theta_{z}$ calculation we would just have to rescale using the original means and SDs.
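Just to illustrate the general idea (hypothetical code, not the actual $\Delta\theta_{z}$ calculation, whose exact back-transformation will depend on how it is defined):

```r
# illustration only: keep the original means/SDs so effects estimated on
# standardised variables can be mapped back to the measurement scale
x <- rnorm(100, 10, 2)   # raw variables on their natural scales
y <- rnorm(100, 50, 18)
sds <- c(x = sd(x), y = sd(y))
x_std <- (x - mean(x)) / sds["x"]
y_std <- (y - mean(y)) / sds["y"]

# a slope-like effect of x on y in SD units back-transforms like this
effect_std <- 0.5                             # placeholder value
effect_raw <- effect_std * sds["y"] / sds["x"]
```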
We could maybe include an argument, such as `coev_fit(..., scale = TRUE)`, which standardises variables for users. Perhaps this could be set to `TRUE` by default. If the user sets `scale = FALSE`, the function could return the warning you suggested whenever the variables do not have unit variance on the continuous / log scale.
I'm wondering if we could standardise as usual for "normal" and "student_t" variables and standardise on the log scale for "lognormal" variables?
```r
# simulate normal, student_t, and lognormal variables
n <- 1000
x <- rnorm(n, 0, 2)
y <- brms::rstudent_t(n, 3, 0, 2)
z <- rlnorm(n, 0, 2)

# standardisation approach under the hood if scale = TRUE
as.numeric(scale(x))
as.numeric(scale(y))
as.numeric(exp(scale(log(z))))  # standardise z on the log scale, then exponentiate
```
I'll go ahead with the standardised simulation for the paper, but keep track of the means and SDs for the $\Delta\theta_{z}$ calculation.
I ran this simulation exercise again with a larger sample size (n = 200). The same pattern emerged: the model fitted to the unstandardised variables infers the wrong answer. So I'll work on implementing `scale = TRUE`.
```r
library(coevolve)

# simulate coevolution where x influences y but not vice versa
vars <- c("x", "y")
set.seed(1)
sim <-
coev_simulate_coevolution(
n = 200,
variables = vars,
selection_matrix = matrix(
c(0.90, 0.00,
0.85, 0.90),
nrow = 2,
byrow = TRUE,
dimnames = list(vars, vars)
),
drift = c("x" = 0.05, "y" = 0.05),
prob_split = 0.05
)
# fit model to the raw (unstandardised) data
fit1 <-
coev_fit(
data = sim$data,
variables = list(
x = "normal",
y = "normal"
),
id = "species",
tree = sim$tree,
parallel_chains = 4,
seed = 1
)
coev_plot_delta_theta(fit1)
# standardise
sim$data$x <- as.numeric(scale(sim$data$x))
sim$data$y <- as.numeric(scale(sim$data$y))
# fit model to the standardised data
fit2 <-
coev_fit(
data = sim$data,
variables = list(
x = "normal",
y = "normal"
),
id = "species",
tree = sim$tree,
parallel_chains = 4,
seed = 1
)
coev_plot_delta_theta(fit2)
```
I have implemented the `scale` argument for the `coev_fit()` function in f243ded. The argument determines whether continuous and positive real variables are standardised under the hood before model fitting, and it is set to `TRUE` by default. If the user sets `scale = FALSE`, the function returns a warning that they should be careful to set sensible priors rather than rely on the default priors.
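For reference, usage would then look something like this (a sketch assuming the behaviour described above and the simulated `sim` data from earlier):

```r
# default: continuous / positive real variables are standardised under the hood
fit_scaled <- coev_fit(
  data = sim$data,
  variables = list(x = "normal", y = "normal"),
  id = "species",
  tree = sim$tree,
  scale = TRUE
)

# scale = FALSE leaves variables on their raw scales and warns the user to set
# sensible priors rather than rely on the defaults
fit_raw <- coev_fit(
  data = sim$data,
  variables = list(x = "normal", y = "normal"),
  id = "species",
  tree = sim$tree,
  scale = FALSE
)
```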
With the `coev_simulate_coevolution()` function, we can simulate the coevolution of multiple variables and test whether `coev_fit()` is able to recover the true data-generating process. In a previous iteration of the package, the simulation function standardised the continuous variables after the simulation had finished, and in a simulation study with this function I found that the model did pretty well at recovering the right answer. However, in order to compare to the "true" $\Delta\theta_{z}$ value, I modified the simulation function to leave the variables unstandardised upon finishing the simulation.

Strangely, when using this new "unstandardised" simulation function, the model now tends to infer the wrong direction of coevolution. Here's a reproducible example where we know that X → Y but not vice versa:
Here are the resulting $\Delta\theta_{z}$ values for both models. As you can see, only `fit2` gives the right answer.

@erik-ringen Maybe you can help here? Why do you think the model is doing this? I've tried this with a few different seeds and the same pattern emerges.