Yep, I have thought about this as well. Good idea to shift to Gamma. Should we leave this until we have made the more substantive changes to the drift terms?
Yeah let's prioritize the drift change. This change is comparatively simple.
I'm happy to work on this tomorrow :)
@ErikRingen I'm implementing this now. When it comes to `scale = TRUE`, how should we approach scaling positive reals that users model with the gamma distribution? I could do `exp(scale(log(x)))`, which is what I had done for lognormal variables. Do you think this makes sense? Or alternatively `scale(x, center = FALSE)`, which rescales without centering (dividing by the root mean square rather than the SD) and so ensures that all values remain positive. I think I will probably go with this for now, but let me know if you have a better idea.
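For concreteness, the two options would look something like this (an illustrative sketch, not the package code):

```r
# Illustrative comparison of the two candidate scalings
set.seed(1)
x <- rgamma(100, shape = 1, rate = 0.1)  # example positive data

# Option 1: standardize on the log scale, then exponentiate back;
# the result is positive with geometric mean 1 and unit SD on the log scale
x_log_scaled <- exp(as.vector(scale(log(x))))

# Option 2: rescale without centering; values stay positive
# (note: with center = FALSE, R divides by the root mean square, not the SD)
x_rms_scaled <- as.vector(scale(x, center = FALSE))
```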
For positive reals I always scale by either the mean or the max value, with no centering, expressing the data as either a fraction of the mean or a fraction of the max.
Scaling by the maximum value probably seems sensible here, given that users might enter variables with very long tails?
Then regarding the link function, we could use either the log link or the softplus link that we have been using for Poisson and negative binomial variables. I'll go with the log link for now, but happy to change if preferred.
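For reference, a minimal sketch of mean/max scaling and the two inverse links (again just an illustration; `eta` here is a stand-in for the latent trait value):

```r
set.seed(1)
x <- rgamma(100, shape = 1, rate = 0.1)  # example positive data

# Mean or max scaling, with no centering
x_mean_scaled <- x / mean(x)  # data as a fraction of the mean
x_max_scaled  <- x / max(x)   # data as a fraction of the max

# Inverse link functions mapping a latent value eta to a positive mean
eta <- seq(-5, 5, length.out = 100)
mu_log      <- exp(eta)         # inverse of the log link
mu_softplus <- log1p(exp(eta))  # inverse of the softplus link
```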
I think log link is fine at least to start with. But for this question and the max/mean scaling, I think we need to simulate from the prior to check reasonableness. Because if we do max scaling and the default priors predict many values near/greater than the sample max, that is clear misspecification of the model.
Okay, good idea. What I'll do is implement the max scaling for now, and then do some prior predictive checks once everything is up and running. Then we can either change the prior or change the way the data are scaled.
So I'm checking the `gamma(0.01, 0.01)` prior for the gamma response distribution. It looks like prior predictions are clustering around 1. So perhaps we should do mean scaling instead of max scaling? What do you think @ErikRingen?
```r
library(coevolve)
library(ggplot2)

n <- 20
tree <- ape::rcoal(n)
d <- data.frame(
  id = tree$tip.label,
  x = rgamma(n, shape = 1, rate = 0.1),
  y = rgamma(n, shape = 1, rate = 0.1)
)

fit <- coev_fit(
  data = d,
  variables = list(
    x = "gamma_log",
    y = "gamma_log"
  ),
  id = "id",
  tree = tree,
  prior_only = TRUE,
  adapt_delta = 0.99,
  parallel_chains = 4
)

coev_plot_predictive_check(fit)$x +
  geom_vline(xintercept = 1) +
  scale_x_log10(
    name = "response variable (log scale)",
    limits = c(0.0001, 10000)
  )
```
Yeah I would try mean scaling instead.
Great, I've reverted to mean scaling. I'll open a pull request.
Implemented in ac8f353.
While considering how to properly handle Gaussian responses, I also realized that the log-normal implementation is incorrect, because of the non-linear relationship between the location and scale components of the log-normal distribution. Rather than inheriting `sigma_drift`, the `eta` values should be updated with the terminal drift parameters, just as we would for any response other than Gaussian/Student-t. Then, in the likelihood, the log-normal has its own scale parameter. However, in practice I found the convergence of such a model to be poor. So I suggest we switch to gamma as our offering for a positive continuous likelihood, at least for now. We can use the same parameterization and default prior as brms for the shape parameter:

```stan
shape ~ gamma(0.01, 0.01);
target += gamma_lpdf(y[i] | shape, shape / exp(eta[t, tip_id[i]]));
```
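As a quick sanity check on this parameterization (a sketch, not the package code): with `rate = shape / mu`, the gamma distribution has mean `mu`, so the latent trait `eta` controls the mean on the log scale, with `shape` governing relative dispersion.

```r
# Verify that gamma(shape, rate = shape / mu) has mean mu
# and variance mu^2 / shape
set.seed(1)
shape <- 2
eta <- 0.5      # stand-in for a latent trait value
mu <- exp(eta)  # log link: positive mean

y <- rgamma(1e6, shape = shape, rate = shape / mu)
mean(y)  # ~ exp(0.5) = 1.649
var(y)   # ~ mu^2 / shape = 1.359
```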