gavinsimpson / gratia

ggplot-based graphics and useful functions for GAMs fitted using the mgcv package
https://gavinsimpson.github.io/gratia/
Other
206 stars 28 forks source link

simulate.gam with ocat returning unexpected values #319

Open hhp94 opened 3 weeks ago

hhp94 commented 3 weeks ago

In simulate-methods.R in gratia, for ocat models, predict(type = "response") currently returns the probability of each category. However, the fix.family.rd function expects a vector of linear predictor values instead. This discrepancy leads to incorrect handling in the simulation.gam output for ocat.

I've pasted the mgcv::fix.family.rd function for the ocat family here for reference:

function (mu, wt, scale) 
{
  theta <- get(".Theta")
  R = length(theta) + 2
  alpha <- rep(0, R + 1)
  alpha[1] <- -Inf
  alpha[R + 1] <- Inf
  alpha[2] <- -1
  if (R > 2) {
    ind <- 3:R
    # looks like alpha is the vector of cut points along logistic distribution
    alpha[ind] <- alpha[2] + cumsum(exp(theta))
  }
  y <- u <- runif(length(mu))
  # mu should be on the same scale as logit(u)?
  u <- mu + log(u/(1 - u))
  for (i in 1:R) {
    y[u > alpha[i] & u <= alpha[i + 1]] <- i
  }
  y
}

Reproducible Example

library(gratia)
library(mgcv)
d <- data_sim(dist = "ocat")
m <- gam(y ~ s(x0), data = d, family = ocat(R = 4), method = "REML")
sim <- simulate(m, data = d[1, ])
attr(sim, "seed") <- NULL
sim

In this example, sim is a $4 \times 1$ matrix because fix.family.rd interpreted the output of predict(type = "response"), which is Pr(y = 1, 2, 3, 4), as linear predictor values, leading to unexpected results. I'm fairly sure this is a bug. If you agree, let me know how you'd like to fix it and I can open a PR. Should predict(type = "link") be used in simulate.gam instead?

Thank you so much for teaching me about gam and {mgcv}.

gavinsimpson commented 3 weeks ago

Thanks for the report. This is one of those unintentional things; I didn't know the ocat() had an $rd component. These non-standard families need specialist support and that's something I hadn't gotten to yet (my recollection was that none of them had $rd in the family).

This is a bug (I should probably check to see if a particular family is supported and throw an error for those that don't work). And yes, the fix will be to pass in a linear predictor value for each observation we're simulating for. I don't think the fix is that trivial however; I will need to handle all the other non-standard families and some of those will require different handling to that needed for ocat(). I have begun to tackle this in fitted_values(), where I have a list of standard families that just work normally and then other families that need special handling through a pre- or post-processing step. This is how I'm handling ocat() in fitted_values() for example. I'll need to think how best to do this in simulate() and also try to not duplicate code as the details of all these parameterisations for all the families makes my head hurt from time to time.

hhp94 commented 3 weeks ago

Oh, I’d be happy to help! To be honest, I'm still getting familiar with posterior simulations on a theoretical level, so working with your implementations in gratia is a great opportunity for me to learn. Nevertheless, I'll go through your vignette more carefully, review the fitted_values code, and see what might be the best approach. In the meantime, if you have an idea and would like help implementing it, just let me know!