florianhartig / DHARMa

Diagnostics for HierArchical Regession Models
http://florianhartig.github.io/DHARMa/
208 stars 22 forks source link

DHARMa not working with lme4::glmer family statsmod::tweedie #265

Closed cperk closed 3 years ago

cperk commented 3 years ago

I'm having trouble getting DHARMa to work for a glmer model with family tweedie. I realize that DHARMa runs for glmmTMB family tweedie, but I would like to use glmer because it allows me to specify var.power, which I don't believe is possible with glmmTMB. See the reproducible example below.

#create toy dataset
df1 <- data.frame(production=c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0), Treatment_Num=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), Genotype=c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))

#run tweedie model
df1_tweedie <- glmer(production ~ Treatment_Num +(1|Genotype),
                     data = df1, family=tweedie(var.power=1.9,link.power=0))

#load DHARMa
require(DHARMa)
#I get an error when I try running this line:
df1_tweedie_res <- simulateResiduals(df1_tweedie)

Screen Shot 2021-03-12 at 7 14 06 PM

florianhartig commented 3 years ago

OK, thanks for reporting this, I get the same result, and the problem is basically what the error message says, which is that no simulate function has been implemented for the tweedie. I have played a bit around with the code, there is double trouble because the tweedie is not directly implemented in lme4, but imported from package statmod, and it seems statsmod did not implement a simulation function for the tweedie, but also lme4 prevents tweedie simulations. To explain:

If you look, e.g., at the return of standard family binomial (type in binomial) you will see that the list of returned objects includes a simulate function

    structure(list(family = "binomial", link = linktemp, linkfun = stats$linkfun, 
        linkinv = stats$linkinv, variance = variance, dev.resids = dev.resids, 
        aic = aic, mu.eta = stats$mu.eta, initialize = initialize, 
        validmu = validmu, valideta = stats$valideta, simulate = simfun), 
        class = "family")

This is not the case when typing in statsmod::tweedie

    structure(list(family = "Tweedie", variance = variance, dev.resids = dev.resids, 
        aic = aic, link = paste("mu^", as.character(lambda), 
            sep = ""), linkfun = linkfun, linkinv = linkinv, 
        mu.eta = mu.eta, initialize = initialize, validmu = validmu, 
        valideta = valideta), class = "family")

Theoretically, you could overwrite this by locally wrapping the tweedie function and adding the simulate function

tweedie <- function (var.power = 0, link.power = 1 - var.power) {
  out = statmod::tweedie(var.power = var.power, link.power = link.power)

  simfun <- function(object, nsim) {
    # my implementation
  }

  out$simulate = simfun
  return(out)
}

but I just tried this, even this wouldn't work because lme4 explicitly checks if simulations are supported in lme4::.simulateFun, and this is not the case for the tweedie.

So, to cut a long story short: to fix this in a way that DHARMa can check a tweedie glmer out of the box, one would definitely have to

As an interim fix: you could create the simulations from the fitted model by hand, as described in https://github.com/florianhartig/DHARMa/wiki/Adding-new-R-packages-to-DHARMA

I will notify the package developers of lme4 and statsmod, and keep this open for a while as a reminder to check if something has happened, but there is nothing else I can do from the DHARMa side about this, it's a limitation of the upstream packages.

@bbolker - could you have a look at this for lme4?

@gksmyth - any thoughts from the statsmod side?

gksmyth commented 3 years ago

I have no plans to implement a tweedie()$simulate function.

Generating random variates from a Tweedie distribution isn't straightforward even with parameters properly specified. Trying to simulate data from within a glm fit without knowing the dispersion parameter seems problematic to me.

While gamma()$simulate and inverse.gaussian()$simulate currently implement ad hoc options to estimate the dispersion parameter on fly, this seems unsatisfying to me, and also somewhat inconsistent because the differerent functions are making different types of ad hoc choices.

florianhartig commented 3 years ago

Hi Gordon, many thanks for your thoughts. Just a few additions to this, feel free to comment:

I see that mgcv::gam with mgcv::Tweedie() basically has the same issue than the combination of lme4::glmer with statsmod::tweedie, i.e. the signature of the link function seems identical, and no simulation is implemented, which is why simulations with DHARMa fail.

I have to say that I still do not fully understand why it's so problematic to generate simulations from fitted tweedie GLM(M)s. I had assumed that the MLE of the fitted model would include the necessary parameters for the simulations (including dispersion), but from Gordon's comments and from the various discussions that I read, I now understand that the dispersion parameter is not directly estimated, and has to be obtained post-hoc. @cperk - if you want to do this by hand, there are various discussions about how to obtain these parameters, e.g.

@cperk - I have no idea how reliable those approximations are, in particular when including REs and all that. It seems @gksmyth is rather skeptical about this. I would have to do simulations myself to check. If you want to use this for your model, maybe test with a few null simulations, i.e. generate data from a Tweedie, fit GLM, check that DHARMa doesn't show any weird patters.

Would be interesting to know how glmmTMB solved the problem, as they obviously provide simulations for fitted tweedie models. Maybe @bbolker can shed some light on that - are those glmmTMB simulations also based on approximations, and could they maybe fail?

bbolker commented 3 years ago

glmmTMB does "true" maximum likelihood estimation (up to numerical accuracy, and up to using the Laplace approximation for random effects), so it really does estimate the mean, dispersion, and power parameter. The relevant code is here (Tweedie PDF in the TMB code); here (called in glmmTMB's likelihood calculation loop); and here (simulation code).

I didn't actually implement or test any of this stuff myself, and I think it's mostly taken from elsewhere: the TMB code says "Re-structured version of tweedie density function from 'cplm' package"; and the simulation code says "Copied from R function tweedie::rtweedie".

florianhartig commented 3 years ago

Hi Ben, OK, that is great to know for sure. I'll try to make some tests for numeric stability but let's assume for the moment glmmTMB works.

But what about lme4, which was the OP's original concern? I understand she would like to use lme4 because of the var.power option which is not available in glmmTMB? Is there any way simulate reliably from lme4 tweedie fits, or could this be implemented in lme4?

bbolker commented 3 years ago

A little hard for me to say because I'm completely unfamiliar with the use of Tweedie in glmer fits. It sounds like what the OP wants to do is to fix the power parameter in a Tweedie fit (e.g., fit a Tweedie model with a specified power parameter rather than estimating it).

ncores <- 8 ## change if desired
set.seed(101)
system.time(dd <- data.frame(
    x=tweedie::rtweedie(n=1000, power=1.5, mu=2, phi=1)))
library(glmmTMB)
system.time(m0 <- glmmTMB(x~1, data=dd, family=tweedie,
                          control=glmmTMBControl(parallel=ncores)))
tpow_invlink <- function(x) { 1 + plogis(x)}
tpow_link <- function(x) { qlogis(x-1) }
stopifnot(isTRUE(all.equal(1.75,tpow_link(tpow_invlink(1.75)))))
tpow_invlink(m0$fit$par[["thetaf"]]) ## 1.488
m1 <- update(m0,map=list(thetaf=factor(NA)),
             start = list(thetaf=tpow_link(1.75)))

Problems with this:

But it might be sufficient ... ?

Actually reading the original post, it looks like this will work:

df1_tweedie <- glmmTMB(production ~ Treatment_Num +(1|Genotype),
                       data = df1,
                       family=tweedie(link="log"),
                       map=list(thetaf=factor(NA)),
                       start = list(thetaf=tpow_link(1.9)))
library(DHARMa)

df1_tweedie_res <- simulateResiduals(df1_tweedie)
plot(df1_tweedie_res)

Looks ugly as heck, of course, but presumably this is a toy example.

florianhartig commented 3 years ago

Hi Ben, wow, this is neat, I didn't know about map, although it is clearly documented of course. OK, for completeness, this would be the code that worked for me (I had to modify glmmTMB::tweedie as I had statsmod loaded before)

tpow_link <- function(x) { qlogis(x-1) }
df1_tweedie <- glmmTMB(production ~ Treatment_Num +(1|Genotype),
                       data = df1,
                       family=glmmTMB::tweedie(link="log"),
                       map=list(thetaf=factor(NA)),
                       start = list(thetaf=tpow_link(1.9)))

What is a bit inconvenient is that summary(df1_tweedie) didn't work for me, I presume because of the constant theta = NA which is probably something that the summary is looking for, but the parameter estimates in df1_tweedie are virtually identical to glmer, so I assume this is indeeds the same model. So, it seems to me immediate problem solved (as one could check residuals in glmmTMB, but then take the regression table from glmer).

It would still be nice to have simulations for lme4 and mgcv as well, as I suspect that there is a sizeable number of people that would prefer to work with these packages (I guess for mgcv there is no alternative), but OK, that's up to the respective package developers then.

gksmyth commented 3 years ago

Let me make some comments on the TMB code and MLE:

  1. TMB does not implement MLE for the whole Tweedie family but only for a sub-family with variance power between 1 and 2.

  2. Maximum likelihood is not a satisfactory method for estimating the dispersion for the purposes of simulation because it is biased and it will generate deviates that are systematically less dispersed than the original observations.

  3. The power parameter is already fixed once a tweedie glm is formed. Reestimating a power parameter from within a Tweedie glm will produce unpredictable and incorrect results.

  4. The TMB code is based on the cplm package, which is based on the tweedie package (written by Pete Dunn and me), which is based on equations from Dunn & Smyth (2005). So I know what the code is doing.

cperk commented 3 years ago

Thank you all for your help! I want to be able to specify var.power because seems to allow me to really choose the distribution for the data I'm modeling, which is continuous data with a few zeros:

The most interesting Tweedie families occur for var. power between 1 and 2. For these GLMs, the response distribution has mass at zero (i.e., it has exact zeros) but is otherwise continuous on the positive real numbers (Smyth, 1996; Hasan et al, 2012). ... power between 0 and 1 (Jorgensen 1987).

http://finzi.psych.upenn.edu/R/library/statmod/html/tweedie.html#:~:text=The%20most%20interesting%20Tweedie%20families,Hasan%20et%20al%2C%202012).&text=power%20between%200%20and%201%20(Jorgensen%201987 ).

On Sun, Mar 14, 2021 at 5:02 PM Ben Bolker @.***> wrote:

A little hard for me to say because I'm completely unfamiliar with the use of Tweedie in glmer fits. It sounds like what the OP wants to do is to fix the power parameter in a Tweedie fit (e.g., fit a Tweedie model with a specified power parameter rather than estimating it).

ncores <- 8 ## change if desired set.seed(101) system.time(dd <- data.frame( x=tweedie::rtweedie(n=1000, power=1.5, mu=2, phi=1))) library(glmmTMB) system.time(m0 <- glmmTMB(x~1, data=dd, family=tweedie, control=glmmTMBControl(parallel=ncores)))tpow_invlink <- function(x) { 1 + plogis(x)}tpow_link <- function(x) { qlogis(x-1) } stopifnot(isTRUE(all.equal(1.75,tpow_link(tpow_invlink(1.75))))) tpow_invlink(m0$fit$par[["thetaf"]]) ## 1.488m1 <- update(m0,map=list(thetaf=factor(NA)), start = list(thetaf=tpow_link(1.75)))

Problems with this:

  • I'm not quite sure this is what the OP wants.
  • it requires detailed knowledge of the guts (i.e. thetaf=name of power parameter, this parameter is fitted on the logit(x-1) scale)
  • output of the model shows the power parameter as NA rather than the specified 1.75 value

But it might be sufficient ... ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/florianhartig/DHARMa/issues/265#issuecomment-798979223, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFCT4K6XT737IXOBC72Q6ULTDUP67ANCNFSM4ZDF3CYQ .

bbolker commented 3 years ago

Thanks!

bbolker commented 3 years ago

@cperk, if you're going to specify a fixed value for the power parameter (var.power), how are you going to choose it? I agree that Tweedie is useful for positive data with a few zeros, but it's not obvious to me why you wouldn't estimate the power parameter alongside the other parameters (unless you're data-limited and have a sensible a priori guess). (Maybe it's like the t-distribution, where the shape parameter is theoretically estimable but often really problematic/weakly identifiable in practice?)

gksmyth commented 3 years ago

Dear Florian,

My reason for saying that a statmod::tweedie()$simulate function is problematic is not because it can't be done (I could write such a function in 10 minutes) but because:

  1. Such a function would be slow and would be numerically unstable for non-trivial choices of parameters;
  2. I disagree with the whole concept of imbedding a simulate function within a glm family. It is not the right place for it.
  3. I do not want to maintain such code nor to defend the purposes that people might put it to.

It is fundamental glm theory that the MLE fit is for the coefficients only, not the dispersion parameter. In R, the dispersion parameter is instead estimated by the summary and anova functions. summary does not use MLE for the dispersion but instead uses a Pearson estimator, which is less biased and more robust than MLE for the dispersion.

I myself recommend the Pearson estimator for most purposes. Conditional likelihood estimators are theoretically better (i.e., have smaller variance) but are numerically difficult (and are also against the spirit of glms).

The stackexchange and github links you give are not so much discussions of how to estimate the dispersion parameter as simply references to the well known fact that summary applied to a glm object returns the (Pearson) dispersion estimator.

Simulating data in a GLMM context is presumably quite different from a GLM context. The dispersion estimators would be quite different also.

florianhartig commented 3 years ago

OK, many thanks for all your thoughts, this was a very fruitful discussion (at least for me). I think there is a certain danger that this thread is diverting into too many subtopics, so I would like to recapitulate / wrap up by saying:

The more fundamental problem that has emerged for me is if checks of Tweedie GLMs with glmmTMB (or other models) are reliable. Gordon, first of all, many thanks for your explanations, and I understand and agree with the points, and also why you don't want to implement such a function on statmod::tweedie. With the comments that you made, I agree that it would probably be better placed in lme4 anyway.

From the viewpoint of DHARMa, I would say that there is no issue here. I rely on the simulations that I get from the packages, and if glmmTMB simulates from the MLE, and this MLE is biased, then a residual pattern in DHARMa is not necessarily something that I would find undesirable, as it highlights a bias that exists in the model. However, one should probably alert the user to the fact that those problems may be normal for a Tweedie fit. I will run some simulations to see how this plays out in real world data.

As a side note: DHARMa has a second, more time-consuming option of calculating residuals under a full parametric bootstrap (i.e., rather than comparing simulated to observed data, I compare original to re-fitted residuals) for exactly this type of problem (biased estimators). I am not sure if this helps for this particular dispersion problem, but I will also check this out.

But this is stuff for an extra thread, I think I have trespassed enough on your time, many thanks again for the comments!

florianhartig commented 3 years ago

OK, I created a new issue for exploring the bias question, you can subscribe to that if you want to get updates, and else I would say @cperk feel free to close this, I had the feeling that your immediate problem is solved (with all the new caveats that we unraveled instead).

bbolker commented 3 years ago

Feel free to open a glmmTMB issue about the NA value in summary() when the parameter is fixed, if I don't get around to it.

cperk commented 3 years ago

Thank you @bbolker for your excellent solution to my issue! I'm excited to try it, though I will take the output with a grain of salt because I understand that the simulations may not be reliable.

I know this discussion is winding down, but I wanted to answer your question as to how I would get the optimum power parameter value. I had planned to use maximum likelihood as described in this blog post:

https://towardsdatascience.com/insurance-risk-pricing-tweedie-approach-1d71207268fc

#toy dataset from my original post
df1 <- data.frame(production=c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0), Treatment_Num=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), Genotype=c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))

#try a range of power values:

out.toy <- tweedie.profile( production ~ Treatment_Num +(1|Genotype),
                            data = df1,link.power = 0,do.smooth = TRUE, do.plot = TRUE, verbose = TRUE)

max_lik_toy_march15

The optimum value of p would be the point at which the curve flattens out. The toy data isn't a great example because the curve doesn't flatten out. The graph in the blog post provides a better visual.

bbolker commented 3 years ago

You don't need to do this, glmmTMB will do it for you!

library(glmmTMB)
library(insuranceData)
library(tweedie)
data(dataCar)
system.time(est_p <-tweedie.profile(claimcst0 ~ veh_value+veh_body+veh_age+ gender+area+agecat,
                                    data=dataCar,link.power = 0,do.smooth = TRUE, do.plot = TRUE))
est_p$xi.max

system.time(fit <-glmmTMB(claimcst0 ~ veh_value+veh_body+veh_age+ gender+area+agecat,
                          data=dataCar,
                          control=glmmTMBControl(parallel=2),
                          family=glmmTMB::tweedie))
bbolker commented 3 years ago

For your toy data, glmmTMB runs into trouble because it tries to make the power parameter equal to 1 (which is the boundary of the feasible space).

florianhartig commented 3 years ago

Thanks again for all your comments. I'll close this now and we'll continue discussions about possible problems with a MLE bias in the new thread!