JuliaStats / GLM.jl

Generalized linear models in Julia
Other
593 stars 114 forks source link

Implement `nulldeviance` and `nullloglikelihood` for GLMs #479

Closed nalimilan closed 2 years ago

palday commented 2 years ago

From McCullagh and Nelder: image

For the Binomial, m is the number of Bernoulli trials, i.e. the first parameter in the usual formulation.

nalimilan commented 2 years ago

What am I supposed to conclude from this excerpt? :-D AFAICT the call to devresid is equivalent to these formulas, right? But I'm still not sure how to handle offsets.

palday commented 2 years ago

Yeah, the excerpt was also just for my own note-taking as the closest thing I found to an efficient closed-form formula for the null deviance.

I think offsets are just always awkward to handle. The offset is added to mu, right? Then we have mean(mu+offset) = mean(mu) + mean(offset) and can just use that in all our calculations, which will give the same problems you mentioned in R (https://github.com/JuliaStats/GLM.jl/pull/477#issuecomment-1101745178).

For models without an intercept, the null model is actually the same as forcing the mean to zero, right? I'm not sure if that will work numerically for some distributions, but maybe we should warn/error in those cases.

nalimilan commented 2 years ago

I think offsets are just always awkward to handle. The offset is added to mu, right? Then we have mean(mu+offset) = mean(mu) + mean(offset) and can just use that in all our calculations, which will give the same problems you mentioned in R (#477 (comment)).

Not exactly, but close: the offset is included in eta, i.e. eta = XB + offset.

After some more investigations, I've made some progress: I'm able to compute the null deviance for a Poisson model with a log link and an offset. I'm not sure why it doesn't work for a Normal model with a log link and an offset. I'll see whether I can find a solution, and if not I will at least push what I have. EDIT: the Normal distribution is probably one of the cases that the R docs refer to with "this will be incorrect if the link function depends on the data other than through the fitted mean". Maybe we should just throw an error for distributions with a dispersion parameter, as it's always better than returning incorrect results.

For models without an intercept, the null model is actually the same as forcing the mean to zero, right? I'm not sure if that will work numerically for some distributions, but maybe we should warn/error in those cases.

I wondered the same thing. The null model is defined as the one with only the intercept in the docstring currently. For models without an intercept, it could make sense to define it as the model for which all predicted values are 0. Though these things are confusing, so throwing an error would be safer, at least as long as users don't request that feature. This matters especially because people may call r2 for GLMs without realizing what null model is taken as the reference.

codecov-commenter commented 2 years ago

Codecov Report

Merging #479 (a58f8e6) into master (f149f2b) will increase coverage by 0.28%. The diff coverage is 92.30%.

@@            Coverage Diff             @@
##           master     #479      +/-   ##
==========================================
+ Coverage   87.16%   87.44%   +0.28%     
==========================================
  Files           7        7              
  Lines         849      900      +51     
==========================================
+ Hits          740      787      +47     
- Misses        109      113       +4     
Impacted Files Coverage Δ
src/glmfit.jl 81.25% <92.30%> (+1.95%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f149f2b...a58f8e6. Read the comment docs.

nalimilan commented 2 years ago

I had code that computes the null deviance/log-likelihood for Poisson models with offsets, but I'm afraid it loses precision significantly in some cases due to the back and forth between operations on the response scale and operations on the linear predictor scale. For a Binomial model with large offsets, this gave an infinite deviance. Given that we need a fallback that does fit the null model for other distributions anyway, I figured it would be simpler to always use it when there are offsets. I've kept the intermediate commits if you want to have a look.

nalimilan commented 2 years ago

Another issue is that for models fitted with negbin, the null model is taken to be the one with the same theta parameter, which will have a worse fit than calling negbin again. Maybe negbin should fill a field in the model object to indicate that theta was chosen automatically so that we throw an error from nulldeviance and nullloglikelihood instead of returning misleading results? (We can't always throw an error for NegativeBinomial as users can specify theta manually.)

nalimilan commented 2 years ago

Given that we now have a clear definition the null model when theere's no intercept in StatsAPI (https://github.com/JuliaStats/StatsAPI.jl/pull/14) and an implementation for linear models (https://github.com/JuliaStats/GLM.jl/pull/481), I've pushed changes to support the no intercept case.

nalimilan commented 2 years ago

@palday Good to go?