alexpghayes / distributions3

Probability Distributions as S3 Objects
https://alexpghayes.github.io/distributions3/
Other
100 stars 16 forks source link

Predicted sigma from lm/glm #89

Closed zeileis closed 2 years ago

zeileis commented 2 years ago

In the new prodist() method we return for a model the probability distribution associated with the point estimates from that model. In case of the linear model (or the Gaussian generalized linear model) it is uncontroversial what the point estimate for the location parameter mu is. However, there are two commonly used flavors for sigma: Either based on the unbiased least squares estimate (division by n-k) or the biased maximum likelihood estimate (division by n).

In the prodist() methods for lm and glm objects I implemented the least squares estimate because that is what the summary() methods for both lm and glm objects report. However, I realized now, that somewhat inconsistently the logLik() methods for both objects use the maximum likelihood estimate. Hence, wondered:

Alex @alexpghayes maybe you have an opinion on this? Or maybe someone else?

For illustration: In the summary() method we have the following.

m <- lm(dist ~ speed, data = cars)
c(summary(m)$sigma, summary(m)$sigma^2)
## [1]  15.37959 236.53169
summary(m)
## ...
## Residual standard error: 15.38 on 48 degrees of freedom
## [...]
summary(glm(dist ~ speed, data = cars))
## [...]
## (Dispersion parameter for gaussian family taken to be 236.5317)
## [...]

Hence, up to now we have:

pd <- prodist(m)
pd[1]
## "Normal distribution (mu = -1.849, sigma = 15.38)"

But with this we cannot replicate the log-likelihood:

logLik(m)
## 'log Lik.' -206.5784 (df=3)
log_likelihood(pd, cars$dist)
## [1] -206.599

To do so we would need the maximum likelihood estimate:

pd$sigma <- sqrt(mean(residuals(m)^2))
 pd[1]
## "Normal distribution (mu = -1.849, sigma = 15.07)" 
log_likelihood(pd, cars$dist)
## [1] -206.5784
alexpghayes commented 2 years ago

My take here is that (G)LMs are really semi-parametric models and so likelihoods for these methods aren't well defined / analytically available.

I'm realizing that in general I would be more comfortable if "prodist" objects actually had a totally different class than distribution objects, because sampling distributions of estimates are not probability distributions, but really a separate kind of object (see, for example, http://arxiv.org/abs/1707.00486, or [1], or much of Ryan Martin's work on IMs). If we were working with posterior distributions of parameters I'd be comfortable with treating estimates as properly distributional, but for frequentist estimates it seems likely to lead to subtle confusions to me.

[1] Schweder, Tore, and Nils Lid Hjort. Confidence, Likelihood, Probability: Statistical Inference with Confidence Distributions. Cambridge: Cambridge University Press, 2016. https://doi.org/10.1017/CBO9781139046671.

zeileis commented 2 years ago

Good points! I think that the more general view on the distributions from the prediction should go into the documentation to clarify what prodist() currently does and what it doesn't do. It would be nice to have this more general idea implemented but I don't see how we could easily do this at the moment.

Regarding the semiparametric nature: I fully agree for the LM while for the GLM I'm always torn. It certainly can be used as a semiparamteric model but it also is often motivated as a fully parametric distributional model.

For the normal model: I think this would warrant a method to switch between the two estimators. Any thoughts on the better and/or less confusing default?

alexpghayes commented 2 years ago

For the normal model: I think this would warrant a method to switch between the two estimators. Any thoughts on the better and/or less confusing default?

I think either is fine. You could message() informatively to avoid confusion perhaps.

mnlang commented 2 years ago

First of all, thank you for the great work you have put into the package and apologies for my radio silence the last months. I really hope that I will find more time to help from October onwards!

I don't have a strong opinion on the variance estimator of (G)LMs, but I think it would be preferable if it were consistent with the log-likelihood. Hence, I like @zeileis's proposal to make it optional and would set the default to the maximum likelihood estimate?!

zeileis commented 2 years ago

Thanks Alex @alexpghayes and Moritz @mnlang for the feedback. There is now a PR (https://github.com/alexpghayes/distributions3/pull/91) that implements both flavors of scale estimates in both prodist.lm and prodist.glm. In the latter I made sure that it also works for the stats::Gamma family, not only the gaussian family. I like that the logLik() results can now be replicated via sum(log_pdf(prodist(...), ...)). Hence, I also added formal tests for the log-likelihood from various models/families.

I also improved the documentation but decided to not go into too much detail. Hopefully there will be a paper that can dive into that in some more detail.