Closed zeileis closed 2 years ago
My take here is that (G)LMs are really semi-parametric models and so likelihoods for these methods aren't well defined / analytically available.
I'm realizing that in general I would be more comfortable if "prodist" objects actually had a totally different class than distribution
objects, because sampling distributions of estimates are not probability distributions, but really a separate kind of object (see, for example, http://arxiv.org/abs/1707.00486, or [1], or much of Ryan Martin's work on IMs). If we were working with posterior distributions of parameters I'd be comfortable with treating estimates as properly distributional, but for frequentist estimates it seems likely to lead to subtle confusions to me.
[1] Schweder, Tore, and Nils Lid Hjort. Confidence, Likelihood, Probability: Statistical Inference with Confidence Distributions. Cambridge: Cambridge University Press, 2016. https://doi.org/10.1017/CBO9781139046671.
Good points! I think that the more general view on the distributions from the prediction should go into the documentation to clarify what prodist()
currently does and what it doesn't do. It would be nice to have this more general idea implemented but I don't see how we could easily do this at the moment.
Regarding the semiparametric nature: I fully agree for the LM while for the GLM I'm always torn. It certainly can be used as a semiparamteric model but it also is often motivated as a fully parametric distributional model.
For the normal model: I think this would warrant a method to switch between the two estimators. Any thoughts on the better and/or less confusing default?
For the normal model: I think this would warrant a method to switch between the two estimators. Any thoughts on the better and/or less confusing default?
I think either is fine. You could message()
informatively to avoid confusion perhaps.
First of all, thank you for the great work you have put into the package and apologies for my radio silence the last months. I really hope that I will find more time to help from October onwards!
I don't have a strong opinion on the variance estimator of (G)LMs, but I think it would be preferable if it were consistent with the log-likelihood. Hence, I like @zeileis's proposal to make it optional and would set the default to the maximum likelihood estimate?!
Thanks Alex @alexpghayes and Moritz @mnlang for the feedback. There is now a PR (https://github.com/alexpghayes/distributions3/pull/91) that implements both flavors of scale estimates in both prodist.lm
and prodist.glm
. In the latter I made sure that it also works for the stats::Gamma
family, not only the gaussian
family. I like that the logLik()
results can now be replicated via sum(log_pdf(prodist(...), ...))
. Hence, I also added formal tests for the log-likelihood from various models/families.
I also improved the documentation but decided to not go into too much detail. Hopefully there will be a paper that can dive into that in some more detail.
In the new
prodist()
method we return for a model the probability distribution associated with the point estimates from that model. In case of the linear model (or the Gaussian generalized linear model) it is uncontroversial what the point estimate for the location parametermu
is. However, there are two commonly used flavors forsigma
: Either based on the unbiased least squares estimate (division by n-k) or the biased maximum likelihood estimate (division by n).In the
prodist()
methods for lm and glm objects I implemented the least squares estimate because that is what thesummary()
methods for both lm and glm objects report. However, I realized now, that somewhat inconsistently thelogLik()
methods for both objects use the maximum likelihood estimate. Hence, wondered:prodist(..., sigma = "ML")
vs."OLS"
.Alex @alexpghayes maybe you have an opinion on this? Or maybe someone else?
For illustration: In the
summary()
method we have the following.Hence, up to now we have:
But with this we cannot replicate the log-likelihood:
To do so we would need the maximum likelihood estimate: