Predict neglects offset by default?

stouffer commented 3 years ago

I am hoping to get some clarity on what struck me as counter-intuitive behavior of the predict function given a gllvm model fit. Specifically, if the fit includes an offset, why does predict not include this by default nor include an option with which to control this behavior? This runs contrary to the behavior of similar functions (e.g., glm) which automatically incorporate the same offsets used in model fitting while generating fitted/predicted values.

BertvanderVeen commented 3 years ago

I vaguely recall the offset used to he included by default, so it seems odd that it isn't now. The suggestion of an argument to control the offset sounds like a good idea!

stouffer commented 3 years ago

Thanks for your reply. I've been thinking this over since creating this issue, and I suppose in some senses it might be a matter of perspective as well that, for similar reasons, explains why there are both fitted and predict functions for typical lm or glm fits.

If one imagines that predict is largely targeted toward making well-defined predictions (e.g., with new data), then whatever was being controlled for in the offset (e.g., exposure in a Poisson regression for different sampling times) would be something that one would probably not wish to include because it would move predictions away from a "common currency". On the other hand, including the offset when looking at fitted values makes perfect sense because the factor or factors making up the offset are thought to be manifest in the observations too.

So maybe an easier solution for the package would be to hang on to the fitted values after the model has been fit (or recalculate them), which would enable it to be passed through the base fitted function. For lm and glm this simply extracts the values that are in the object$fitted.values which are saved at the end of the lm.fit and glm.fit, respectively, including any offset when present. Neither gllvm.TMB not gllvm.VM appear to include the estimated mus/etas in the object they return as far as I can tell, but since they get calculated maybe they could be added to the object that gets returned?

BertvanderVeen commented 3 years ago

I appreciate the comparison with linear regression. However, some of the decisions that need to be made in order to calculate/define the fitted values, and to predict on new data, are not trivial for models that include random-effects, unlike in linear regression.

So, though the suggestion you are making is definitely something to consider, and though the predict function is not ideal in its current form, it will take some time to adjust. For example, how would one predict new data with random-effects in the model (i.e., the latent variables)? We can define various "predictions", such as conditional on the predicted LVs (as is the default in the predict function), marginal (the LVs need to be integrated over, this is very difficult), or ... other (such as drawing new deviates from a normal distribution for the latent variables, but for the existing number of sites)? And the same applied to the fitted values. With both comes the issue that integration over the random-effect gives different results on the link and response scale, so that E(g(eta)) \neq g(E(eta)) (i.e., we can't simply transform the linear predictor to the response scale, unless we do as is the current default; a conditional prediction).

So, there are a lot of things to consider, and your suggestion would require a serious overhaul of the current implementation. An option that allows you to choose for/against including an offset is obviously a lot more straightforward to implement.

stouffer commented 3 years ago

I can totally see where you're coming from, though I'd say that there are pretty reasonable precedents for some of these non-trivial questions based on what occurs in predict.merMod in lme4, predict.lme in nlme, or in similar packages. Fundamentally, almost every other package I've encountered returns the fitted values of a model when someone runs predict(object) whereas this one doesn't, and that is what struck me as surprising or worth comment. That said....

After my message from earlier, I realized that I the linear predictor, including all fixed and random effects, gets calculated in residuals.gllvm. Therefore, perhaps an easier suggestion (which I would actually be happy to try to implement myself and submit as a pull request) would be to borrow from that code to produce a fitted.gllvm function that unambiguously returns the fitted values from a model of class gllvm. How does that sound?

BertvanderVeen commented 3 years ago

Of course there are packages that we can look to for comparison. Having said that, the developers of lme4 or nlme make choices that might not be completely suitable for GLLVMs under all circumstances, so that these choices still need to be considered independently here.

And indeed, the linear predictor does get calculated in multiple functions (that is, the linear predictor conditional on the predicted LVs), since it is straightforward to calculate conditional on the predicted LVs. The marginal prediction could in most cases be retrieved from TMB. I also agreed above with you that an option for an offset in the predict function would be helpful. When an option for an offset is implemented in the predict function, it is trivial to provide a wrapper for the predict function to retrieve the fitted values (as that would after all, be the default behavior for predict), so coding that up is not really the issue (predictions for new data are more difficult, as I wrote above).

Regardless, in the end, @JenniNiku is the main developer, and decisions like these are up to her.

JenniNiku commented 3 years ago

Thanks for pointing this out. I included the offset option to predict now, see commit

stouffer commented 3 years ago

Brilliant. Thanks!

JenniNiku / gllvm

Predict neglects offset by default? #48