There is no "type" for linear models, response = link. In this particular case (i.e., lm), type lets you return predictions for the response or for the terms.
type = "response" for binomial models returns the predicted probabilities, while type = "link" returns the predicted log-odds (I guess). So yes, there might be a prediction interval for the binomial case with type = "response", or am I mistaken here?
pinging @mattansb as well.
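For reference, a minimal sketch of that link/response distinction on an arbitrary binomial model (am ~ mpg is used here purely for illustration):

```r
# Minimal sketch of the link/response distinction described above
mod <- glm(am ~ mpg, data = mtcars, family = binomial())

eta <- predict(mod, type = "link")      # log-odds (link scale)
p   <- predict(mod, type = "response")  # probabilities (response scale)

all.equal(plogis(eta), p)  # TRUE: "response" is the inverse-link of "link"
```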
> There is no "type" for linear models
the "type" I'm referring too is unrelated to the "type" arg of stats::predict(), it's just our conceptualization (but we might want to find a better name for it; initially i thought predict = "link"/"response"
but then predict is a function)
maybe what = "link"/"response"
? Though I'm not a big fan of "what"-type args... but method
doesn't really apply...
Technically you can get a prediction interval for non-gaussian models:
```r
library(magrittr)

mod <- glm(cyl ~ mpg, mtcars, family = poisson())

predict(mod, type = "response") %>%
  head()
#>         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
#>          5.748547          5.748547          5.305022          5.646886
#> Hornet Sportabout           Valiant
#>          6.369647          6.542429

predict.lm(mod, interval = "confidence", type = "response") %>%
  poisson()$linkinv() %>%
  head()
#>                        fit      lwr      upr
#> Mazda RX4         5.748547 5.418348 6.098870
#> Mazda RX4 Wag     5.748547 5.418348 6.098870
#> Datsun 710        5.305022 4.954377 5.680485
#> Hornet 4 Drive    5.646886 5.313438 6.001260
#> Hornet Sportabout 6.369647 6.033157 6.724905
#> Valiant           6.542429 6.195060 6.909276

predict.lm(mod, interval = "prediction", type = "response") %>%
  poisson()$linkinv() %>%
  head()
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): predictions on current data refer to _future_ responses
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): assuming prediction variance inversely proportional to weights used for fitting
#>                        fit      lwr      upr
#> Mazda RX4         5.748547 4.157805 7.947895
#> Mazda RX4 Wag     5.748547 4.157805 7.947895
#> Datsun 710        5.305022 3.781479 7.442396
#> Hornet 4 Drive    5.646886 4.071551 7.831739
#> Hornet Sportabout 6.369647 4.683892 8.662114
#> Valiant           6.542429 4.829782 8.862383
```
```r
mod <- glm(am ~ mpg, mtcars, family = binomial())

predict(mod, type = "response") %>%
  head()
#>         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
#>         0.4610951         0.4610951         0.5978984         0.4917199
#> Hornet Sportabout           Valiant
#>         0.2969009         0.2599331

predict.lm(mod, interval = "confidence", type = "response") %>%
  binomial()$linkinv() %>%
  head()
#>                         fit       lwr       upr
#> Mazda RX4         0.4610951 0.2518249 0.6850404
#> Mazda RX4 Wag     0.4610951 0.2518249 0.6850404
#> Datsun 710        0.5978984 0.3306147 0.8174011
#> Hornet 4 Drive    0.4917199 0.2707331 0.7159905
#> Hornet Sportabout 0.2969009 0.1387417 0.5253735
#> Valiant           0.2599331 0.1126400 0.4928556

predict.lm(mod, interval = "prediction", type = "response") %>%
  binomial()$linkinv() %>%
  head()
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): predictions on current data refer to _future_ responses
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): assuming prediction variance inversely proportional to weights used for fitting
#>                         fit         lwr       upr
#> Mazda RX4         0.4610951 0.013685038 0.9813997
#> Mazda RX4 Wag     0.4610951 0.013685038 0.9813997
#> Datsun 710        0.5978984 0.021226422 0.9902866
#> Hornet 4 Drive    0.4917199 0.015531309 0.9834226
#> Hornet Sportabout 0.2969009 0.004739448 0.9739891
#> Valiant           0.2599331 0.003264507 0.9741371
```
Created on 2021-02-28 by the reprex package (v1.0.0)
These represent the same thing: the range of expected rates (Poisson) / probabilities (binomial) sampled from the (conditional) population.
You can think of this as equivalent to CIs from simulated data from Bayesian posteriors.
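For comparison, a rough sketch of that Bayesian analogue for the Poisson example, summarising posterior draws of the expected value with rstanarm's posterior_epred() (the quantile-based summary is just one way to do it):

```r
library(rstanarm)

# Same Poisson model as above, refitted as a Bayesian model (illustrative only)
bmod <- stan_glm(cyl ~ mpg, data = mtcars, family = poisson(), refresh = 0)

# Draws of the expected rate for each observation, on the response scale
epred <- posterior_epred(bmod)

# Summarising the draws gives intervals analogous to the back-transformed
# confidence intervals shown above
ci <- t(apply(epred, 2, quantile, probs = c(0.025, 0.975)))
head(cbind(fit = colMeans(epred), ci))
```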
So you have:
I don't like transform because it can often imply using the delta method for CIs... which is a whole other can of worms I'm not sure we want to open.
Do we go for not having a frequentist equivalent of posterior_predict and hide this option from the Bayesian models (i.e., the option that makes the predictions only 0s and 1s)? Or does it still make sense?
```r
head(rstanarm::posterior_predict(rstanarm::stan_glm(vs ~ wt, data = mtcars, family = "binomial", refresh = 0, seed = 333)))
#>      Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
#> [1,]         1             0          0              0                 0
#> [2,]         1             1          1              0                 0
#> [3,]         0             1          0              0                 0
#> [4,]         1             1          1              0                 1
#> [5,]         1             0          1              0                 0
#> [6,]         1             1          1              0                 0
#>      Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
#> [1,]       0          0         1        0        0         0          0
#> [2,]       0          0         1        0        1         0          0
#> [3,]       1          0         1        0        0         0          1
#> [4,]       0          1         1        0        0         1          0
#> [5,]       0          0         0        0        0         0          0
#> [6,]       0          0         1        1        0         0          0
#>      Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> [1,]          0           1                  0                   0
#> [2,]          0           0                  0                   0
#> [3,]          0           0                  0                   0
#> [4,]          0           0                  0                   0
#> [5,]          0           0                  0                   0
#> [6,]          0           1                  0                   0
#>      Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
#> [1,]                 0        1           1              0             1
#> [2,]                 0        1           1              1             1
#> [3,]                 0        1           1              1             1
#> [4,]                 0        0           1              1             0
#> [5,]                 0        1           1              1             1
#> [6,]                 0        1           1              1             1
#>      Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
#> [1,]                0           0          0                0         0
#> [2,]                0           1          0                0         1
#> [3,]                0           0          0                0         0
#> [4,]                0           0          0                0         1
#> [5,]                0           0          0                0         1
#> [6,]                0           1          1                1         1
#>      Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
#> [1,]             1            1              1            1             0
#> [2,]             1            1              1            0             0
#> [3,]             0            1              1            1             0
#> [4,]             1            1              0            1             0
#> [5,]             1            1              0            1             0
#> [6,]             1            1              0            1             1
#>      Volvo 142E
#> [1,]          1
#> [2,]          1
#> [3,]          1
#> [4,]          0
#> [5,]          1
#> [6,]          0
```
Created on 2021-02-28 by the reprex package (v0.3.0)
I think we need to separate "predicting the central tendency in the population (confidence interval) / future samples (prediction interval)" from simulating actual observations implied by the model.
By means of an argument? Or by a different function?
I think a different function. Don't we have some "simulate posterior" function? That can be expanded to also simulate the population.
Why do we need to re-invent the wheel when there's already a simulate() method? We could of course have a "stable generic", like get_simulated(), but for common model types this would just be a wrapper around simulate(). I don't quite understand what you're missing.
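For what it's worth, a quick sketch of what simulate() already returns for a binomial glm (each column is one set of simulated 0/1 responses, i.e. roughly the frequentist counterpart of posterior_predict()):

```r
# Illustrative binomial model; simulate() draws new responses from the fitted
# probabilities, so each column is a vector of 0s and 1s
mod <- glm(am ~ mpg, data = mtcars, family = binomial())
head(simulate(mod, nsim = 5, seed = 123))
```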
Agreed
So, to summarize:

- get_predicted() returns the central tendency in the population (confidence interval) or future samples (prediction interval), depending on the type of prediction.
- simulate() / get_simulations() would generate actual observations implied by the model.

I'll focus on get_predicted() first. So for the two main arguments, type & scale, I initially thought of having:

- predict = "link" / "response": the problem is that predict is also a function, so it's not the best name for an argument.
- type = "link" / "response": I switched to type, but the only problem is that this might be confusing with respect to the predict() API, which uses type for the "scale".
- what, method, ...

For the scale, I initially thought of having transform = TRUE, but you mentioned that it might not be the best choice. The problem with scale = "response" / "link" is that it's also the name of a function, plus it would have the same options as the previous argument, which makes it a bit confusing IMO. Other possibilities include rescale = TRUE, unit = "model" / "response", ...
Why not stick with type = "link" / "response" and interval = "confidence" / "prediction"?
Because I find it super confusing to call the scale the type; plus, for instance, for Bayesian models it's not just the "interval" that is different. So interval is not really the best name: in the current implementation there is ci_type, but in the docs I have to explain that for Bayesian models that don't return CIs it is still this argument that modulates the TYPE of prediction...
I find it less confusing (and consistent) to have get_predicted(X, what = "response"/"link", scale = "response"/"model") or something like this.
scale="response"/"model"
I find this very not-intuitive...
> for Bayesian models it's not just the "interval" that is different
Not sure what you mean here... I think there just isn't an option to get prediction intervals in Bayes... no?
Well not "intervals" technically, but different posteriors distribution corresponding either to the link or the predictions. Which affect the intervals when summarized. But technically that's not an interval indeed, that's why I don't like the name interval
and would prefer type
or what
So in the latest implementation I went for type and scale. However, there are now 3 possibilities for type: "link" does what it does, "prediction" gives prediction intervals etc. (prediction on new data), but "response" now makes new observations, similarly to its Bayesian counterpart; i.e., for logistic models it returns a vector of zeros and ones (I suspect this option could be useful and convenient to machine-learning folks who do classification).
For LMs, type = "prediction" and type = "response" are equivalent; they both give "prediction" intervals. The "mean" prediction is the same for the three types (but not for Bayesian models, as it relies on posterior_predict instead of linpred, so it can vary slightly).
It might sound convoluted for now, but once I clean everything up I'll document it better and hopefully it will make sense ^^
I don't like those argument names at all... The type of interval is not the type of prediction...
See my comments on this commit: https://github.com/easystats/insight/commit/5b50279100ea10e276c775de5b2ca6bceb309f84
Alright, let me rephrase.
There's currently a function called get_predicted_ci whose role is to compute CIs around a vector of predictions. One of the key arguments of this function is ci_type, which is the interval type and can take two values, "confidence" or "prediction". I think we all agree up to here about these different types of intervals around predictions: one is related to the confidence around the link, and the other to "new" predictions (e.g., on future data), as it integrates more variability.
Now the new get_predicted() function calls the get_predicted_ci() function to add CIs. So you might wonder why I didn't keep the ci_type argument, as was the case in my OLD implementation (the new one is for now technically available via get_predicted_new()). Because of consistency and conceptual usage.
For instance, in practice, for frequentist linear models the "mean" prediction (obtained from predict()) is the same in all cases and, as you mentioned, the only thing that changes is the interval (prediction vs. confidence). But that's just a mere artefact of how these models are made and is not conceptually equivalent. This shows in Bayesian models, in which predicting the link vs. predicting the response (i.e., making real predictions of observations) is in itself different (typically, predictions are not "ordered", i.e., they are independent at each draw).
```r
library(ggplot2)
library(rstanarm)

x <- stan_glm(mpg ~ disp, data = mtcars, refresh = 0)
newdata <- data.frame(disp = c(50, 100, 150, 200))

ypred <- as.data.frame(t(posterior_linpred(x, newdata = newdata)))
names(ypred) <- paste0("iter", 1:ncol(ypred))
ypred <- bayestestR::reshape_iterations(cbind(newdata, ypred))

ggplot(ypred, aes(x = disp, y = iter_value, group = iter_group, color = iter_group)) +
  geom_line(alpha = 0.1)

ypred <- as.data.frame(t(posterior_predict(x, newdata = newdata)))
names(ypred) <- paste0("iter", 1:ncol(ypred))
ypred <- bayestestR::reshape_iterations(cbind(newdata, ypred))

ggplot(ypred, aes(x = disp, y = iter_value, group = iter_group, color = iter_group)) +
  geom_line(alpha = 0.1)
```
Created on 2021-03-03 by the reprex package (v0.3.0)
And it felt weird to use ci_type in Bayesian contexts, in which there is no interval per se. So that's why, outside of get_predicted_ci, the argument became type (rather than ci_type). However, we then discussed an even more subtle distinction, which is whether to make predictions or to actually predict the responses, for which the output should have the same type as the original response vector (binary, for instance). This is what Bayesian "response" prediction does, and it would make sense for frequentist models as well (for instance, the caret package for machine learning automatically returns 0-1 predictions for logistic classifiers).
So we end up with three "types" of predictions (and, consequently, intervals, which are a reflection of the conceptual nature of the predictions rather than a thing in themselves): the link (only the variance of the link), predictions (taking into account more variance), and responses (which are predictions, but whose output is of the same type as the response). Now these three "types" are not fully distinct, as depending on the framework they can be equivalent: for frequentist LMs, "prediction" and "response" are the same, and their estimate is identical to the "link".
On top of that, for the "link" and "prediction" types, we have the "unit" of the output, or the "scale": whether it is expressed in the model's link unit (e.g., log-odds) or on the response scale.
It's true that what I refer to as type = "response" (i.e., the output of zeros and ones for logistic models) could possibly be seen as a modulation of the scale argument rather than of the type, because for frequentist GLMs it's just a transformation of the probabilities to their most likely extreme. But the problem is that for Bayesian models it truly behaves like a separate type (it is the output of posterior_predict())... Anyway, this "scale" arg was initially named transform = TRUE, but as @mattansb mentioned it was not particularly intuitive; scale is clearer, I think.
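To make the "most likely extreme" transformation concrete, here is a tiny sketch for a frequentist logistic model (this mimics what classification-oriented tools like caret return; it is not insight code):

```r
mod <- glm(am ~ mpg, data = mtcars, family = binomial())

# Probabilities on the response scale...
probs <- predict(mod, type = "response")

# ...collapsed to their most likely extreme, i.e. predicted 0/1 classes
classes <- as.integer(probs >= 0.5)
head(data.frame(prob = round(probs, 3), class = classes))
```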
If you have better ideas, please share! My brain is hurting thinking about this.
The type arg could be renamed to predict = "link" / "prediction" / "response" to avoid confusion with the base predict type; it's just that I 1) didn't like predict = "prediction" (redundant!) and 2) liked the partial similarity between ci_type and type (since one is a consequence of the other).
Alright, so I went back to a simple API with 2 options, predict = "link" or predict = "response", which is probably the best name and doesn't conflict with the base R arguments... For frequentist models they mostly impact the CI width, and for Bayesian models a bit more than that, but I guess having perfect symmetry in the output is inherently impossible.
So I think we can close this ^^ I tried going as far as I could but finally came back to a more sensible implementation that closely follows the logic of the R implementation.
Ok, to summarize:

- predict = "link" computes confidence intervals; predicted values can be on the link or response scale.
- predict = "response" computes prediction intervals, but predicted values are always on the response scale.
- scale = "link" computes predicted values on the link scale, but only if predict = "link".
- scale = "response" computes predicted values on the response scale, but only if predict = "link".

To get predicted values on the link scale with confidence intervals, we need: predict = "link", scale = "link".
To get predicted values on the response scale with confidence intervals, we need: predict = "link", scale = "response".
To get predicted values on the response scale with prediction intervals, we need: predict = "response" (scale will be ignored).
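Spelled out as calls, the combinations above would look something like this (purely a sketch of the proposal under discussion; the predict/scale argument names are not a settled API):

```r
# Hypothetical calls mirroring the summary above (not a finalized interface)

# link-scale predictions + confidence intervals
get_predicted(model, predict = "link", scale = "link")

# response-scale predictions + confidence intervals
get_predicted(model, predict = "link", scale = "response")

# response-scale predictions + prediction intervals (scale is ignored)
get_predicted(model, predict = "response")
```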
I'm not sure this is intuitive... why not predict and ci_type as arguments, where predict is the former scale, and ci_type is the former predict? Or maybe interval instead of ci_type? The latter having different values, like "confidence" or "prediction".
Note that scale as argument name can be misleading for, say, ZI-models. See the type argument for these two:
https://www.rdocumentation.org/packages/pscl/versions/1.5.5/topics/predict.hurdle
https://www.rdocumentation.org/packages/glmmTMB/versions/1.0.2.1/topics/predict.glmmTMB
To me, it seems it all boils down to the following: why don't we have those two arguments, which do not have (and do not share) the same values? Like:

- I_want_to_predict = c("link", "response", "zi-prob", ...)
- uncertainty_should_account_for = c("estimation", "prediction", "random_effect_variances", "all", ...)
> Ok, to summarize:
>
> - predict = "link" computes confidence intervals, predicted values can be on the link or response scale.
> - predict = "response" computes prediction intervals, but predicted values are always on the response scale
> - scale = "link" computes predicted values on the link-scale, but only if predict = "link".
> - scale = "response" computes predicted values on the response scale, but only if predict = "link".
Not really, the scale and the predict are fully independent; i.e., you can have scale "link" or "response" for both options of predict. Basically, all the computations related to the prediction and the CIs are done on the scale = "link". At the very end, only if scale = "response", they are transformed. So it's more like:
> I'm not sure this is intuitive... why not predict and ci_type as arguments,
Cf. my comment above, where it turns out that these "interval" modifiers are interval modifiers only in frequentist contexts; the underlying idea behind them is a different type of what we want to predict.
> Note that scale as argument name can be misleading for, say, ZI-models. See type argument for these two:
Well, we could also have scale = "count" or "zero"; that wouldn't strike me as inconsistent, I think?
> To me, it seems it all boils down to:
That's an interesting reframing!
We could have:

- predict = c("response", "link", "zi-prob", ...): for LMs, "response" and "link" are the same. For logistic models, these modulate the "scale" (so to speak): response gives probabilities and link gives log-odds.
- uncertainty = "parameters" (i.e., link), "prediction" (i.e., prediction interval), "random", "smooth", etc.: the only problem is, are these elements additive, or can you just have uncertainty = "random" for instance? Or just uncertainty = "prediction" without "parameters" / link?

Also, how would this work for Bayesian models? Are there arguments like re.form to modulate that, or do we need to modify the newdata to remove the data of the sources we want to skip?
Okay, I may have led you astray here, so let me give this a shot:
What do you want?
├── Simulated data from the population (which Bayes gives so easily).
└── The *expected value*.
    ├── On link scale: linear expected value + CI
    └── On response scale
        ├── with confidence / credible interval
        └── with prediction interval
@5 yes, that's e.g. type = "response" in glmmTMB, and in such cases we need a different approach to compute the CI.
What is the "expected value" with prediction interval for a Bayesian lm then?
Super easy!
```r
library(rstanarm)
library(bayestestR)

mod <- stan_glm(vs ~ wt, data = mtcars, family = "binomial", refresh = 0, seed = 333)

pred_interval <- predictive_interval(mod)
pred_expected <- posterior_epred(mod)
pred_expected <- apply(pred_expected, 2, mean)

head(cbind(pred_expected, pred_interval))
#>                   pred_expected 5% 95%
#> Mazda RX4             0.6650345  0   1
#> Mazda RX4 Wag         0.5560404  0   1
#> Datsun 710            0.7682389  0   1
#> Hornet 4 Drive        0.3991759  0   1
#> Hornet Sportabout     0.3045867  0   1
#> Valiant               0.2969128  0   1
```
Alright, I give up; this feature was too much to chew. I need help: please take a look at get_predicted_new.lm and get_predicted_new.stanreg and help me try to make it consistent and correct.
I think all the code base is already there (almost?), so mainly we have to decide on this:
https://github.com/easystats/insight/issues/310#issuecomment-790376789
Which argument (names), which options for those arguments? Then we can proceed.
> Which argument (names), which options for those arguments? Then we can proceed.
That's the crux of the issue, haha; the code itself is working pretty well now.
> What do you want?
> ├── Simulated data from the population (which Bayes gives so easily).
> └── The *expected value*.
>     ├── On link scale: linear expected value + CI
>     └── On response scale
>         ├── with confidence / credible interval
>         └── with prediction interval
(side question; how do you do these ascii decision trees???)
So we could have:
├── get_simulations(); get_simulated()
└── get_predicted()
    ├── predict = "link"
    └── predict = "response"
        ├── ci_type = "confidence"
        └── ci_type = "prediction"
Which was somewhat close to the initial design.
Though @mattansb said "You can have a single expected value also for mixture (ZI) models, I think.", and I'm not sure how the above aligns with the ZI case: I_want_to_predict = c("link", "response", "zi-prob", ...) doesn't fit in that tree.
But again, for (freq) GLMs the computation of the ci_type is done on the link scale, and then transformed, so you can technically have different ci_types on the link scale?
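A small sketch of that "compute on the link scale, then transform at the end" mechanism, using se.fit from stats::predict() and a Wald/normal approximation (illustrative, not the exact insight code):

```r
mod <- glm(am ~ mpg, data = mtcars, family = binomial())

# Fit and standard errors on the link scale
pr <- predict(mod, type = "link", se.fit = TRUE)

# Wald-type confidence interval, computed on the link scale...
lwr <- pr$fit - qnorm(0.975) * pr$se.fit
upr <- pr$fit + qnorm(0.975) * pr$se.fit

# ...and only back-transformed to the response scale at the very end
linkinv <- family(mod)$linkinv
head(data.frame(fit = linkinv(pr$fit), lwr = linkinv(lwr), upr = linkinv(upr)))
```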
Wouldn't something like this work?
├── get_simulations(); get_simulated()
└── get_predicted()
    ├── predict = "zi-prob"
    │   ├── ci_type = "confidence"
    │   └── ci_type = "prediction"
    ├── predict = "link"
    │   ├── ci_type = "confidence"
    │   └── ci_type = "prediction"
    └── predict = "response"
        ├── ci_type = "confidence"
        └── ci_type = "prediction"
+ include_smooth; include_random; etc.
(Though I kinda liked Daniel's proposal to have something like uncertainty_from = "all".)
Also, maybe we should standardize the output; currently the main output is a vector of predictions (CIs and SE are stored as attributes) and a dataframe of draws for bootstrapped / Bayesian models. Maybe we should either 1) summarize it ourselves by median/mean so that it's always a vector (draws could be stored as an attribute), or 2) always return a dataframe (with CIs and all). I'd tend toward the first option, as I see get_predicted() as a robust predict() which returns a vector, which is used for instance in performance to get indices of fit or whatnot.
Then, modelbased could focus on working with these predictions and providing a better output as a tidy dataframe that also includes the data/newdata, eventually the draws, etc.
Prediction intervals don't make sense on the link scale; I suggest that option be dropped.
Also, for mixture models (such as ZI) it would be best to always return a single value (@strengejacke said this is possible...?); I see no reason these should be an exception.
I agree we should return a vector with attributes and not a df. (Or have an output argument for users to select?)
> I agree we should return a vector with attributes and not a df. (Or have an output argument for users to select?)
For now we have an as.data.frame() method that returns a df with the SE/CI.
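As a rough sketch of that output shape (a plain numeric vector carrying its intervals as attributes, with a data-frame view built on demand; the attribute names here are only illustrative, not necessarily the ones insight uses):

```r
# Illustrative only: a predictions vector with CI bounds attached as attributes
pred <- c(0.46, 0.60, 0.49)
attr(pred, "ci_low")  <- c(0.25, 0.33, 0.27)
attr(pred, "ci_high") <- c(0.69, 0.82, 0.72)

# Behaves like a normal numeric vector for downstream use (e.g. in performance)
mean(pred)

# ...while a data frame with the CIs can be reconstructed from the attributes
data.frame(
  Predicted = as.numeric(pred),
  CI_low    = attr(pred, "ci_low"),
  CI_high   = attr(pred, "ci_high")
)
```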
Also, why not include smooth?
I made the ASCII tree manually.
> Also, why not include smooth?
For GAMs it's currently included by default, but one has the option to provide newdata without the smooth variable (in which case it will basically retrieve it and add it fixed at its mean value), or to just turn include_smooth to FALSE, which does the same. This is so the predicting function "works" even if no data for the smooth is supplied, which can be the case for visualization purposes where we only want to visualize the effect of the other fixed effects.
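A quick mgcv sketch of that behaviour (holding the smooth variable fixed at its mean so that only the other fixed effect varies; the model is just an illustration of the idea, not insight's internals):

```r
library(mgcv)

m <- gam(mpg ~ am + s(wt), data = mtcars)

# The smooth variable (wt) is held at its mean, so predictions only vary
# with the other fixed effect (am), which is handy for visualization purposes
newdata <- data.frame(am = c(0, 1), wt = mean(mtcars$wt))
predict(m, newdata = newdata, se.fit = TRUE)
```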
Let's make a decision: what's your opinion on an API (names of the args, and what they do depending on the combination users input) that would work?
- type = response (default) / link
- interval / interval_type = confidence / credible (default) or prediction {ignore and return CI if type == link}
Yes, that would be a good choice; or "predict" instead of "type", and maybe "include_sigma" (as a logical) instead of interval, if you prefer argument names different from the classic R convention.
"Include sigma" would only make sense for prediction intervals of gaussian data.
Or "include_sd"? emmeans also refers to sigma, also for glm: https://cran.r-project.org/web/packages/emmeans/vignettes/predictions.html#sd-estimate
But according to ?sigma
, I would say include_sigma
is fine.
That is because emmeans (and most other packages) only supports PIs for gaussian models. So any reference to sigma/sd means we are limiting ourselves to only those models.
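For context, a sketch of how sigma enters a gaussian prediction interval; the manual computation below should match predict.lm's "prediction" interval, which is exactly why the sigma/sd framing ties PIs to gaussian-type models:

```r
m  <- lm(mpg ~ wt, data = mtcars)
pr <- predict(m, se.fit = TRUE)

# Prediction intervals add the residual variance (sigma^2) on top of the
# uncertainty of the expected value
se_pred <- sqrt(pr$se.fit^2 + sigma(m)^2)
tcrit   <- qt(0.975, df = df.residual(m))

manual <- cbind(fit = pr$fit,
                lwr = pr$fit - tcrit * se_pred,
                upr = pr$fit + tcrit * se_pred)

head(manual)
head(predict(m, interval = "prediction"))  # should agree with the manual version
```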
> interval / interval_type = confidence / credible (default) or prediction {ignore and return CI if type == link}
It would be best if we could avoid having some arguments be ignored depending on others.
What about:
├── get_simulated()
└── get_predicted()
    ├── predict = "link" (-> ci_type = "confidence" & scale = "link" for GLMs)
    ├── predict = "relation" (-> ci_type = "confidence" & scale = "response" for GLMs)
    └── predict = "prediction" (-> ci_type = "prediction" & scale = "response" for GLMs)

This would be rather clear: depending on the user's need (visualizing some relationships in the model or making new predictions), there is only one argument, for which we can detail all the implications.
It seems like there are essentially 3 arguments that are often entangled (and a bit ambiguous): type, interval, and transform.

- type (and sometimes transform) modulates the scale; for non-linear models, they lead to predictions and intervals expressed in the model's units (e.g., log-odds) or on the response scale (probabilities).
- interval, mostly for linear models, modulates the type and width of the CI.

Now I want to get to an intuitive and consistent API. The following is to help tidy things in my mind, and to make sure there's no major blind spot in the rationale.
It seems to me that people use predictions either 1) to help understand and visualize a relationship, or 2) to make actual predictions (whether on the same data, to assess performance, or on new data). Thus, our API should reflect these two "kinds" of predictions as one of the most important distinctions. Now, how does it relate to the stats::predict() API?

- For the easy case, if the kind of prediction is "link": transform = FALSE
- If the kind of prediction is "response":

What do you think?