There is no "type" for linear models, response = link. In this particular case (i.e., lm), type lets you return predictions for the response or for the terms.
type = "response" for binomial models returns the predicted probabilities, while type = "link" returns the predicted log-odds (I guess). So yes, there might be a prediction interval for the binomial case with type = "response", or am I mistaken here?
pinging @mattansb as well.
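For reference, a minimal sketch of that link/response distinction on an arbitrary binomial model (am ~ mpg is used here purely for illustration):

```r
# Minimal sketch of the link/response distinction described above
mod <- glm(am ~ mpg, data = mtcars, family = binomial())

eta <- predict(mod, type = "link")      # log-odds (link scale)
p   <- predict(mod, type = "response")  # probabilities (response scale)

all.equal(plogis(eta), p)  # TRUE: "response" is the inverse-link of "link"
```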
> There is no "type" for linear models
the "type" I'm referring too is unrelated to the "type" arg of stats::predict(), it's just our conceptualization (but we might want to find a better name for it; initially i thought predict = "link"/"response"
but then predict is a function)
maybe what = "link"/"response"
? Though I'm not a big fan of "what"-type args... but method
doesn't really apply...
Technically you can get a prediction interval for non-gaussian models:
```r
library(magrittr)

mod <- glm(cyl ~ mpg, mtcars, family = poisson())

predict(mod, type = "response") %>%
  head()
#>         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
#>          5.748547          5.748547          5.305022          5.646886
#> Hornet Sportabout           Valiant
#>          6.369647          6.542429

predict.lm(mod, interval = "confidence", type = "response") %>%
  poisson()$linkinv() %>%
  head()
#>                        fit      lwr      upr
#> Mazda RX4         5.748547 5.418348 6.098870
#> Mazda RX4 Wag     5.748547 5.418348 6.098870
#> Datsun 710        5.305022 4.954377 5.680485
#> Hornet 4 Drive    5.646886 5.313438 6.001260
#> Hornet Sportabout 6.369647 6.033157 6.724905
#> Valiant           6.542429 6.195060 6.909276

predict.lm(mod, interval = "prediction", type = "response") %>%
  poisson()$linkinv() %>%
  head()
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): predictions on current data refer to _future_ responses
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): assuming prediction variance inversely proportional to weights used for fitting
#>                        fit      lwr      upr
#> Mazda RX4         5.748547 4.157805 7.947895
#> Mazda RX4 Wag     5.748547 4.157805 7.947895
#> Datsun 710        5.305022 3.781479 7.442396
#> Hornet 4 Drive    5.646886 4.071551 7.831739
#> Hornet Sportabout 6.369647 4.683892 8.662114
#> Valiant           6.542429 4.829782 8.862383
```
```r
mod <- glm(am ~ mpg, mtcars, family = binomial())

predict(mod, type = "response") %>%
  head()
#>         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
#>         0.4610951         0.4610951         0.5978984         0.4917199
#> Hornet Sportabout           Valiant
#>         0.2969009         0.2599331

predict.lm(mod, interval = "confidence", type = "response") %>%
  binomial()$linkinv() %>%
  head()
#>                         fit       lwr       upr
#> Mazda RX4         0.4610951 0.2518249 0.6850404
#> Mazda RX4 Wag     0.4610951 0.2518249 0.6850404
#> Datsun 710        0.5978984 0.3306147 0.8174011
#> Hornet 4 Drive    0.4917199 0.2707331 0.7159905
#> Hornet Sportabout 0.2969009 0.1387417 0.5253735
#> Valiant           0.2599331 0.1126400 0.4928556

predict.lm(mod, interval = "prediction", type = "response") %>%
  binomial()$linkinv() %>%
  head()
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): predictions on current data refer to _future_ responses
#> Warning in predict.lm(mod, interval = "prediction", type = "response"): assuming prediction variance inversely proportional to weights used for fitting
#>                         fit         lwr       upr
#> Mazda RX4         0.4610951 0.013685038 0.9813997
#> Mazda RX4 Wag     0.4610951 0.013685038 0.9813997
#> Datsun 710        0.5978984 0.021226422 0.9902866
#> Hornet 4 Drive    0.4917199 0.015531309 0.9834226
#> Hornet Sportabout 0.2969009 0.004739448 0.9739891
#> Valiant           0.2599331 0.003264507 0.9741371
```
Created on 2021-02-28 by the reprex package (v1.0.0)
These represent the same thing: the range of expected rates (Poisson) / probabilities (binomial) sampled from the (conditional) population.
You can think of this as equivalent to CIs from simulated data from Bayesian posteriors.
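For comparison, a rough sketch of that Bayesian analogue for the Poisson example, summarising posterior draws of the expected value with rstanarm's posterior_epred() (the quantile-based summary is just one way to do it):

```r
library(rstanarm)

# Same Poisson model as above, refitted as a Bayesian model (illustrative only)
bmod <- stan_glm(cyl ~ mpg, data = mtcars, family = poisson(), refresh = 0)

# Draws of the expected rate for each observation, on the response scale
epred <- posterior_epred(bmod)

# Summarising the draws gives intervals analogous to the back-transformed
# confidence intervals shown above
ci <- t(apply(epred, 2, quantile, probs = c(0.025, 0.975)))
head(cbind(fit = colMeans(epred), ci))
```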
So you have:
I don't like transform because it can often imply using the delta method for CIs... which is a whole other can of worms I'm not sure we want to open.
Do we go for not having a frequentist equivalent of posterior_predict and hide this option from the Bayesian models (i.e., the option that makes the predictions only 0s and 1s)? Or does it still make sense?
```r
head(rstanarm::posterior_predict(rstanarm::stan_glm(vs ~ wt, data = mtcars, family = "binomial", refresh = 0, seed = 333)))
#>      Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
#> [1,]         1             0          0              0                 0
#> [2,]         1             1          1              0                 0
#> [3,]         0             1          0              0                 0
#> [4,]         1             1          1              0                 1
#> [5,]         1             0          1              0                 0
#> [6,]         1             1          1              0                 0
#>      Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
#> [1,]       0          0         1        0        0         0          0
#> [2,]       0          0         1        0        1         0          0
#> [3,]       1          0         1        0        0         0          1
#> [4,]       0          1         1        0        0         1          0
#> [5,]       0          0         0        0        0         0          0
#> [6,]       0          0         1        1        0         0          0
#>      Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> [1,]          0           1                  0                   0
#> [2,]          0           0                  0                   0
#> [3,]          0           0                  0                   0
#> [4,]          0           0                  0                   0
#> [5,]          0           0                  0                   0
#> [6,]          0           1                  0                   0
#>      Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
#> [1,]                 0        1           1              0             1
#> [2,]                 0        1           1              1             1
#> [3,]                 0        1           1              1             1
#> [4,]                 0        0           1              1             0
#> [5,]                 0        1           1              1             1
#> [6,]                 0        1           1              1             1
#>      Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
#> [1,]                0           0          0                0         0
#> [2,]                0           1          0                0         1
#> [3,]                0           0          0                0         0
#> [4,]                0           0          0                0         1
#> [5,]                0           0          0                0         1
#> [6,]                0           1          1                1         1
#>      Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
#> [1,]             1            1              1            1             0
#> [2,]             1            1              1            0             0
#> [3,]             0            1              1            1             0
#> [4,]             1            1              0            1             0
#> [5,]             1            1              0            1             0
#> [6,]             1            1              0            1             1
#>      Volvo 142E
#> [1,]          1
#> [2,]          1
#> [3,]          1
#> [4,]          0
#> [5,]          1
#> [6,]          0
```
Created on 2021-02-28 by the reprex package (v0.3.0)
I think we need to separate "predicting the central tendency in the population (confidence interval) / future samples (prediction interval)" from simulating actual observations implied by the model.
By means of an argument? Or by a different function?
I think a different function. Don't we have some "simulate posterior" function? That can be expanded to also simulate the population.
Why do we need to re-invent the wheel when there's already a simulate() method? We could of course have a "stable generic", like get_simulated(), but for common model types this would just be a wrapper around simulate(). I don't quite understand what you're missing.
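For what it's worth, a quick sketch of what simulate() already returns for a binomial glm (each column is one set of simulated 0/1 responses, i.e. roughly the frequentist counterpart of posterior_predict()):

```r
# Illustrative binomial model; simulate() draws new responses from the fitted
# probabilities, so each column is a vector of 0s and 1s
mod <- glm(am ~ mpg, data = mtcars, family = binomial())
head(simulate(mod, nsim = 5, seed = 123))
```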
Agreed
So, to summarize:

- get_predicted() returns the central tendency in the population (confidence interval) or future samples (prediction interval), depending on the type of prediction.
- simulate() / get_simulations() would generate actual observations implied by the model.

I'll focus on get_predicted() first. So for the two main arguments, type & scale, I initially thought of having:

- predict = "link" / "response": the problem is that predict is also a function, so it's not the best name for an argument.
- type = "link" / "response": I switched to type, but the only problem is that this might be confusing with respect to the predict() API, which uses type for the "scale".
- what, method, ...

For the scale, I initially thought of having transform = TRUE, but you mentioned that it might not be the best choice. The problem with scale = "response" / "link" is that it's also the name of a function, plus it would have the same options as the previous argument, which makes it a bit confusing IMO. Other possibilities include rescale = TRUE, unit = "model" / "response", ...
Why not stick with type = "link" / "response" and interval = "confidence" / "prediction"?
Because I find it super confusing to call the scale the type; plus, for instance, for Bayesian models it's not just the "interval" that is different. So interval is not really the best name: in the current implementation there is ci_type, but in the docs I have to explain that for Bayesian models that don't return CIs it is still this argument that modulates the TYPE of prediction...
I find it less confusing (and consistent) to have get_predicted(X, what = "response"/"link", scale = "response"/"model") or something like this.
scale="response"/"model"
I find this very not-intuitive...
> for Bayesian models it's not just the "interval" that is different
Not sure what you mean here... I think there just isn't an option to get prediction intervals in Bayes... no?
Well not "intervals" technically, but different posteriors distribution corresponding either to the link or the predictions. Which affect the intervals when summarized. But technically that's not an interval indeed, that's why I don't like the name interval
and would prefer type
or what
So in the latest implementation I went for type and scale. However, there are now 3 possibilities for type: "link" does what it does, "prediction" gives prediction intervals etc. (prediction on new data), but "response" now makes new observations, similarly to its Bayesian counterpart; i.e., for logistic models it returns a vector of zeros and ones (I suspect this option could be useful and convenient to machine-learning folks who do classification).
For LMs, type = "prediction" and type = "response" are equivalent; they both give "prediction" intervals. The "mean" prediction is the same for the three types (but not for Bayesian models, as it relies on posterior_predict instead of linpred, so it can vary slightly).
It might sound convoluted for now, but once I clean everything up I'll document it better and hopefully it will make sense ^^
I don't like those argument names at all... The type of interval is not the type of prediction...
See my comments on this commit: https://github.com/easystats/insight/commit/5b50279100ea10e276c775de5b2ca6bceb309f84
Alright, let me rephrase.
There's currently a function called get_predicted_ci whose role is to compute CIs around a vector of predictions. One of the key arguments of this function is ci_type, which is the interval type and can take two values, "confidence" or "prediction". I think we all agree up to here about these different types of intervals around predictions: one is related to the confidence around the link, and the other to "new" predictions (e.g., on future data), as it integrates more variability.
Now the new get_predicted() function calls the get_predicted_ci() function to add CIs. So you might wonder why I didn't keep the ci_type argument, as was the case in my OLD implementation (the new one is for now technically available via get_predicted_new()). Because of consistency and conceptual usage.
For instance, in practice, for frequentist linear models the "mean" prediction (obtained from predict()) is the same in all cases and, as you mentioned, the only thing that changes is the interval (prediction vs. confidence). But that's just a mere artefact of how these models are made and is not conceptually equivalent. This shows in Bayesian models, in which predicting the link vs. predicting the response (i.e., making real predictions of observations) is in itself different (typically, predictions are not "ordered", i.e., they are independent at each draw).
```r
library(ggplot2)
library(rstanarm)

x <- stan_glm(mpg ~ disp, data = mtcars, refresh = 0)
newdata <- data.frame(disp = c(50, 100, 150, 200))

ypred <- as.data.frame(t(posterior_linpred(x, newdata = newdata)))
names(ypred) <- paste0("iter", 1:ncol(ypred))
ypred <- bayestestR::reshape_iterations(cbind(newdata, ypred))

ggplot(ypred, aes(x = disp, y = iter_value, group = iter_group, color = iter_group)) +
  geom_line(alpha = 0.1)

ypred <- as.data.frame(t(posterior_predict(x, newdata = newdata)))
names(ypred) <- paste0("iter", 1:ncol(ypred))
ypred <- bayestestR::reshape_iterations(cbind(newdata, ypred))

ggplot(ypred, aes(x = disp, y = iter_value, group = iter_group, color = iter_group)) +
  geom_line(alpha = 0.1)
```
Created on 2021-03-03 by the reprex package (v0.3.0)
And it felt weird to use ci_type in Bayesian contexts, in which there is no interval per se. So that's why, outside of get_predicted_ci, the argument became type (rather than ci_type). However, we then discussed an even more subtle distinction, which is whether to make predictions or to actually predict the responses, for which the output should have the same type as the original response vector (binary, for instance). This is what Bayesian "response" prediction does, and it would make sense for frequentist models as well (for instance, the caret package for machine learning automatically returns 0-1 predictions for logistic classifiers).
So we end up with three "types" of predictions (and, consequently, intervals, which are a reflection of the conceptual nature of the predictions rather than a thing in themselves): the link (only the variance of the link), predictions (taking into account more variance), and responses (which are predictions, but whose output is of the same type as the response). Now these three "types" are not fully distinct, as depending on the framework they can be equivalent: for frequentist LMs, "prediction" and "response" are the same, and their estimate is identical to the "link".
On top of that, for the "link" and "prediction" types, we have the "unit" of the output, or the "scale": whether it is expressed in the model's link unit (e.g., log-odds) or on the response scale.
It's true that what I refer to as type = "response" (i.e., the output of zeros and ones for logistic models) could possibly be seen as a modulation of the scale argument rather than of the type, because for frequentist GLMs it's just a transformation of the probabilities to their most likely extreme. But the problem is that for Bayesian models it truly behaves like a separate type (it is the output of posterior_predict())... Anyway, this "scale" arg was initially named transform = TRUE, but as @mattansb mentioned it was not particularly intuitive; scale is clearer, I think.
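To make the "most likely extreme" transformation concrete, here is a tiny sketch for a frequentist logistic model (this mimics what classification-oriented tools like caret return; it is not insight code):

```r
mod <- glm(am ~ mpg, data = mtcars, family = binomial())

# Probabilities on the response scale...
probs <- predict(mod, type = "response")

# ...collapsed to their most likely extreme, i.e. predicted 0/1 classes
classes <- as.integer(probs >= 0.5)
head(data.frame(prob = round(probs, 3), class = classes))
```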
If you have better ideas, please share! My brain is hurting thinking about this.
The type arg could be renamed to predict = "link" / "prediction" / "response" to avoid confusion with the base predict type; it's just that I 1) didn't like predict = "prediction" (redundant!) and 2) liked the partial similarity between ci_type and type (since one is a consequence of the other).
Alright, so I went back to a simple API with 2 options, predict = "link" or predict = "response", which is probably the best name and doesn't conflict with the base R arguments... For frequentist models they mostly impact the CI width, and for Bayesian models a bit more than that, but I guess having perfect symmetry in the output is inherently impossible.
So I think we can close this ^^ I tried going as far as I could but finally came back to a more sensible implementation that closely follows the logic of the R implementation.
Ok, to summarize:

- predict = "link" computes confidence intervals; predicted values can be on the link or response scale.
- predict = "response" computes prediction intervals, but predicted values are always on the response scale.
- scale = "link" computes predicted values on the link scale, but only if predict = "link".
- scale = "response" computes predicted values on the response scale, but only if predict = "link".

To get predicted values on the link scale with confidence intervals, we need: predict = "link", scale = "link".
To get predicted values on the response scale with confidence intervals, we need: predict = "link", scale = "response".
To get predicted values on the response scale with prediction intervals, we need: predict = "response" (scale will be ignored).
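Spelled out as calls, the combinations above would look something like this (purely a sketch of the proposal under discussion; the predict/scale argument names are not a settled API):

```r
# Hypothetical calls mirroring the summary above (not a finalized interface)

# link-scale predictions + confidence intervals
get_predicted(model, predict = "link", scale = "link")

# response-scale predictions + confidence intervals
get_predicted(model, predict = "link", scale = "response")

# response-scale predictions + prediction intervals (scale is ignored)
get_predicted(model, predict = "response")
```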
I'm not sure this is intuitive... why not predict and ci_type as arguments, where predict is the former scale, and ci_type is the former predict? Or maybe interval instead of ci_type? The latter having different values, like "confidence" or "prediction".
Note that scale as argument name can be misleading for, say, ZI-models. See the type argument for these two:
https://www.rdocumentation.org/packages/pscl/versions/1.5.5/topics/predict.hurdle
https://www.rdocumentation.org/packages/glmmTMB/versions/1.0.2.1/topics/predict.glmmTMB
To me, it seems it all boils down to the following: why don't we have those two arguments, which do not have (and do not share) the same values? Like:

- I_want_to_predict = c("link", "response", "zi-prob", ...)
- uncertainty_should_account_for = c("estimation", "prediction", "random_effect_variances", "all", ...)
> Ok, to summarize:
>
> - predict = "link" computes confidence intervals, predicted values can be on the link or response scale.
> - predict = "response" computes prediction intervals, but predicted values are always on the response scale
> - scale = "link" computes predicted values on the link-scale, but only if predict = "link".
> - scale = "response" computes predicted values on the response scale, but only if predict = "link".
Not really, the scale and the predict are fully independent; i.e., you can have scale "link" or "response" for both options of predict. Basically, all the computations related to the prediction and the CIs are done on the scale = "link". At the very end, only if scale = "response", they are transformed. So it's more like:
> I'm not sure this is intuitive... why not predict and ci_type as arguments,
Cf. my comment above, where it turns out that these "interval" modifiers are interval modifiers only in frequentist contexts; the underlying idea behind them is a different type of what we want to predict.
> Note that scale as argument name can be misleading for, say, ZI-models. See type argument for these two:
Well, we could also have scale = "count" or "zero"; that wouldn't strike me as inconsistent, I think?
> To me, it seems it all boils down to:
That's an interesting reframing!
We could have:

- predict = c("response", "link", "zi-prob", ...): for LMs, "response" and "link" are the same. For logistic models, these modulate the "scale" (so to speak): response gives probabilities and link gives log-odds.
- uncertainty = "parameters" (i.e., link), "prediction" (i.e., prediction interval), "random", "smooth", etc.: the only problem is, are these elements additive, or can you just have uncertainty = "random" for instance? Or just uncertainty = "prediction" without "parameters" / link?

Also, how would this work for Bayesian models? Are there arguments like re.form to modulate that, or do we need to modify the newdata to remove the data of the sources we want to skip?
Okay, I may have led you astray here, so let me give this a shot:
What do you want?
├── Simulated data from the population (which Bayes gives so easily).
└── The *expected value*.
    ├── On link scale: linear expected value + CI
    └── On response scale
        ├── with confidence / credible interval
        └── with prediction interval
@5 yes, that's e.g. type = "response" in glmmTMB, and in such cases we need a different approach to compute the CI.
What is the "expected value" with prediction interval for a Bayesian lm then?
Super easy!
```r
library(rstanarm)
library(bayestestR)

mod <- stan_glm(vs ~ wt, data = mtcars, family = "binomial", refresh = 0, seed = 333)

pred_interval <- predictive_interval(mod)
pred_expected <- posterior_epred(mod)
pred_expected <- apply(pred_expected, 2, mean)

head(cbind(pred_expected, pred_interval))
#>                   pred_expected 5% 95%
#> Mazda RX4             0.6650345  0   1
#> Mazda RX4 Wag         0.5560404  0   1
#> Datsun 710            0.7682389  0   1
#> Hornet 4 Drive        0.3991759  0   1
#> Hornet Sportabout     0.3045867  0   1
#> Valiant               0.2969128  0   1
```
Alright, I give up; this feature was too much to chew. I need help: please take a look at get_predicted_new.lm and get_predicted_new.stanreg and help me try to make it consistent and correct.
I think all the code base is already there (almost?), so mainly we have to decide on this:
https://github.com/easystats/insight/issues/310#issuecomment-790376789
Which argument (names), which options for those arguments? Then we can proceed.
> Which argument (names), which options for those arguments? Then we can proceed.
That's the crux of the issue, haha; the code itself is working pretty well now.
> What do you want?
> ├── Simulated data from the population (which Bayes gives so easily).
> └── The *expected value*.
>     ├── On link scale: linear expected value + CI
>     └── On response scale
>         ├── with confidence / credible interval
>         └── with prediction interval
(side question; how do you do these ascii decision trees???)
So we could have:
├── get_simulations(); get_simulated()
└── get_predicted()
    ├── predict = "link"
    └── predict = "response"
        ├── ci_type = "confidence"
        └── ci_type = "prediction"
Which was somewhat close to the initial design.
Though @mattansb said "You can have a single expected value also for mixture (ZI) models, I think.", and I'm not sure how the above aligns with the ZI case: I_want_to_predict = c("link", "response", "zi-prob", ...) doesn't fit in that tree.
But again, for (freq) GLMs the computation of the ci_type is done on the link scale, and then transformed, so you can technically have different ci_types on the link scale?
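A small sketch of that "compute on the link scale, then transform at the end" mechanism, using se.fit from stats::predict() and a Wald/normal approximation (illustrative, not the exact insight code):

```r
mod <- glm(am ~ mpg, data = mtcars, family = binomial())

# Fit and standard errors on the link scale
pr <- predict(mod, type = "link", se.fit = TRUE)

# Wald-type confidence interval, computed on the link scale...
lwr <- pr$fit - qnorm(0.975) * pr$se.fit
upr <- pr$fit + qnorm(0.975) * pr$se.fit

# ...and only back-transformed to the response scale at the very end
linkinv <- family(mod)$linkinv
head(data.frame(fit = linkinv(pr$fit), lwr = linkinv(lwr), upr = linkinv(upr)))
```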
Wouldn't something like this work?
├── get_simulations(); get_simulated()
└── get_predicted()
    ├── predict = "zi-prob"
    │   ├── ci_type = "confidence"
    │   └── ci_type = "prediction"
    ├── predict = "link"
    │   ├── ci_type = "confidence"
    │   └── ci_type = "prediction"
    └── predict = "response"
        ├── ci_type = "confidence"
        └── ci_type = "prediction"
+ include_smooth; include_random; etc.
(Though I kinda liked Daniel's proposal to have something like uncertainty_from = "all".)
Also, maybe we should standardize the output; currently the main output is a vector of predictions (CIs and SE are stored as attributes) and a dataframe of draws for bootstrapped / Bayesian models. Maybe we should either 1) summarize it ourselves by median/mean so that it's always a vector (draws could be stored as an attribute), or 2) always return a dataframe (with CIs and all). I'd tend toward the first option, as I see get_predicted() as a robust predict() which returns a vector, which is used for instance in performance to get indices of fit or whatnot.
Then, modelbased could focus on working with these predictions and providing a better output as a tidy dataframe that also includes the data/newdata, eventually the draws, etc.
Prediction intervals don't make sense on the link scale; I suggest that option be dropped.
Also, for mixture models (such as ZI) it would be best to always return a single value (@strengejacke said this is possible...?); I see no reason these should be an exception.
I agree we should return a vector with attributes and not a df. (Or have an output argument for users to select?)
> I agree we should return a vector with attributes and not a df. (Or have an output argument for users to select?)
For now we have an as.data.frame() method that returns a df with the SE/CI.
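As a rough sketch of that output shape (a plain numeric vector carrying its intervals as attributes, with a data-frame view built on demand; the attribute names here are only illustrative, not necessarily the ones insight uses):

```r
# Illustrative only: a predictions vector with CI bounds attached as attributes
pred <- c(0.46, 0.60, 0.49)
attr(pred, "ci_low")  <- c(0.25, 0.33, 0.27)
attr(pred, "ci_high") <- c(0.69, 0.82, 0.72)

# Behaves like a normal numeric vector for downstream use (e.g. in performance)
mean(pred)

# ...while a data frame with the CIs can be reconstructed from the attributes
data.frame(
  Predicted = as.numeric(pred),
  CI_low    = attr(pred, "ci_low"),
  CI_high   = attr(pred, "ci_high")
)
```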
Also, why not include smooth?
I made the ASCII tree manually.
> Also, why not include smooth?
For GAMs it's currently included by default, but one has the option to provide newdata without the smooth variable (in which case it will basically retrieve it and add it fixed at its mean value), or to just turn include_smooth to FALSE, which does the same. This is so the predicting function "works" even if no data for the smooth is supplied, which can be the case for visualization purposes where we only want to visualize the effect of the other fixed effects.
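A quick mgcv sketch of that behaviour (holding the smooth variable fixed at its mean so that only the other fixed effect varies; the model is just an illustration of the idea, not insight's internals):

```r
library(mgcv)

m <- gam(mpg ~ am + s(wt), data = mtcars)

# The smooth variable (wt) is held at its mean, so predictions only vary
# with the other fixed effect (am), which is handy for visualization purposes
newdata <- data.frame(am = c(0, 1), wt = mean(mtcars$wt))
predict(m, newdata = newdata, se.fit = TRUE)
```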
Let's make a decision: what's your opinion on an API (names of the args, and what they do depending on the combination users input) that would work?
- type = response (default) / link
- interval / interval_type = confidence / credible (default) or prediction {ignore and return CI if type == link}
Yes, that would be a good choice; or "predict" instead of "type", and maybe "include_sigma" (as a logical) instead of interval, if you prefer argument names different from the classic R convention.
"Include sigma" would only make sense for prediction intervals of gaussian data.
Or "include_sd"? emmeans also refers to sigma, also for glm: https://cran.r-project.org/web/packages/emmeans/vignettes/predictions.html#sd-estimate
But according to ?sigma
, I would say include_sigma
is fine.
That is because emmeans (and most other packages) only supports PIs for gaussian models. So any reference to sigma/sd means we are limiting ourselves to only those models.
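For context, a sketch of how sigma enters a gaussian prediction interval; the manual computation below should match predict.lm's "prediction" interval, which is exactly why the sigma/sd framing ties PIs to gaussian-type models:

```r
m  <- lm(mpg ~ wt, data = mtcars)
pr <- predict(m, se.fit = TRUE)

# Prediction intervals add the residual variance (sigma^2) on top of the
# uncertainty of the expected value
se_pred <- sqrt(pr$se.fit^2 + sigma(m)^2)
tcrit   <- qt(0.975, df = df.residual(m))

manual <- cbind(fit = pr$fit,
                lwr = pr$fit - tcrit * se_pred,
                upr = pr$fit + tcrit * se_pred)

head(manual)
head(predict(m, interval = "prediction"))  # should agree with the manual version
```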
> interval / interval_type = confidence / credible (default) or prediction {ignore and return CI if type == link}
It would be best if we could avoid having some arguments be ignored depending on others.
What about:
├── get_simulated()
└── get_predicted()
    ├── predict = "link" (-> ci_type = "confidence" & scale = "link" for GLMs)
    ├── predict = "relation" (-> ci_type = "confidence" & scale = "response" for GLMs)
    └── predict = "prediction" (-> ci_type = "prediction" & scale = "response" for GLMs)

This would be rather clear: depending on the user's need (visualizing some relationships in the model or making new predictions), there is only one argument, for which we can detail all the implications.
It seems like there are essentially 3 arguments that are often entangled (and a bit ambiguous): type, interval, and transform.

- type (and sometimes transform) modulates the scale; for non-linear models, they lead to predictions and intervals expressed in the model's units (e.g., log-odds) or on the response scale (probabilities).
- interval, mostly for linear models, modulates the type and width of the CI.

Now I want to get to an intuitive and consistent API. The following is to help tidy things in my mind, and to make sure there's no major blind spot in the rationale.
It seems to me that people use predictions either 1) to help understand and visualize a relationship, or 2) to make actual predictions (whether on the same data, to assess performance, or on new data). Thus, our API should reflect these two "kinds" of predictions as one of the most important distinctions. Now, how does it relate to the stats::predict() API?

- For the easy case, if the kind of prediction is "link": transform = FALSE
- If the kind of prediction is "response":

What do you think?