easystats / bayestestR

:ghost: Utilities for analyzing Bayesian models and posterior distributions
https://easystats.github.io/bayestestR/
GNU General Public License v3.0
577 stars 55 forks source link

Adding support interval (SI) #266

Closed DominiqueMakowski closed 4 years ago

DominiqueMakowski commented 4 years ago

https://www.bayesianspectacles.org/the-support-interval/

SIBF > 1 [SI_low, SI_high]: the interval of which the parameter has a BF > 1 (in which there is support for the alternative). Not sure how computationally heavy this would be. Not sure also if more appropriately named Support Interval (SI) or Confirmation Interval (CI) (the latter might add up to the confusion related to the existence of different CIs, mainly for the sake of our habits of writing CI?)

mattansb commented 4 years ago

I like this idea very much!

Looks like most of what is needed for this is already part of the bf_params code. So with a little cannibalization:

si <- function(posterior, prior, BF = 1) {
  extend_scale <- 0.05
  precision <- 2^8

  x <- c(prior, posterior)
  x_range <- range(x)
  x_rangex <- stats::median(x) + 7 * stats::mad(x) * c(-1, 1)
  x_range <- c(
    max(c(x_range[1], x_rangex[1])),
    min(c(x_range[2], x_rangex[2]))
  )

  extension_scale <- diff(x_range) * extend_scale
  x_range[1] <- x_range[1] - extension_scale
  x_range[2] <- x_range[2] + extension_scale

  x_axis <- seq(x_range[1], x_range[2], length.out = precision)

  f_prior <- logspline::logspline(prior)
  f_posterior <- logspline::logspline(posterior)
  d_prior <- logspline::dlogspline(x_axis, f_prior)
  d_posterior <- logspline::dlogspline(x_axis, f_posterior)

  relative_d <- d_posterior / d_prior

  range(x_axis[relative_d > BF])
}

library(bayestestR)
prior <- distribution_normal(1000, mean = 0, sd = 1)
posterior <- distribution_normal(1000, mean = .5, sd = .3)

# looks like is makes sense:
(si_vals <- si(posterior,prior))
#> [1] 0.03999124 1.05310270
plot(bf_parameters(prior,posterior = posterior, null = si_vals))


# more conservative
(si_vals <- si(posterior,prior, BF = 3))
#> [1] 0.3332603 0.7598336
plot(bf_parameters(prior,posterior = posterior, null = si_vals))

Created on 2019-12-29 by the reprex package (v0.3.0)

This is a basic function - it would need methods for extracting priors from stanreg/brmsfit models etc, just like in bf_params, and also a method for one sided BFs I think (but not for interval-null)

What do you think?

DominiqueMakowski commented 4 years ago

Nice! I agree with all that. I wonder what's the best API for that though, as one could imagine:

mattansb commented 4 years ago

I think we need:

DominiqueMakowski commented 4 years ago

True, they are not incompatible 😁

mattansb commented 4 years ago
library(bayestestR)
x0 <- distribution_normal(1000, mean = 0, sd = 1)
x1 <- distribution_normal(1000, mean = .5, sd = .3)

results <- si(x1,x0)
#> Loading required namespace: logspline
print.data.frame(results)
#>   CI             CI_low          CI_high
#> 1  1 0.0399912416092305 1.05310269570974

library(rstanarm)
stan_model <- stan_lmer(extra ~ group + (1 | ID), data = sleep, refresh = 0)
results <- si(stan_model)
#> Computation of Bayes factors: sampling priors, please wait...
print.data.frame(results)
#>     Parameter CI             CI_low          CI_high Effects   Component
#> 1 (Intercept)  1 -0.901647242478401  2.4186288271425   fixed conditional
#> 2      group2  1  0.604759242670623 2.60240882765198   fixed conditional

library(emmeans)
group_diff <- pairs(emmeans(stan_model, ~group))
results <- si(group_diff, prior = stan_model, BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
print.data.frame(results)
#>   Parameter CI            CI_low            CI_high
#> 1     1 - 2  3 -2.23929218153439 -0.927737308109037

I tried using the current printing method for CIs but:

  1. The ci level here should be something like BF = 1 and not 1%.
  2. It gives the wrong values for some reason...
    print.data.frame(results)
    #>   Parameter CI            CI_low            CI_high
    #> 1     1 - 2  3 -2.23929218153439 -0.927737308109037
    results
    #> # Support Interval
    #> 
    #>  Parameter        1% SI
    #>      1 - 2 [1.00, 1.00]

    Also, I couldn't think of a way to make si work with ci(..., method = "SI") (because si() has prior and BF arguments that ci() doesn't, and ci() has a ci argument that si() doesn't).


Need to add:

Advanced:

mattansb commented 4 years ago

@DominiqueMakowski @strengejacke care to check why the printing method is messed up?

Also, since this morning, I've come down with the flu... so my to-dos from above are on hold... πŸ€’πŸ€§

mattansb commented 4 years ago

A Thought: not sure si has to be a method of ci, as it does not return a credible interval - which is concerned with the posterior; instead the SI is concerned, like the BF, with the integration of the priors with the data.

DominiqueMakowski commented 4 years ago

care to check why the printing method is messed up?

Will do asap ☺️

A Thought: not sure si has to be a method of ci, as it does not return a credible interval - which is concerned with the posterior; instead the SI is concerned, like the BF, with the integration of the priors with the data.

That's true if ci() strictly refers to credible intervals. But that's already not really the case since the same method is used for confidence intervals for freq models. So that wouldn't be that big of a stretch to extend it to cover confirmation intervals. But as I said above, I am not sure of my opinion about this multiplication of what the CI acronym refers to πŸ˜•

In any case, do take a rest, you've done more than enough for this year 😁 Wishing you a swift recovery!!

mattansb commented 4 years ago

Alright, this is only missing a plotting method (that can be added later) and the ci(..., method = "SI").

One problem with adding ci(..., method = "SI") is that the ci argument defaults to 0.89, whereas the corresponding BF argument in si() defaults to 1. Thoughts?

mattansb commented 4 years ago

Was able to solve the printing issue (I've adopted the terminology from the original paper in both the function and the printing):

library(bayestestR)

prior <- distribution_normal(1000, mean = 0, sd = 1)
posterior <- distribution_normal(1000, mean = .5, sd = .3)

si(posterior,prior, BF = 1/3)
#> Loading required namespace: logspline
#> # Support Interval
#> 
#>  BF = 0.33333 SI
#>    [-0.12, 1.21]

Created on 2019-12-31 by the reprex package (v0.3.0)

strengejacke commented 4 years ago

The plotting method is a bit trickier than just wrapping plot.bf_parameters(). For instance, following model:

library(bayestestR)
library(rstanarm)
model <- stan_glm(mpg ~ gear + wt + cyl, data = mtcars)
si(model)
#> Computation of Bayes factors: sampling priors, please wait...
#> Loading required namespace: logspline
#> # Support Interval
#> 
#>    Parameter      BF = 1 SI
#>  (Intercept) [30.52, 54.75]
#>         gear [-2.49,  1.46]
#>           wt [-5.32, -1.47]
#>          cyl [-2.55, -0.50]

returns multiple si's, and the null-argument in bf_parameters() only accepts a single vector. Possible solutions:

a) save all data frames from posterior and prior samples as attribute to the returned si-object (making it very large) b) export bayestestR::.update_to_priors() (i.e. make it accessible from see), so we can sample priors again from the model object when we plot the data.

I think we need all prior/posterior data for each parameter, to correctly plot the different si's.

mattansb commented 4 years ago

Sorry, when I meant a wrapper, I meant a wrapper to the internal bayestestR:::.make_BF_plot_data function which makes the data for plotting bf_parameters... which is what I did (just pushed to see):

library(bayestestR)

prior <- distribution_normal(1000, mean = 0, sd = 1)
posterior <- distribution_normal(1000, mean = .5, sd = .3)

x <- si(posterior, prior, BF = 3)
#> Loading required namespace: logspline
plot(x)


library(rstanarm)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> rstanarm (Version 2.19.2, packaged: 2019-10-01 20:20:33 UTC)
#> - Do not expect the default priors to remain the same in future rstanarm versions.
#> Thus, R scripts should specify priors explicitly, even if they are just the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())
#> - bayesplot theme set to bayesplot::theme_default()
#>    * Does _not_ affect other ggplot2 plots
#>    * See ?bayesplot_theme_set for details on theme setting
contrasts(sleep$group) <- contr.bayes # see vingette
stan_model <- stan_lmer(extra ~ group + (1 | ID), data = sleep, refresh = 0)
x <- si(stan_model, BF = 6)
#> Computation of Bayes factors: sampling priors, please wait...
plot(x, show_intercept = TRUE) +
  ggplot2::coord_cartesian(xlim = c(-5,5)) # wide priors suckkkkkkk (for plotting)

Created on 2019-12-31 by the reprex package (v0.3.0)

mattansb commented 4 years ago

This looks more like the bf_params plot than the hdi plot, but I think that given the close relationship between SIs and BFs that works quite well!

mattansb commented 4 years ago

Alright, I think I'm done here - added SI as a method to ci() and to descrive_posterior():

library(bayestestR)

# Numeric -----------------------------------------------------------------

prior <- distribution_normal(1000, mean = 0, sd = 1)
posterior <- distribution_normal(1000, mean = .5, sd = .3)

set.seed(5)
si(posterior, prior, BF = 3)
#> Loading required namespace: logspline
#> # Support Interval
#> 
#>     BF = 3 SI
#>  [0.33, 0.76]

set.seed(5)
ci(posterior, method = "SI", prior = prior, BF = 3)
#> # Support Interval
#> 
#>     BF = 3 SI
#>  [0.33, 0.76]

set.seed(5)
describe_posterior(posterior, ci_method = "SI", bf_prior = prior, BF = 3)
#>   Parameter Median CI    CI_low   CI_high    pd ROPE_CI ROPE_low ROPE_high
#> 1 Posterior    0.5  3 0.3332603 0.7598336 0.953      89     -0.1       0.1
#>   ROPE_Percentage
#> 1      0.04040404

# Model -------------------------------------------------------------------

library(rstanarm)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> rstanarm (Version 2.19.2, packaged: 2019-10-01 20:20:33 UTC)
#> - Do not expect the default priors to remain the same in future rstanarm versions.
#> Thus, R scripts should specify priors explicitly, even if they are just the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())
#> - bayesplot theme set to bayesplot::theme_default()
#>    * Does _not_ affect other ggplot2 plots
#>    * See ?bayesplot_theme_set for details on theme setting
model <- stan_glm(mpg ~ wt, data = mtcars, chains = 2, iter = 200, refresh = 0)
#> Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
#> Running the chains for more iterations may help. See
#> http://mc-stan.org/misc/warnings.html#bulk-ess
#> Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
#> Running the chains for more iterations may help. See
#> http://mc-stan.org/misc/warnings.html#tail-ess

set.seed(5)
si(model, BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
#> # Support Interval
#> 
#>    Parameter      BF = 3 SI
#>  (Intercept) [33.71, 42.02]
#>           wt [-6.44, -4.25]

set.seed(5)
ci(model, method = "SI", BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
#> # Support Interval
#> 
#>    Parameter      BF = 3 SI
#>  (Intercept) [33.71, 42.02]
#>           wt [-6.44, -4.25]

set.seed(5)
describe_posterior(model, ci_method = "SI", BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
#> # Description of Posterior Distributions
#> 
#>    Parameter Median CI CI_low CI_high pd ROPE_CI ROPE_low ROPE_high
#>  (Intercept)  37.39  3  33.71   42.02  1      89   -0.603     0.603
#>           wt  -5.36  3  -6.44   -4.25  1      89   -0.603     0.603
#>  ROPE_Percentage  Rhat ESS
#>                0 0.990 267
#>                0 0.991 303

# emmeans (to be safe) ----------------------------------------------------

library(emmeans)
#> Welcome to emmeans.
#> NOTE -- Important change from versions <= 1.41:
#>     Indicator predictors are now treated as 2-level factors by default.
#>     To revert to old behavior, use emm_options(cov.keep = character(0))
et_wt <- emtrends(model, ~1, "wt")

set.seed(5)
si(et_wt, prior = model, BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
#> # Support Interval
#> 
#>  Parameter      BF = 3 SI
#>    overall [-6.54, -4.34]

set.seed(5)
ci(et_wt, method = "SI", prior = model, BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
#> # Support Interval
#> 
#>  Parameter      BF = 3 SI
#>    overall [-6.54, -4.34]

set.seed(5)
describe_posterior(et_wt, ci_method = "SI", bf_prior = model, BF = 3)
#> Computation of Bayes factors: sampling priors, please wait...
#>   Parameter    Median CI    CI_low   CI_high pd ROPE_CI ROPE_low ROPE_high
#> 1   overall -5.364072  3 -6.535697 -4.338882  1      89     -0.1       0.1
#>   ROPE_Percentage
#> 1               0

Created on 2020-01-02 by the reprex package (v0.3.0)

strengejacke commented 4 years ago

In the paramaters readme, we say that

The column names of the returned data frame are specific to their content.

In this case, CI would be BF and CI_* would be SI_*. Should we also make the column names context specific? This is very likely to break code. Or we have a print() method that changes the column names for the output only.

mattansb commented 4 years ago

Currently the printing method of bayestestr_si takes care of that... Anyway, this is an edge case where I think we can let it slide, but I can try and see how much code this will break over the weekend.

strengejacke commented 4 years ago

Sorry for confusion, I meant the describe_posterior() method. I committed a change that should print context specific column names.

strengejacke commented 4 years ago

Not sure how time consuming sampling from priors can become, but would it make sense to use simulate_priors() instead?

mattansb commented 4 years ago

Sampling from priors take some time - but less time than sampling from the posterior. I haven't really played with the simulation function yet. One problem with that method is that it doesn't account for multivariate dependency in the parameter space, so it will always lead to missestimated Bayes factors / SIs for emmGrid objects, but I'm not sure how it holds up for univariate estimation of single parameters... I would have to look into that + see if I can speed up prior sampling.

DominiqueMakowski commented 4 years ago

One problem with that method is that it doesn't account for multivariate dependency in the parameter space

the simulate priors have so easy implementations that I'd be surprised if Paul didn't think about this solution and that sampling for prior was equivalent but slower than it. Until we check, I'd agree that it's better to stick with the original

strengejacke commented 4 years ago

ok. So we could actually close this issue... And have to check why travis fails?!? Local checks are passing.

DominiqueMakowski commented 4 years ago

And have to check why travis fails?

I'll investigate πŸ•΅

mattansb commented 4 years ago

And have to check why travis fails?!?

Why indeed????

if Paul didn't think about this solution [...] I'd agree that it's better to stick with the original

Yes, it seems to be better over all, even if slower. I was able to speed it up a bit for some cases. Let's leave it at that.

Closing this issue now.

mattansb commented 4 years ago

@strengejacke made a minor change to the printing method of describe_posterior so that there wouldn't be two columns labeled BF, so that this:

library(bayestestR)
library(rstanarm)

model <- stan_glm(mpg ~ wt + gear, data = mtcars, chains = 2, iter = 200, refresh = 0)

describe_posterior(model, ci_method = "SI", test = "bf")
#> Computation of Bayes factors: sampling priors, please wait...
#> Loading required namespace: logspline
#> # Description of Posterior Distributions
#> 
#>    Parameter Median BF SI_low SI_high       BF  Rhat ESS
#>  (Intercept) 39.064  1  28.91   49.11 9.72e+08 0.997 164
#>           wt -5.450  1  -6.92   -3.59 6.05e+02 0.998 170
#>         gear -0.405  1  -2.48    1.72 4.76e-02 0.997 166

Created on 2020-01-03 by the reprex package (v0.3.0)

is now this:

library(bayestestR)
library(rstanarm)

model <- stan_glm(mpg ~ wt + gear, data = mtcars, chains = 2, iter = 200, refresh = 0)

describe_posterior(model, ci_method = "SI", test = "bf")
#> Computation of Bayes factors: sampling priors, please wait...
#> Loading required namespace: logspline
#> # Description of Posterior Distributions
#> 
#>    Parameter Median SI SI_low SI_high       BF  Rhat ESS
#>  (Intercept) 40.102  1  27.57   50.15 2.02e+06 0.994 152
#>           wt -5.500  1  -7.26   -3.54 1.22e+03 0.995 131
#>         gear -0.444  1  -2.62    1.84 7.82e-02 0.995 148

Created on 2020-01-03 by the reprex package (v0.3.0)