Hypothesis testing in the Bayesian framework

mattansb commented 5 years ago

From here..

Hypothesis testing framework

Probability of direction

pd is a measure of existence based only on the posterior - it is the maximal percent of the posterior that is on some side of zero. In a hypothesis testing framework, it tests

H0: theta = 0
H1: theta>0 OR theta<0 (not theta≠0, as it takes the maximal side only)

(??the pd is a measure of certainty - how certain we are that theta is not 0 (by testing how probable is the most probable sign of theta).

ROPE

ROPE is a measure of significance that is based on the posterior and on some pre-conceived notion of a "small" effect. In a hypothesis testing framework, it tests

H0: theta ∈ ROPE
H1: theta ∉ ROPE

Bayes factors

BF is a relative measure of evidence. In the case where one model is the point null, it tests the relative probability of the data between the models. In a hypothesis testing framework, it tests

H1: Data were generated by M₁ (as specified by priors)
H2: Data were generated by M₂ (as specified by priors)

p-MAP

There's also the p-MAP, which isn't getting much love by us... We are waiting for the feedback from Prof Jeff Mills from which this index was inspired.

What is common between the indices

Are comparative: ROPE, BF, p-MAP
Examine the posterior values: pd, ROPE, p-MAP
Tell you if an effect exists: pd, p-MAP, BF (when comparing to the point-null)
Tell you if an effect is significant: ROPE, BF (when comparing to a interval-null, which a user would have to construct themselves...)
Can be used to "test" a single parameters: all (BF by comparing to a null constraint on that parameter).
Can be used to compare models: BF
Can be used to support the null: BF, ROPE

I think the main vignette and guidelines should be along one or more of these ^ lines...

It would be interesting to formalize it and develop it even further. Maybe starting with a blogpost? "Hypothesis testing in the Bayesian framework". Such conceptualization could potentially also be integrated as a paragraph in the intro of the significance paper.

DominiqueMakowski commented 5 years ago

@mattansb out of curiosity, I was wondering what do you think of the p-MAP, which Jeff Mills calls "the Bayesian p-value" in his talk? As it seems to offer in theory the "best of both worlds", i.e., it can give evidence for the null (which I remember you mentioned as the main benefit of BF), but it also does not suffer from all the limitations of the BFs. Moreover, it is straightforward to compute, understand and seems (at least that's what Jeff suggests) mathematically grounded...

mattansb commented 5 years ago

I have some thoughts:

First, I'm not sure why this is dubbed a "p-value" - it is a ratio (because the denominator is the MAP it is by definition <= 1, but still not a probability), making it more like a BF than a p-value.

Second, I don't see how it lends itself to support the null any more than a p-value - at best, when MAP is the null, the p-MAP is 1. This is also true for p-values - when the estimate is the null, the p-value is 1. Since the latter cannot be used to support the null, I don't see how the former can. (I guess this is why it is the Bayesian p-value). Also, because it answers the question "how much more probable is the MAP than the null" it is by definition looking for evidence for the anything (the "best case scenario" via the MAP) over the null, but cannot provide evidence for the null.

Finally, it does not really deal with the problem of choosing a prior - it only deal with a problem of choosing a weak/non-informative prior. But when you have strong priors you get a reversed Jeffreys-Lindley-Bartlett paradox:

library(bayestestR)
library(rstanarm)

stan_glm_shhh <- function(...){
  capture.output(fit <- stan_glm(...))
  fit
}

X <- rnorm(100)
Y <- X + rnorm(100, sd = 0.1)

cor(X,Y) # data points to a strong effects
#> [1] 0.9953305

fit <- stan_glm_shhh(Y ~ X, 
                     prior = normal(0,0.001)) # strong priors for null effect

p_map(fit) # points to no effect!
#> # MAP-based p-value
#> 
#>   (Intercept): 0.978
#>   X          : 1.000

X <- rnorm(10000)
Y <- rnorm(10000)

cor(X,Y) # data points to no effect
#> [1] -0.0205174

fit <- stan_glm_shhh(Y ~ X, 
                     prior = normal(1,0.001)) # strong priors against null effect

p_map(fit) # points to a true effect!
#> # MAP-based p-value
#> 
#>   (Intercept): 0.713
#>   X          : 0.000

^{Created on 2019-06-19 by the reprex package (v0.3.0)}

DominiqueMakowski commented 5 years ago

First, I'm not sure why this is dubbed a "p-value" - it is a ratio (because the numerator is the MAP it is by definition >= 1, but still not a probability), making it more like a BF than a p-value.

I agree, a "Bayesian p-value" refers IMO more to the pd than to this ratio.

Also, because it answers the question "how much more probable is the MAP than the null" it is by definition looking for evidence for the anything (the "best case scenario" via the MAP) over the null, but cannot provide evidence for the null.

good point.

But when you have strong priors you get a reversed Jeffreys-Lindley-Bartlett paradox:

Interesting interesting.

As Justin Bieber recently challenged Tom Cruise for an MMA fight in an octogone, I am thinking about organizing a tournament with Wagenmakers, Mills, Kruschke, the stan people, you and Daniel. I will be taking the bets 💰 💰

mattansb commented 5 years ago

Just like in the Bieber vs. Cruise case, I'm sure its obvious who would be the ultimate MBA (Mixed Bayesian arts) champion 😜

BTW, the BF here performs here as expected: For the first model, the priors of both the point-null model and the "alternative" are so similar, that BF = 1:

#> Computation of Bayes factors: estimating marginal likelihood, please wait...
#> Bayes factor analysis
#> ---------------------             
#> [2] X    1.01
#> 
#> Against denominator:
#>       [1] (Intercept only)   
#> ---
#> Bayes factor type:  marginal likelihoods (bridgesampling)

And for the second model, the priors of point-null model are wayyyy more appropriate that the alternative model, that BF <<< 1:

#> Computation of Bayes factors: estimating marginal likelihood, please wait...
#> Bayes factor analysis
#> ---------------------                 
#> [2] X    6.46e-14
#> 
#> Against denominator:
#>       [1] (Intercept only)   
#> ---
#> Bayes factor type:  marginal likelihoods (bridgesampling)

(Note that I used the compare-models function and not the Savage-Dickey function because for the second model the prior and posterior samples were both so close together and far from 0 that estimation failed (NaN), but for the first model BF was ~1).

mattansb commented 5 years ago

BTW, the reversed Jeffreys-Lindley-Bartlett paradox holds also for pd and ROPE, CI, Median... and any other measure that is only based on the posterior.

To summarize:

If you don't have any priors (weak / non-informative), it is silly to use BFs as their whole point is to compare two sets of priors. In such cases, the posterior is driven almost 100% by the observed data, and thus it makes sense to explore the posterior (with pd, ROPE, p-MAP just to see if what you observe isn't 0).
If you have super strong priors, inferring from the posterior is silly as the posterior is driven almost 100% by your priors.,... But it would make sense to see how your strong priors hold up against another set of priors.
If you have some (even minimally) informed priors, you get to do both: the posteriors are (probably) mostly data driven, so it's a good idea to look at them (for estimating, and maybe also seeing if they differ from 0), and also you can compare different models with different priors to see which is better.

FIN

easystats / blog