easystats / blog

:mega: The collaborative blog
https://easystats.github.io/blog/

comment on the p-direction post #40

Open bob-carpenter opened 4 years ago

bob-carpenter commented 4 years ago

I couldn't figure out how to leave comments on the blog, so I hope you don't mind if I open an issue relating to the post on the p-direction. Apologies in advance if this is obvious and just considered too much for the blog post!

You can compute traditional p-values for Bayesian estimators using the bootstrap. Using max a posteriori (MAP) will then produce results identical to the traditional p-value derived from penalized maximum likelihood where the prior is considered the "penalty". But MAP isn't a Bayesian estimator and doesn't have the nice properties of the two common Bayesian estimators, the posterior mean (minimizes expected square error) and posterior median (minimizes expected absolute error). Deriving a point estimate isn't particularly Bayesian, but at least the posterior mean and median have natural probabilistic interpretations as an expectation and the point at which a 50% probability obtains. With those estimators, results will vary from MAP based on how skewed the distribution is.
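A quick numerical sketch of that last point (illustrative only, not from the post): lognormal(0, 1) draws as a stand-in skewed posterior, for which the true mode is exp(-1), the median 1, and the mean exp(0.5), so the three point estimates spread apart.

```python
import numpy as np

# Mock skewed "posterior": lognormal(0, 1) draws (a stand-in, not a real model)
rng = np.random.default_rng(0)
draws = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

post_mean = draws.mean()        # minimizes expected squared error
post_median = np.median(draws)  # minimizes expected absolute error

# Crude MAP: midpoint of the tallest histogram bin
counts, edges = np.histogram(draws, bins=200, range=(0.0, 5.0))
i = np.argmax(counts)
post_map = 0.5 * (edges[i] + edges[i + 1])

print(post_map, post_median, post_mean)  # roughly 0.37 < 1.0 < 1.65
```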

A bigger issue is that MAP doesn't even exist for our bread and butter hierarchical models. The frequentist approach is to use maximum marginal likelihood (this is often called "empirical Bayes" in that the MML estimate is for the hierarchical or "prior" parameters). This leads to underestimates of lower-level regression coefficient uncertainty by construction, as you see in packages like lme4 in R.

Part of the point of Bayesian inference is to not have to collapse to a point estimate. When we want to do downstream predictive inference, we don't want to just plug in an estimate, we want to do posterior predictive inference and average over our uncertainty in parameter estimation.

Defining what it means for a prior to be "informative" is tricky and it wasn't defined in this post. This is particularly vexed because of changes of variables. A prior defined for a probability variable in [0, 1] that's flat is very different from a flat prior for a log odds variable in (-infinity, infinity). A flat prior on [0, 1] under the logit transform leads to a standard logistic prior on the log odds. That's not flat. In MLE, changing variables doesn't matter, but it does in Bayes.
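The change-of-variables point is easy to check by simulation (a sketch I've added, not part of the original comment): push uniform draws on [0, 1] through the logit and you recover the standard logistic's moments, not a flat density.

```python
import numpy as np

# Flat prior on a probability p in [0, 1], pushed through the logit,
# gives a standard logistic density on the log-odds scale -- not flat.
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 1.0, size=200_000)
log_odds = np.log(p / (1 - p))  # logit transform

# Standard logistic: mean 0, variance pi^2 / 3 (~3.29)
print(log_odds.mean(), log_odds.var())
```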

I wouldn't say that changing the threshold for significance with regularization is a good thing. While regularization can be good for error control (trading variance for bias), the whole notion of a dichotomous up/down decision through significance is the problem, not the threshold used. Also, we tend to use regularization that is not to zero, but to the population mean. This is also common in frequentist penalized maximum likelihood estimates (see, e.g., Efron and Morris's famous paper on predicting batting averages in baseball using empirical Bayes, which despite the name is a frequentist max marginal likelihood method). That's even better for error control than shrinkage to zero, but it's going to have the "wrong" effect on this notion of p-direction unless you talk about p-direction of the difference from the population estimate, rather than the random effect itself (that is, you don't want to say Connecticut is significantly different from zero, but significantly different from other states).

P.S. For reference, Gelman et al. use a similar, but not equivalent notion, in calculating posterior predictive p-values in Bayesian Data Analysis, but without flipping signs (so that either values near 0 or near 1 are evidence the model doesn't fit the data well). These are not intended to be used in hypothesis tests, though, just as a diagnostic.

mattansb commented 4 years ago

Hi Bob,

Thanks for your comment!

I generally agree with your comment - MAPs have been shown to be poor estimates (but their appeal as the "most probable value" and their close match with MLEs is just too tempting, I guess).

The "penalization" placed by the prior is not only "good" for error rates, but also for estimation. One failure of frequentist methods is their lack of incorporation of priors (be they from previous data collection, or from subjective expectations), making all decisions and estimations based on the observed data alone.

We really should add a comment section (Disqus for blogdown?)...

strengejacke commented 4 years ago

Bob,

thank you for your comment! I'm not sure if you have noticed our accompanying paper related to the post, Indices of Effect Existence and Significance in the Bayesian Framework (there was a kind of "mutual inspiration" between the development of the bayestestR package and the writing of the paper, hence the relationship...).

Deriving a point estimate isn't particularly Bayesian, but at least the posterior mean and median have natural probabilistic interpretations as an expectation and the point at which a 50% probability obtains. With those estimators, results will vary from MAP based on how skewed the distribution is.

I think what is important, from our perspective, not only in Bayesian inference but also in a frequentist framework, is to take the uncertainty of estimation into account (or even to focus on it). Thus we suggest using at least two indices of effect "existence" and "significance" (not necessarily in the sense of statistical significance). The pd is one of these possible indices that works well, in particular due to its 1:1 correspondence to the p-value, which makes Bayesian methods more approachable for "non-Bayesians". I guess this blog post could not carry this bigger picture of "focus on uncertainty"...

This leads to underestimates of lower-level regression coefficient uncertainty by construction, as you see in packages like lme4 in R.

You mean shrinkage? But isn't that a desired property of mixed models to avoid overly "strong effects" (estimates)?

When we want to do downstream predictive inference, we don't want to just plug in an estimate, we want to do posterior predictive inference and average over our uncertainty in parameter estimation.

We had a blog post some months ago to which Aki responded (via email). He wrote about predictive model checking, but though I have read some papers by Aki and you (just cursorily), I did not fully understand how to best use posterior predictive inference (I hope I understood you correctly here).

Part of the point of Bayesian inference is to not have to collapse to a point estimate.

True, but I think this should apply to the frequentist framework as well.

A prior defined for a probability variable in [0, 1] that's flat is very different from a flat prior for a log odds variable in (-infinity, infinity).

That's a good point, in particular since we are working on a paper on how to use informative priors in rstanarm, and how different priors affect the model results.

but it's going to have the "wrong" effect on this notion of p-direction unless you talk about p-direction of the difference from the population estimate, rather than the random effect itself

I think I got what you mean. So it would make sense to "warn" users of p_direction() that - if applied to random effects - the result indicates the probability of being above/below the "group" average (of states or whatever)...

bob-carpenter commented 4 years ago

I guess this blog post could not carry this bigger picture of "focus on uncertainty"...

That's something I'm 100% behind :-)

One failure of frequentist methods is their lack of incorporation of priors (...), making all decisions and estimations based on the observed data alone.

This isn't true. While a strict philosophical frequentist won't let you talk about a distribution over a parameter (so no prior or posterior distributions), they're totally OK with penalized maximum likelihood. That is, they're often OK introducing some bias for reduced variance. Given parameters theta, data y and likelihood p(y | theta), the max likelihood estimate is

theta* = ARGMAX_{theta} log p(y | theta).

A penalized maximum likelihood estimate just adds a penalty term. For instance, the equivalent of a normal prior would be an L2 penalty, e.g.,

theta* = ARGMAX_{theta} log p(y | theta) - 0.5 * lambda * SUM_{n in 1:N} theta[n]^2.

The resulting penalized MLE is equivalent to the MAP estimate of a Bayesian model with theta assigned a zero-mean normal prior with precision lambda (i.e., scale 1 / sqrt(lambda)).
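That equivalence can be checked numerically with a made-up conjugate toy model (my sketch, not Bob's code): y ~ normal(theta, 1) with the L2 penalty above. The penalized MLE found by brute-force grid search matches the closed-form MAP of the model with a precision-lambda normal prior on theta.

```python
import numpy as np

# Toy data: y ~ normal(theta, 1), with L2 penalty 0.5 * lam * theta^2
rng = np.random.default_rng(2)
y = rng.normal(1.5, 1.0, size=20)
lam = 4.0

# Penalized log likelihood evaluated on a dense grid (crude, dependency-free)
grid = np.linspace(-5.0, 5.0, 20_001)
obj = -0.5 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0) - 0.5 * lam * grid ** 2
theta_star = grid[np.argmax(obj)]     # penalized MLE

# Conjugate closed form: posterior is normal, so MAP = posterior mean
theta_map = y.sum() / (len(y) + lam)
print(theta_star, theta_map)          # agree up to grid resolution
```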

Here's a link to the classic Efron and Morris (1975) paper about Stein's estimator. They go much further and build a hierarchical model to which they apply max marginal likelihood by marginalizing out the lower-level coefficients and optimizing the hierarchical regularization parameters. This is the same as in modern systems like lme4 (in R). Although that technique is often called "empirical Bayes", it's a point estimation technique based on optimization and totally kosher among frequentists.

The pd is one of these possible indices that works well, in particular due to its 1:1 correspondence to the p-value

It kind of looks like a p-value, but it doesn't act like a p-value, so I fear this is just going to confuse people. Worse yet, I fear it may encourage them to think in terms of a binary significant/not-significant decision, which is what we want to leave behind with the frequentists. Instead, we want to do something like decision analysis (e.g., as described in Bayesian Data Analysis) to make decisions, which considers not only posterior uncertainty, but magnitude and its effect on things we care about like utility of decisions.

You mean shrinkage? But isn't that a desired property of mixed models to avoid overly "strong effects" (estimates)?

No, I mean regularization to population means. For example, see the baseball ability estimates from the Efron and Morris paper I link above. The idea is that you only have a few dozen observations per player, so you want to shrink not to zero, but to the population average. Then if you see no data, your guess is the population average, not zero.

The point of mixed models is to figure out how much pooling to do between no pooling (infinite hierarchical variance) on one side to complete pooling (hierarchical variance zero) on the other. I discuss this in my Stan case study that recasts Efron and Morris in Bayesian terms.
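Shrinkage toward the population mean (rather than zero) can be sketched like this, in the spirit of the Efron and Morris batting-average example; the numbers and the simple moment-based pooling weight here are my own made-up illustration, not their actual estimator.

```python
import numpy as np

# Hypothetical abilities for 18 "players", ~45 noisy observations each
rng = np.random.default_rng(3)
true_means = rng.normal(0.27, 0.02, size=18)
obs = rng.normal(true_means, 0.07, size=(45, 18)).mean(axis=0)  # group means

grand = obs.mean()                       # population (grand) mean
sigma2_within = 0.07 ** 2 / 45           # variance of each group's sample mean
sigma2_between = max(obs.var() - sigma2_within, 1e-9)
w = sigma2_between / (sigma2_between + sigma2_within)  # pooling weight in (0, 1)

# Shrink each group estimate toward the grand mean, not toward zero:
# with no data (w = 0) the guess is the population average.
shrunk = grand + w * (obs - grand)
print(w)
```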

where Aki responded (via email)

Glad to hear I'm not the only person obsessing about definitions on the web.

I did not fully understand how to best use posterior predictive inference (I hope I understood you correct here).

The idea is that you average over your posterior, so that's

p(y' | y) = INTEGRAL p(y' | theta) * p(theta | y) d.theta

which in MCMC terms is just

p(y' | y) =approx= 1/M SUM_{m in 1:M} p(y' | theta(m))

where theta(m) is a draw from the posterior p(theta | y). It's not out yet, but I have a PR that adds new chapters to the Stan user's guide to try to explain this in simpler terms than in sources like Bayesian Data Analysis (of which Aki is a co-author). I also have a simple case study where I compare plugging in the MAP estimate, the posterior mean estimate, or doing full Bayesian posterior predictive inference for a simple logistic regression, and posterior predictive inference dominates the other methods when so-called "proper scoring metrics" are used (like log loss or square error).
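The Monte Carlo average above is easy to see in a conjugate toy model (an illustration I've added, not Bob's case study): y ~ bernoulli(theta) with a beta(1, 1) prior, where p(y' = 1 | y) = E[theta | y] is known in closed form and the draw-averaged estimate recovers it.

```python
import numpy as np

# Toy data: 6 successes, 2 failures
rng = np.random.default_rng(4)
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Exact posterior is beta(1 + 6, 1 + 2); exact predictive prob is 7 / 10
theta_draws = rng.beta(1 + y.sum(), 1 + (1 - y).sum(), size=100_000)

# p(y' | y) =approx= 1/M SUM_m p(y' | theta(m)); here p(y'=1 | theta) = theta
pred_one = theta_draws.mean()
print(pred_one)  # close to 0.7
```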

Part of the point of Bayesian inference is to not have to collapse to a point estimate.

True, but I think this should apply to frequentist framework as well.

How so?

P.S. You might want to compare notes with Rasmus Bååth's work on his Bayesian first aid package, which translates many classical notions like hypothesis tests to Bayesian frameworks.

mattansb commented 4 years ago

Thanks Bob for all the explanations! It wasn't clear to me until now how hierarchical shrinkage actually works!

It kind of looks like a p-value, but it doesn't act like a p-value

Can you elaborate on this? To me, they hold the same amount of information (for p-values in NHST, at least), and they behave the same in the sense that they are consistent under H1 (the p-value gets smaller, and pd gets bigger), but not under H0 (the p-value is uniform on [0, 1], pd is uniform on [0.5, 1]). (I totally agree that it might encourage dichotomous thinking, but as @strengejacke said, we recommend not reporting only the pd.)
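For what it's worth, the numerical 1:1 mapping (two-sided p = 2 * (1 - pd) for a roughly symmetric posterior) checks out by simulation; this is my own sketch, not bayestestR code.

```python
import numpy as np

# Hypothetical posterior draws: normal(0.4, 0.2), i.e., 2 sd away from zero
rng = np.random.default_rng(5)
draws = rng.normal(0.4, 0.2, size=1_000_000)

pd = max((draws > 0).mean(), (draws < 0).mean())  # probability of direction
p_equiv = 2 * (1 - pd)                            # implied two-sided p-value

# The tail below zero is Phi(-2) ~= 0.0228, so p_equiv should be near 0.0455
print(pd, p_equiv)
```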

bob-carpenter commented 4 years ago

On Feb 28, 2020, at 2:18 AM, Mattan S. Ben-Shachar notifications@github.com wrote:

Thanks Bob for all the explanations! It wasn't clear to me until now how hierarchical shrinkage actually works!

It kind of looks like a p-value, but it doesn't act like a p-value

Can you elaborate on this?

The basis is completely different.

For p-values, the basis is the distribution of an estimator under varying data samples, i.e., p(f(y)) where y = y[1], ..., y[N] is the data and f is some estimator. The data y are assumed to be random variables so that f(y) is also a random variable. For example, the maximum likelihood estimator for standard deviation is

sigma*(y) = sqrt(sum((y - mean(y))^2) / N).

But it's biased to the low side because it's based on the sample mean mean(y). Luckily, there's a well-known correction,

sigma-hat(y) = sqrt(sum((y - mean(y))^2) / (N - 1)).

These are two different estimators of sigma from data and they will have slightly different properties.
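A simulation makes the bias concrete (my sketch, comparing the variance estimates under the square roots above): with N = 5 and sigma = 2, the 1/N estimator averages sigma^2 * (N - 1) / N = 3.2, while the 1/(N - 1) version averages the true 4.0. (Taking the sqrt to get sigma itself reintroduces a small bias even after the correction.)

```python
import numpy as np

rng = np.random.default_rng(6)
N, true_sigma = 5, 2.0
samples = rng.normal(0.0, true_sigma, size=(100_000, N))  # 100k replications

# Variance estimates under the two sigma estimators
dev2 = (samples - samples.mean(axis=1, keepdims=True)) ** 2
var_mle = dev2.sum(axis=1) / N            # biased low by factor (N - 1) / N
var_hat = var_mle * N / (N - 1)           # unbiased correction

print(var_mle.mean(), var_hat.mean(), true_sigma ** 2)
```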

For posteriors, the basis is a distribution of parameters conditional on data, i.e., p(sigma | y[1:N]). This is a distribution over sigma, which is now being considered "random" unlike in the frequentist version. You can ask things like what's the probability that sigma is in some interval given the posterior density, which is not legal in the frequentist methodology. Note that this doesn't mean Bayesians believe there isn't a true value for sigma, just that we treat randomness as representing epistemic uncertainty.

And while the distribution of an estimator of a parameter might look like a Bayesian posterior, they're not interpreted that way. The estimator distribution is used to ask a question about how likely it would be to get an estimate as extreme as you got from random data (that's how p-values are defined). The posterior distribution is used to ask questions like the probability of the true value falling in an interval.

mattansb commented 4 years ago

Right, computation and interpretation are (and should be) different (likelihood vs posterior prob), but it is still hard to ignore the basically 1:1 correspondence between the values, and other properties / behaviors of the p-value and pd (for the lay user, at the very least, and I think that is what we were aiming for here)...

Thanks for engaging! Definitely got a lot to mull over!

bob-carpenter commented 4 years ago

On Feb 28, 2020, at 3:17 PM, Mattan S. Ben-Shachar notifications@github.com wrote:

Right, computation and interpretation are (and should be) different (likelihood vs posterior prob),

The p-value is based on the distribution of the estimator, which is a proper random variable because it's a function of the data, which is random. This is why the bootstrap works, for example.

In contrast, the likelihood isn't even a probability distribution---it's a function of parameters for fixed data.

but it is still hard to ignore the basically 1:1 correspondence between the values, and other properties / behaviors of the p-value and pd (for the lay user, at the very least, and I think that is what we were aiming for here)...

They're not even distributions over the same thing! This is precisely the kind of analogy that I fear will confuse newcomers.

I'd suggest section 4.5 of Gelman et al.'s Bayesian Data Analysis, which is about Bayesian interpretations of other statistical methods. I'd also recommend 4.4, which is about frequency calibration of Bayesian methods, something we recommend for Stan, but is considered bad practice by both the so-called objective Bayesians and the so-called subjective Bayesians.