elizabethpankratz / bayes_stat


Lesson 1: Bayesian reasoning, anatomy, priors, and frequentism #2

Closed elizabethpankratz closed 1 year ago

elizabethpankratz commented 1 year ago

Misc notes:

Familiar territory, discussion-prompting questions: What do we know about frequentist analyses?

elizabethpankratz commented 1 year ago

Starting point/discussion: What do we know about Bayesianism vs. frequentism? What have you heard about how frequentists think about probability?

Posteriors, Bayesian updating, and convergence to the data

As many good stats classes do, we'll start out with a coin-flipping example. Data: flip a coin 100 times, get 80 heads.

Freq: MLE of the probability of getting heads = 0.8, and that's it: a point estimate. (Likelihood ratio test to see whether the likelihood of 80/100 is greater under the MLE model vs. the null model.)

Bayes: we keep in mind all possible probabilities, not just the maximum-likelihood one, and we develop a probability distribution over all the possible probabilities of getting heads. The width of this distribution reflects how certain we are about the estimate. Here's how it might look: [plot of beta pstr over p, peak at 0.8]

How do we get there? Let's backtrack and think about observing one coin flip at a time, and let's do this maybe just for the first eight flips.

We'll begin with the assumption that all values of p are equally probable, since we don't know anything about this coin in advance.

[simulate updating the posterior after each flip, a la McElreath 2020 Fig 2.5, p. 30]
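In case it's useful, a minimal sketch of this simulation (Python with numpy/matplotlib assumed; the particular flip sequence is invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of candidate values for p, the probability of heads.
p_grid = np.linspace(0, 1, 200)

# Uniform prior: every candidate value of p starts out equally probable.
belief = np.ones_like(p_grid)
belief /= belief.sum()

# The first eight flips (1 = heads, 0 = tails), invented for illustration.
flips = [1, 0, 1, 1, 1, 0, 1, 1]

fig, axes = plt.subplots(2, 4, figsize=(12, 5), sharey=True)
for ax, flip in zip(axes.flat, flips):
    # Likelihood of this single flip under each candidate p.
    likelihood = p_grid if flip == 1 else 1 - p_grid
    # Bayesian updating: new belief is proportional to likelihood * prior.
    belief = likelihood * belief
    belief /= belief.sum()
    ax.plot(p_grid, belief)
    ax.set_title(f"after flip {flip}")
plt.tight_layout()
plt.show()
```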

OK, so this is how Bayesian updating works. We have some initial belief about how likely different values of p are, and as the data come in, we update this belief incrementally. This initial belief is the prior, and here it was uniform.

Let's see it with a different prior, now one where we really strongly believe that we have a fair coin.

[example with very precise prior—gets swayed less]

Priors are one of the mystical things about Bayesian reasoning. Since priors can affect your predictions, doesn't that seem like an unscientific way to do data analysis? Well, for one, if you were gonna fuck with your analysis, doing it through the priors would be an embarrassingly transparent way to do it: an easy way to get caught. And for another, as the amount of data increases, the model's reliance on the priors decreases.

[show what happens with both priors as we get close to 80/100 observations—or however many we need for data to overwhelm prior]
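A sketch of how that comparison could be generated (scipy assumed; Beta(50, 50) stands in for the strong fair-coin prior, and the observed proportion is held at 80% heads as n grows):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

p = np.linspace(0, 1, 500)

# Beta/Bernoulli is a conjugate pair, so the posterior after h heads
# in n flips under a Beta(a, b) prior is Beta(a + h, b + n - h).
priors = {"flat: Beta(1, 1)": (1, 1), "fair coin: Beta(50, 50)": (50, 50)}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, [10, 100, 1000]):
    h = int(0.8 * n)  # hold the observed proportion at 80% heads
    for label, (a, b) in priors.items():
        ax.plot(p, beta.pdf(p, a + h, b + n - h), label=label)
    ax.set_title(f"n = {n}, heads = {h}")
    ax.legend()
plt.tight_layout()
plt.show()
```

With n = 10 the two posteriors disagree a lot; by n = 1000 they've both converged toward the data and largely agree.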

Notation and anatomy

Now, let's formalise these models using some typical notation and talk in more detail about the bits and pieces in the model.

Model 1:

heads_n ~ Bernoulli(p)
p ~ Uniform(0, 1)

AKA: p ~ Beta(1, 1), since Beta(1, 1) is exactly the uniform distribution on [0, 1].

Model 2:

heads_n ~ Bernoulli(p)
p ~ Beta(10, 10) (or whatever: some prior concentrated around 0.5)

The first line is the likelihood (how the data are distributed); the second is the prior (how the parameter that specifies the likelihood is distributed). These words might sound familiar, because they also appear in Bayes' Theorem.
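For reference, the theorem with the pieces labelled (just the standard statement):

$$
\underbrace{P(p \mid \text{data})}_{\text{posterior}} \;=\; \frac{\overbrace{P(\text{data} \mid p)}^{\text{likelihood}} \;\times\; \overbrace{P(p)}^{\text{prior}}}{\underbrace{P(\text{data})}_{\text{evidence}}}
$$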

Note: we run this model backward, making inferences about p based on the data. But Bayesian models are also generative: they can be run forward to generate more data that the model thinks is plausible, given the parameters we hand it. In fact, running models forward is gonna be a useful check for figuring out, e.g., whether we have chosen decent priors.
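A minimal sketch of running Model 1 forward (prior predictive simulation; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior predictive simulation: draw p from the prior, then generate
# a dataset that the model considers plausible under that draw.
n_sims, n_flips = 1000, 100
p_draws = rng.uniform(0, 1, size=n_sims)    # Model 1's flat prior on p
heads = rng.binomial(n=n_flips, p=p_draws)  # forward-simulated head counts

# Under the flat prior, basically any number of heads out of 100 is
# plausible; a tighter prior would concentrate these counts.
print(np.quantile(heads, [0.025, 0.5, 0.975]))
```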

Describing posteriors

Posteriors are visualisable, but they're also just a distribution over a random variable, so we can summarise them. When you print a summary table from a Bayesian model, what you get out are summary statistics for the posterior distribution.

We like to summarise things with measures of central tendency and measures of dispersion, and we can do the same thing here. Model summaries typically give us the mean (central tendency) and a few different measures of dispersion. One is the highest-density interval: the narrowest region that still contains a given amount of probability mass. But the most commonly reported one is the quantile-based interval: Bayesians conventionally use 95%, computed as the region between the 2.5th and 97.5th percentiles, which gives us the central 95% of the distribution.

OK, so we have a measure of central tendency and a measure of dispersion. How do we interpret them? Mean = posterior mean, easy. Dispersion: what does the posterior represent? It represents how probable it is that p takes on each of the different values between 0 and 1 (given the model we are using). Because the posterior is a probability distribution, it "adds up" to 1: all possibilities are contained within this distribution. And when we take the central 95% interval, there's a 95% chance that the actual value of p is within the interval we've chosen. This is called the "95% credible interval", abbreviated as 95% CrI.
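Computed from posterior samples, that summary might look like this (a sketch, using the fact that a flat Beta(1, 1) prior plus 80 heads and 20 tails gives a Beta(81, 21) posterior):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample from the known posterior for the 80/100 example: Beta(81, 21).
samples = rng.beta(81, 21, size=10_000)

post_mean = samples.mean()                                # central tendency
cri_low, cri_high = np.quantile(samples, [0.025, 0.975])  # central 95% CrI

print(f"posterior mean: {post_mean:.3f}")
print(f"95% CrI: [{cri_low:.3f}, {cri_high:.3f}]")
```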

This looks suspiciously similar to the 95% CI of frequentist fame. Can anyone articulate why they're different?

Talking about posterior estimates and making inferences

We might be tempted to make statements like "0.5 is not contained within the 95% CrI, therefore we reject the null hypothesis that this is a fair coin". That's not licensed: we aren't interested in binary decisions like rejecting the null or not, because binary decisions lose a lot of interesting information. Bayesian modelling is more about estimation than hypothesis testing (though you can do hypothesis testing too; it's just an extra step, not built in like in frequentist models).

All this estimate says is that we can be 95% certain that the true value of p is located between [lower, upper]. The narrower this interval is, the more certain we are. The broader the interval, the less certain. If 0.5 is contained in the 95% CrI, then the model thinks it's possible that we have a fair coin.

How models identify posteriors

MCMC. Crater gif.
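A toy Metropolis sampler for the coin model, to make the crater image concrete (a sketch only; real tools use fancier MCMC variants, but the wandering-in-proportion-to-posterior-height logic is the same):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_posterior(p, heads=80, n=100):
    # Log posterior up to a constant: with a flat prior, this is just
    # the Bernoulli log likelihood of 80 heads out of 100 flips.
    if not 0 < p < 1:
        return -np.inf
    return heads * np.log(p) + (n - heads) * np.log(1 - p)

# Metropolis: propose a small random step, accept it with probability
# given by the ratio of posterior densities (always accept uphill moves).
p_current, samples = 0.5, []
for _ in range(20_000):
    p_proposal = p_current + rng.normal(0, 0.05)
    if np.log(rng.uniform()) < log_posterior(p_proposal) - log_posterior(p_current):
        p_current = p_proposal
    samples.append(p_current)

samples = np.array(samples[2_000:])  # discard warm-up
print(samples.mean(), np.quantile(samples, [0.025, 0.975]))
```

The histogram of `samples` approximates the posterior: the sampler spends time at each value of p in proportion to its posterior probability.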

Recap

Recap page: tabular, the analogues to frequentist ideas in Bayes.

elizabethpankratz commented 1 year ago



Graphically, plotting all these probabilities might look like: [beta distrib with a = 81, b = 21, i.e., the flat Beta(1, 1) prior updated with 80 successes and 20 failures]. Pstr.

How do we get there? Imagine observing every trial in the experiment one by one, and letting that update the distribution over possible probabilities of success.

[uniform ex]

Let's say we're not very optimistic about people learning: we expect them to be at chance. Then our prior might be [narrow around 0.5].

[narrow prior ex]

Influence of prior and data. Effect of prior belief on posterior belief.

Mystical.


Let's formalise this.

[model specs]

Likelihood: how the data are distributed. Prior: how the parameters that specify the likelihood are distributed.

(Familiar bc parts of Bayes' Theorem)


Bayesian models as generative models

[graphic]

[simulation code]
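A candidate for the simulation placeholder (assuming the at-chance prior is something like Beta(50, 50)):

```python
import numpy as np

rng = np.random.default_rng(4)

# Generative check for the at-chance model: draw p from a prior
# concentrated around 0.5, then simulate a whole experiment.
n_sims, n_trials = 1000, 100
p_draws = rng.beta(50, 50, size=n_sims)
successes = rng.binomial(n=n_trials, p=p_draws)

# Most simulated experiments land near 50/100 successes, so observing
# something like 80/100 would mean the data have to fight this prior.
print(np.quantile(successes, [0.025, 0.5, 0.975]))
```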


Interpreting posteriors

What parameter values does the model consider plausible? Some conventions: 95% CrI.

BUT: it's about what values are contained, not just whether or not this interval contains the null value.


This was an illustrative example. How does this work computationally, when we're trying to identify posteriors based on many more parameters?

Algorithm: MCMC.


Implications of this way of thinking that differ from what we might be used to from frequentist training: