
Add posterior predictive check (Bayesian p-value) #419

Closed · Gabriel-p closed this 2 months ago

Gabriel-p commented 5 years ago

Posterior predictive checks (PPCs) are:

in simple words, "simulating replicated data under the fitted model and then comparing these to the observed data"

From PyMC3:

Posterior predictive checks (PPCs) are a great way to validate a model. The idea is to generate data from the model using parameters from draws from the posterior.

Elaborating slightly, one can say that PPCs analyze the degree to which data generated from the model deviate from data generated from the true distribution. So often you will want to know if, for example, your posterior distribution is approximating your underlying distribution. The visualization aspect of this model evaluation method is also great for a ‘sense check’ or explaining your model to others and getting criticism.
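For concreteness, here is a minimal sketch of that idea in PyMC3, using a stand-in normal model rather than ASteCA's actual synthetic-cluster model (all names below are illustrative):

```python
import numpy as np
import pymc3 as pm

# Stand-in observed data; in ASteCA this would be the observed cluster.
rng = np.random.default_rng(0)
y_obs = rng.normal(5.0, 2.0, size=200)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)
    trace = pm.sample(1000, tune=1000)
    # One replicated dataset "y" per posterior draw
    ppc = pm.sample_posterior_predictive(trace)

# ppc["y"] has shape (n_draws, len(y_obs)): the replicated data to
# compare against y_obs.
```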

This is apparently related to the Bayesian p-value:

a Bayesian p-value is the comparison of some metric calculated from your observed data with the same metric calculated from your simulated data (being generated with parameters drawn from the posterior distribution)

In this video the relation is briefly explained.

This could be used as a goodness-of-fit test for the final best fit synthetic cluster found:

The p-value can provide a useful diagnostic of goodness of fit

My doubts are how to pick the test statistic (a few generic candidates are sketched below):

where high values indicate unlikely outcomes. The choice of test statistic T is the only degree of freedom and has to be made given the model.

and how to properly interpret the p-value:

This probability is the p-value and if the probability of observing a more extreme test statistic is small we should rightly be suspicious of the assumed model.
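To make the first point concrete, here are a few generic candidate statistics, sketched in Python; which one is appropriate depends on which feature of the model one wants to check, and none of these is ASteCA-specific:

```python
import numpy as np

# Generic candidate test statistics T(y) for a 1-D dataset y;
# each one probes a different aspect of the fitted model.
def T_mean(y):
    return np.mean(y)  # location

def T_std(y):
    return np.std(y)   # spread

def T_max(y):
    return np.max(y)   # tail behaviour / extreme values
```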

One could also use graphical PPCs, but I'm not sure how any of those would apply to N-dimensional data.
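One option for the graphical side is ArviZ's plot_ppc, applied to each observed dimension separately (this continues the PyMC3 sketch above; `idata` is just an illustrative name):

```python
import arviz as az

# Bundle the trace and the replicated draws from the earlier sketch.
idata = az.from_pymc3(trace=trace, posterior_predictive=ppc, model=model)

# Overlays the densities of the replicated datasets on the observed
# one; for N-dimensional data, one such marginal panel per dimension.
az.plot_ppc(idata)
```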


Method (check if this is correct):

  1. Given the posterior for each parameter, generate a random draw for each
  2. Use this random parameter vector to generate new data (rep) via the model (likelihood)
  3. Compare this new data with my observed data (obs) through some test statistic T
  4. Obtain the p-value as: p = P(T_rep > T_obs)
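A minimal NumPy sketch of those four steps, assuming the posterior draws are already available; `simulate` is a hypothetical stand-in for the model that generates a synthetic cluster from a parameter vector:

```python
import numpy as np

def ppc_pvalue(posterior_draws, simulate, T, y_obs, seed=None):
    """Bayesian p-value from posterior predictive replications.

    posterior_draws : iterable of parameter vectors drawn from the
        posterior (step 1).
    simulate : hypothetical function simulate(theta, rng) returning a
        replicated dataset, i.e. the model/likelihood (step 2).
    T : test statistic mapping a dataset to a float (step 3).
    y_obs : the observed dataset.
    """
    rng = np.random.default_rng(seed)
    T_obs = T(y_obs)
    T_rep = np.array([T(simulate(theta, rng)) for theta in posterior_draws])
    # Step 4: p = P(T_rep > T_obs), estimated as the fraction of
    # replications whose statistic exceeds the observed one.
    return float(np.mean(T_rep > T_obs))
```

Under this convention a p-value near 0 or 1 signals tension between the model and the data, while values near 0.5 mean the replicated statistics straddle the observed one.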
Gabriel-p commented 2 months ago

Task left to the user now