flatironinstitute / bayes-kit

Bayesian inference and posterior analysis for Python

Testing against known models: use MCSE instead of absolute tolerance #27

Open roualdes opened 1 year ago

roualdes commented 1 year ago

Consider the scenario of testing a sampler against a model whose posterior expectations we know in closed form (e.g., any Normal model). The Metropolis sampler has tests along the lines of

import numpy as np

model = ...       # model with known posterior mean and variance
metropolis = ...  # Metropolis sampler set up for this model
M = ...           # number of draws

draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
var = draws.var(axis=0, ddof=1)

np.testing.assert_allclose(mean, model.posterior_mean(), atol=0.1)
np.testing.assert_allclose(var, model.posterior_variance(), atol=0.1)

These tests use absolute tolerances. Absolute-tolerance tests can require a large number of iterations, awkwardly chosen tolerance levels, or both.

I think it makes more sense to evaluate a sampler's draws using Monte Carlo standard errors; see mcse_mean(x) $= \mathrm{sd}(x) / \sqrt{\mathrm{ess\_mean}(x)}$ and mcse_std(x) from the stan-dev package posterior. Such Monte Carlo standard error tests would look something like

import numpy as np

model = ...       # model with known posterior mean and std
metropolis = ...  # Metropolis sampler set up for this model
M = ...           # number of draws

draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
se_mean = mcse_mean(draws, axis=0)  # needs implementation
std = draws.std(axis=0, ddof=1)
se_std = mcse_std(draws, axis=0)  # needs implementation

z = 5  # allowed number of Monte Carlo standard errors
assert np.all(np.abs((mean - model.posterior_mean()) / se_mean) < z)
assert np.all(np.abs((std - model.posterior_std()) / se_std) < z)

Such tests more naturally incorporate sampler efficiency into testing and validation, and should remove the need for awkwardly chosen tolerance values.
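To make this concrete, here is a rough sketch of what mcse_mean and mcse_std could look like. Everything below is illustrative, not proposed API: the ess_mean helper is only a placeholder (it returns the nominal number of draws rather than a real ESS estimate), and the mcse_std formula is a crude normal-theory approximation, not necessarily what the posterior package does.

import numpy as np

def ess_mean(draws, axis=0):
    # Placeholder ESS: the nominal number of draws. A real
    # implementation would use an autocorrelation-based estimate.
    return draws.shape[axis]

def mcse_mean(draws, axis=0):
    # MCSE of the mean: sd(x) / sqrt(ess_mean(x)).
    return draws.std(axis=axis, ddof=1) / np.sqrt(ess_mean(draws, axis=axis))

def mcse_std(draws, axis=0):
    # Rough MCSE of the standard deviation via a normal-theory
    # delta-method approximation: sd(x) / sqrt(2 * ess).
    return draws.std(axis=axis, ddof=1) / np.sqrt(2.0 * ess_mean(draws, axis=axis))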

bob-carpenter commented 1 year ago

I agree and that has been the plan all along. We have to keep in mind that we are going to have to set a fudge factor in the Z threshold because we are estimating ESS, which is pretty noisy.

The current tests are nothing but a stopgap so that we'd have at least crude tests for the samplers. We put the cart before the horse by implementing samplers before implementing R-hat and ESS. I have a PR for ESS that passes all tests, but there are extensive change requests beyond what I have time to deal with immediately.

  1. ESS PR
    • request: break IAT (integrated autocorrelation time) into its own function with its own doc and tests; a rough sketch of that split follows this list
    • request: independent tests for IAT (that aren't related to ESS tests; ESS tests might use utilities in IAT tests; simulations need to change basis)
    • request: fix mypy strict
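For context, a minimal sketch of what splitting IAT out of ESS could look like; the function names, single-chain signature, and truncation rule below are illustrative assumptions, not what the PR actually implements:

import numpy as np

def iat(chain):
    # Integrated autocorrelation time for a single chain:
    # 1 + 2 * sum of lag-t autocorrelations, truncated at the
    # first negative estimate (one simple truncation rule; the
    # PR may well use a different one).
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    centered = chain - chain.mean()
    acov = np.correlate(centered, centered, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for t in range(1, n):
        if rho[t] < 0:
            break
        tau += 2.0 * rho[t]
    return tau

def ess(chain):
    # With IAT factored out, ESS is just a thin wrapper: N / IAT.
    return len(chain) / iat(chain)

Factoring it this way would make the independent IAT tests straightforward, since ESS then only needs to be tested as N / IAT.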

I also have an ensemble sampler PR that deals with the mypy issues, but has extensive testing requests that I'm not 100% sure how to tackle:

  1. Ensemble sampler PR
    • fixes mypy strict
    • request: test behavior of changes in API arguments

Any help would be appreciated, as I don't know that I'll have time right away to work on this stuff. It might make sense to do just a mypy patch if the whole package isn't passing at strict. I think it may be a while before the ensemble sampler can be broken into testable pieces. Before doing that, I want to come to some agreement about how strict we are going to be in testing this kind of thing, and then make sure our other samplers are up to that standard as well.