roualdes opened 1 year ago
I agree, and that has been the plan all along. We have to keep in mind that we will need a fudge factor in the Z threshold, because we are estimating ESS, which is pretty noisy.
The current tests are nothing but a stopgap so we'd have at least crude tests for samplers. We put the cart before the horse by implementing samplers before implementing R-hat and ESS. I have a PR for ESS that is passing all tests, but there are extensive change requests beyond what I have time to deal with immediately.
I also have an ensemble sampler PR that deals with the mypy issues, but it has extensive testing requests that I'm not 100% sure how to tackle:
Any help would be appreciated, as I don't know that I'll have time right away to work on this stuff. It might make sense to land just a mypy patch if the whole package isn't passing at strict. I think it may be a while before the ensemble sampler can be broken into testable pieces. Before doing that, I want us to come to some agreement about how strict we are going to be in testing this kind of thing, and then make sure our other samplers are brought up to that standard as well.
Consider the scenario of testing a sampler against a model where we know the expectations (e.g. any Normal model). The Metropolis sampler has tests along the lines of
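The original snippet isn't reproduced here, but a minimal sketch of that style of absolute-tolerance test might look like the following. The `metropolis_sample` helper is hypothetical; to keep the sketch runnable, direct standard-Normal draws stand in for its output.

```python
import numpy as np


def test_metropolis_normal_mean_abs_tol() -> None:
    # Hypothetical stand-in for a Metropolis sampler targeting a standard
    # Normal; direct Normal draws are used so this sketch runs on its own.
    rng = np.random.default_rng(seed=123)
    draws = rng.normal(loc=0.0, scale=1.0, size=10_000)

    # Absolute-tolerance checks: the tolerance (0.05) is hand-chosen and
    # only passes reliably with a large number of iterations.
    assert abs(np.mean(draws) - 0.0) < 0.05
    assert abs(np.std(draws, ddof=1) - 1.0) < 0.05
```

Note how the tolerance has no principled connection to the sampler's efficiency; halving the number of draws would force a different, equally arbitrary value.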
These are tests with absolute tolerances. Absolute tolerance tests can require a large number of iterations, awkwardly chosen tolerance levels, or both.
I think it makes more sense to evaluate a sampler's draws using a Monte Carlo standard error; see mcse_mean(x) $= sd(x) / \sqrt{ess\_mean(x)}$ and mcse_std(x) from the stan-dev package posterior. Such Monte Carlo standard error tests would look something like
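A sketch of such a test, under the assumption of roughly independent draws (a real `ess_mean` would account for autocorrelation, as posterior's does):

```python
import numpy as np


def ess_mean(x: np.ndarray) -> float:
    # Crude ESS stand-in: for independent draws, ESS equals the sample size.
    # A real implementation would estimate ESS from autocorrelations.
    return float(len(x))


def mcse_mean(x: np.ndarray) -> float:
    # Monte Carlo standard error of the mean: sd(x) / sqrt(ess_mean(x)).
    return float(np.std(x, ddof=1) / np.sqrt(ess_mean(x)))


def test_metropolis_normal_mean_mcse() -> None:
    # Stand-in draws targeting a standard Normal (see the note above).
    rng = np.random.default_rng(seed=123)
    draws = rng.normal(loc=0.0, scale=1.0, size=10_000)

    # Z-score the estimation error by its Monte Carlo standard error; the
    # threshold (3) is the fudge factor absorbing noise in the ESS estimate.
    z = (np.mean(draws) - 0.0) / mcse_mean(draws)
    assert abs(z) < 3.0
```

The threshold is now on a standardized scale, so the same value works regardless of how many iterations the test runs.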
Such tests more naturally incorporate sampler efficiency into testing and validation, and should hopefully remove some of the awkwardly chosen tolerance values.