Open fardog opened 10 years ago
What you might consider doing is writing down classes of bugs that you have made, or that you hypothesize someone implementing the sampler might make, and then writing statistical tests that have high statistical power to detect those bugs. Tune each test to an acceptably low spurious failure rate on a true uniform distribution, say one in a thousand. (That false-positive rate is, rather confusingly, called the 'significance level' in the frequentist binary hypothesis testing paradigm.)
(You can, of course, take any old statistical test for a uniform distribution, of which there are approximately umpteen gazillion implemented in applications like dieharder, but it's largely a waste of energy unless the test is actually likely to detect plausible bugs.)
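To make that concrete, here is a rough sketch of one such targeted test, aimed at a single hypothesized bug: modulo bias, where a sampler reduces a random byte with `% n` instead of using rejection sampling. Everything named here (`modulo_bias_z`, `passes`, `buggy_sampler`, `good_sampler`) is made up for illustration and is not taken from this project. It uses a normal approximation to the binomial, which should be fine at these sample sizes but is an assumption worth checking for small ones.

```python
import math
import random
from statistics import NormalDist


def modulo_bias_z(sample, n, trials, source_range=256):
    """z-score for how many draws land in the residues that a
    `value % n` bug would over-represent.

    `sample` is a hypothetical callable returning an integer in [0, n).
    """
    low = source_range % n  # residues 0..low-1 get one extra preimage
    if low == 0:
        raise ValueError("n divides source_range; no modulo bias possible")
    p = low / n  # probability of the low region under a true uniform
    hits = sum(1 for _ in range(trials) if sample(n) < low)
    # Normal approximation to the binomial count of hits.
    return (hits - trials * p) / math.sqrt(trials * p * (1 - p))


def passes(z, alpha=1e-3):
    """Two-sided test at significance level alpha (spurious failure rate)."""
    return abs(z) <= NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical samplers: one with the bug, one without.
rng = random.Random(12345)


def buggy_sampler(n):
    return rng.randrange(256) % n  # modulo bias when n does not divide 256


def good_sampler(n):
    return rng.randrange(n)


z_buggy = modulo_bias_z(buggy_sampler, n=200, trials=20000)
z_good = modulo_bias_z(good_sampler, n=200, trials=20000)
print("buggy sampler passes:", passes(z_buggy))
print("good sampler passes: ", passes(z_good))
```

With n = 200 the biased sampler puts 112/256 of its mass on the low residues where a uniform one puts 56/200, so the test has very high power against this particular bug, while a correct sampler should fail only about one run in a thousand at `alpha=1e-3`.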
There should be a test suite for determining the quality of the random numbers generated. I don't yet know what would be involved here. It should be a separately run test, and not part of the CI tests.