nfoti opened 11 years ago
Would it make sense to just increase the number of samples? Isn't that statistically equivalent?
I have been thinking that we should make the tests more statistical (currently the only one is the univariate Kolmogorov-Smirnov test for checking `rand` vs `cdf`). However I don't think simply repeating the test 20 times makes sense: you will lose power, and anyway 26% of the time 2 or more tests will be outside the 95% interval.
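For concreteness, the 26% figure follows from the count of failures among 20 independent tests being Binomial(20, 0.05); a quick check (sketched in Python with SciPy rather than Julia):

```python
from scipy.stats import binom

# With 20 independent tests at the 5% level, the number of tests
# falling outside the 95% interval is Binomial(20, 0.05).
# Probability that 2 or more do so:
p_two_or_more = 1 - binom.cdf(1, 20, 0.05)
print(round(p_two_or_more, 3))  # ~0.264
```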
Some options are straightforward:

* `median` can be checked easily by counting the number of random samples greater than the theoretical median (which has a Binomial(n, 0.5) distribution).
* `mean` can be checked via a simple 1-sample t-test (assuming finite variance).
* Higher-order moments (`var`, `skewness`, `kurtosis`) are tricky (off the top of my head, I don't know of any obvious tests here), but given the huge sample sizes we're using, a t-test is probably fine as long as the 2n-th moment exists when checking the n-th moment.
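The median and mean checks above could look something like this (a hypothetical sketch in Python/SciPy, with a Normal(2, 3) sampler standing in for the distribution under test):

```python
import numpy as np
from scipy.stats import binomtest, norm, ttest_1samp

# Draw a large sample from the sampler being tested.
rng = np.random.default_rng(1234)
samples = rng.normal(loc=2.0, scale=3.0, size=100_000)

# Median check: the count of samples above the theoretical median is
# Binomial(n, 0.5) under the null, so an exact binomial test applies.
n_above = int(np.sum(samples > norm(loc=2.0, scale=3.0).median()))
median_p = binomtest(n_above, n=samples.size, p=0.5).pvalue

# Mean check: one-sample t-test against the theoretical mean
# (valid assuming finite variance).
mean_p = ttest_1samp(samples, popmean=2.0).pvalue

print(median_p > 0.01, mean_p > 0.01)
```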
Multivariate ones are tricky as well: you could either try to exploit known properties of the distributions, or do the tests element-wise and combine via some sort of multiple-comparisons correction, such as Bonferroni or FDR. We could also do an additional multiple-comparisons check across the whole suite so you only have to look at one number when running the tests.
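The element-wise-plus-Bonferroni idea is only a few lines; `bonferroni_reject` is a hypothetical helper name, sketched here in Python:

```python
import numpy as np

def bonferroni_reject(pvalues, alpha=0.05):
    """Combine element-wise tests via Bonferroni: reject the joint
    null if any individual p-value falls below alpha / m,
    which keeps the family-wise error rate at most alpha."""
    pvalues = np.asarray(pvalues)
    return bool((pvalues < alpha / pvalues.size).any())

# Example: 10 element-wise p-values from a multivariate check.
# The corrected per-test threshold is 0.05 / 10 = 0.005.
print(bonferroni_reject([0.20, 0.004, 0.50, 0.70, 0.31,
                         0.90, 0.12, 0.45, 0.66, 0.08]))  # True (0.004 < 0.005)
```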
There are two use cases for these kinds of tests: initial correctness testing and regression testing.
For the first case, you want to run a lot of samples so that you can be really sure that you're doing it right. You can use truly random seeds and failures should be examined manually to determine if they are flukes or real problems.
For the second case, the tests should be run with a high p-value and a seed that we know succeeds. That way the tests are deterministic and any failures are real regressions, not unlucky seed values. We may want to run the tests with a few known-good seed values or once with more samples and a more stringent significance threshold – I'm not sure. That's probably a matter of striking a balance between being robust to slight computational deviations (from e.g. platform-specific differences of roundoff when a register value is swapped to memory) and being sensitive to real regressions.
So basically randomness tests need two modes: development mode with real random seeds, huge samples and very stringent significance thresholds; and regression mode with a variety of known-good fixed seeds used to generate medium samples with moderate significance thresholds.
I agree that using random seeds during development and known successful seeds for regression testing is necessary.
I agree with Stefan on this. Extensive testing, and then fixing a seed to prevent accidental Travis failures.
I should also add that for initial correctness testing you want some failures – at about the same rate as the p-value you've chosen. For regression testing, on the other hand, you do not want any failures.
It seems like the matrix tests that use RNGs should be pulled into a separate suite of tests, just as Dahua did with the univariate and multivariate tests, which are very extensive and time-consuming.
Some tests are performed with Monte Carlo estimates. For example, in `test/matrix.jl` the `rand` functions for the `Wishart` and `InverseWishart` distributions are tested by generating a lot of samples and checking that the mean is "close". The closeness value should probably be chosen to be two standard errors based on the number of samples, and in this case the Monte Carlo experiment should be repeated 20 times and be allowed to have too large an error once in those 20 trials (as the closeness value is only an approximately 95% interval).
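That two-standard-error check might look like the following (a Python/SciPy stand-in for the Julia tests; `trace_mean_check` is a hypothetical helper, and for simplicity it checks the trace of the sample mean rather than every matrix element):

```python
import numpy as np
from scipy.stats import wishart

def trace_mean_check(df, scale, n=10_000, z=2.0, seed=0):
    """One Monte Carlo trial: is the mean trace of n Wishart samples
    within z standard errors of its theoretical value df * tr(scale)?"""
    d = wishart(df=df, scale=scale)
    traces = np.trace(d.rvs(size=n, random_state=seed), axis1=1, axis2=2)
    se = traces.std(ddof=1) / np.sqrt(n)
    return abs(traces.mean() - df * np.trace(scale)) <= z * se

# z = 2 gives roughly a 95% interval, so repeating the trial 20 times
# should produce about one failure on average; several failures would
# suggest a real bug in the sampler.
failures = sum(not trace_mean_check(5, np.eye(2), seed=s) for s in range(20))
print(failures)
```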