JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.

Testing with Monte Carlo #143

Open · nfoti opened this issue 11 years ago

nfoti commented 11 years ago

Some tests are performed with Monte Carlo estimates. For example, in test/matrix.jl the rand functions for the Wishart and InverseWishart distributions are tested by generating many samples and checking that the sample mean is "close" to the true mean. The closeness tolerance should probably be set to two standard errors based on the number of samples; in that case the Monte Carlo experiment should be repeated, say, 20 times and allowed to exceed the tolerance once in those 20 trials, since two standard errors is only an approximate 95% interval.
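For concreteness, here is a minimal sketch of that proposal, with the tolerance derived from the Monte Carlo standard error; this is illustrative, not the package's actual test code, and the sample size and dimensions are made up:

```julia
using Distributions, Statistics, LinearAlgebra, Test

d = Wishart(5.0, Matrix{Float64}(I, 2, 2))
n = 100_000
samples = [rand(d) for _ in 1:n]

m̂ = mean(samples)                      # element-wise sample mean
v̂ = mean(S -> (S .- m̂) .^ 2, samples)  # element-wise sample variance
se = sqrt.(v̂ ./ n)                      # standard error of each entry

# two standard errors ≈ a 95% interval for each entry individually
@test all(abs.(m̂ .- mean(d)) .<= 2 .* se)
```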

StefanKarpinski commented 11 years ago

Would it make sense to just increase the number of samples? Isn't that statistically equivalent?

simonbyrne commented 11 years ago

I have been thinking that we should make the tests more statistical (currently the only such test is the univariate Kolmogorov-Smirnov test checking rand against cdf). However, I don't think simply repeating the test 20 times makes sense: you will lose power, and in any case 2 or more of the 20 tests will fall outside the 95% interval about 26% of the time.
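That 26% figure can be checked with the package itself:

```julia
using Distributions

# probability of 2 or more failures among 20 independent tests at the 5% level
ccdf(Binomial(20, 0.05), 1)    # ≈ 0.264
```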

Some options are straightforward: the mean, for instance, can be checked with a simple t-test.

Higher-order moments (var, skewness, kurtosis) are trickier (off the top of my head, I don't know of any obvious exact tests), but given the huge sample sizes we're using, a t-test is probably fine as long as the 2n-th moment exists when checking the n-th moment.
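As a sketch of that idea (moment_ztest is a hypothetical helper, not an existing API; with samples this large the t/z distinction is immaterial):

```julia
using Distributions, Statistics

# z-test for the n-th raw moment; the standard error of the sample
# n-th moment is finite precisely when the 2n-th moment exists
function moment_ztest(d::UnivariateDistribution, n::Int, truth::Real;
                      nsamples::Int = 10^6, zcrit::Real = 3.0)
    y = rand(d, nsamples) .^ n          # samples of X^n
    z = (mean(y) - truth) / (std(y) / sqrt(nsamples))
    abs(z) <= zcrit
end

moment_ztest(Normal(0, 1), 2, 1.0)      # E[X^2] = 1 for a standard normal
```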

Multivariate distributions are tricky as well: you could either exploit known properties of the distribution, or run the tests element-wise and combine them via some multiple-comparisons correction such as Bonferroni or FDR. We could also apply an additional multiple-comparisons check across the whole suite, so that only one number has to be inspected when running the tests.
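A hedged sketch of the element-wise + Bonferroni route for a multivariate mean (bonferroni_mean_test is illustrative, not package code):

```julia
using Distributions, Statistics, LinearAlgebra

function bonferroni_mean_test(d::MultivariateDistribution;
                              nsamples::Int = 10^5, α::Real = 0.05)
    X = rand(d, nsamples)                 # each column is one draw
    m̂ = vec(mean(X; dims = 2))
    se = vec(std(X; dims = 2)) ./ sqrt(nsamples)
    z = (m̂ .- mean(d)) ./ se              # per-coordinate z statistics
    p = 2 .* ccdf.(Normal(), abs.(z))     # two-sided p-values
    all(p .>= α / length(p))              # Bonferroni-corrected threshold
end

bonferroni_mean_test(MvNormal(zeros(3), Matrix{Float64}(I, 3, 3)))
```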

StefanKarpinski commented 11 years ago

There are two use cases for these kinds of tests:

  1. To verify that you implemented something correctly in the first place.
  2. To catch regressions where some change causes something that used to work to stop working correctly.

For the first case, you want to run a lot of samples so that you can be really sure you're doing it right. You can use truly random seeds, and failures should be examined manually to determine whether they are flukes or real problems.

For the second case, the tests should be run with a seed that we know yields a comfortably high p-value. That way the tests are deterministic and any failures are real regressions, not unlucky seed values. We may want to run the tests with a few known-good seed values, or once with more samples and a more stringent significance threshold – I'm not sure. That's probably a matter of striking a balance between being robust to slight computational deviations (e.g. platform-specific differences in roundoff when a register value is spilled to memory) and being sensitive to real regressions.

So basically, randomness tests need two modes: a development mode with truly random seeds, huge samples, and very stringent significance thresholds; and a regression mode where a variety of known-good fixed seeds generate medium-sized samples tested at moderate significance thresholds.
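A minimal sketch of that two-mode setup, assuming an environment variable as the switch; the variable name, seeds, and mean_check helper are all made up for illustration:

```julia
using Distributions, Random, Statistics, Test

# illustrative check: is the sample mean within zcrit standard errors?
mean_check(d, n, zcrit) =
    abs(mean(rand(d, n)) - mean(d)) <= zcrit * std(d) / sqrt(n)

if get(ENV, "DISTRIBUTIONS_TEST_MODE", "regression") == "dev"
    Random.seed!()                    # development: fresh system entropy
    @test mean_check(Gamma(2.0, 3.0), 10^7, 4.0)  # huge sample, tiny α
else
    for seed in (1234, 5678, 91011)   # regression: fixed, assumed known-good
        Random.seed!(seed)
        @test mean_check(Gamma(2.0, 3.0), 10^5, 3.0)
    end
end
```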

nfoti commented 11 years ago

I agree that using random seeds during development and known-good seeds for regression testing is necessary.

lindahua commented 11 years ago

I agree with Stefan on this: test extensively, then fix a seed to prevent accidental Travis failures.

StefanKarpinski commented 11 years ago

I should also add that for initial correctness testing you want some failures – at about the same rate as the significance level you've chosen. For regression testing, on the other hand, you do not want any failures.

johnmyleswhite commented 11 years ago

It seems like the matrix tests that use RNGs should be pulled into a separate suite, just as Dahua did with the univariate and multivariate tests, which were very extensive and time-consuming.