HPCE / hpce-2017-cw6


p-value difference tolerance #25

Closed BaronKhan closed 6 years ago

BaronKhan commented 6 years ago

I was just wondering what the tolerance is for how different our p-values can be for a test compared to the original library implementation.

For instance, if a p-value for a test differs by 0.01 from the original, is it still considered correct?

giuliojiang commented 6 years ago

On a similar topic: can independent tests in a battery use copies of the generator? i.e. can two different tests use copies of a single generator, so that they both look at the same stream of randoms, rather than test 1 getting a stream of randoms from the generator and test 2 then analysing the stream from where test 1 finished?

To illustrate: if the random numbers are

XXXXYYYY

Reference test 1 analyzes XXXX
Reference test 2 analyzes YYYY
Can we transform it so that both test 1 and test 2 use XXXX?

m8pple commented 6 years ago

Sorry, I missed this one.

For @BaronKhan : In a statistical sense p-values are a bit odd, as both 0.1 and 0.9 could be correct for the same test and the same RNG, because for some tests it is legitimate to do things in a different order without affecting the hypothesis-testing power of the test. However, if the test originally gave a p-value of 0.1, but after modification gave 1e-6, then you would probably consider the test broken: previously there was a 1 in 10 chance of seeing a statistic that non-random, which is vastly different from a 1 in a million chance of seeing something that bad. Similarly, if the original p-value was 1e-7, but after modification it jumps to 1e-2, then again you might question whether the statistical power of the test has been broken.

So in answer to the question:

> For instance, if a p-value for a test differs by 0.01 from the original, is it still considered correct?

Then the answer would be yes, as long as the original p-value wasn't itself 0.01 and the new p-value 0. So I would modify it to something more like:

  • p : the original p-value
  • p' : the new p-value
  • delta = |p - p'| / min(p, 1-p)
  • If delta < 0.1 then you are almost certainly still calculating the right statistic. The min operation is there to handle p-values which are very close to 0, as well as those that are very close to 1.

e.g. for p=0.4001, p'=0.4002, you would get delta≈2.5e-4, which would be fine. But for p=0.01, p'=0.005, you would get delta=0.5, so I would be inclined to run the test again with a different seed; similarly, for p=0.999 and p'=0.9999 you get delta=0.9, so you might want to double-check by running again with a different RNG seed.
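For concreteness, a minimal C++ sketch of that check (the function name `pValueDelta` is made up for illustration, and the 0.1 threshold is just the rule of thumb above, not anything the marking uses):

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>

// Relative change between an original p-value and a new one, scaled by how
// close the original p-value is to 0 or 1 (the min term in the bullets above).
double pValueDelta(double p, double pNew)
{
    return std::fabs(p - pNew) / std::min(p, 1.0 - p);
}

int main()
{
    // The three worked examples from the comment above.
    std::printf("%g\n", pValueDelta(0.4001, 0.4002)); // ~2.5e-4 : fine
    std::printf("%g\n", pValueDelta(0.01,   0.005));  // 0.5     : re-run with another seed
    std::printf("%g\n", pValueDelta(0.999,  0.9999)); // 0.9     : worth double-checking too
}
```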

Note: this is not how I assess it; it is more to explain how you might interpret the statistics. It is mainly restating explicitly things that were implicit in your previous statistics modules regarding hypothesis testing.

Ultimately, what should happen from a statistical point of view is that for a given random number generator, starting at any seed, the modified tests should come to the same statistical conclusions as the originals.

Doing deep transformations on the tests themselves requires more thought (though it can have a lot of benefit), so you don't want to be doing it on more than one or two tests that really matter.

Regarding @giuliojiang's question : yes, that is fine.
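In case it helps, a tiny sketch of what "copies of the generator" looks like in practice, using std::mt19937 and a made-up toy statistic in place of the real battery tests (all names here are purely illustrative):

```c++
#include <cstddef>
#include <cstdio>
#include <random>

// A made-up toy "test": just the mean of n uniform variates. The real tests
// compute proper statistics; this only shows how the generator is passed.
double toyTest(std::mt19937 rng, std::size_t n) // note: generator taken by value
{
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++)
        sum += u(rng);
    return sum / n;
}

int main()
{
    std::mt19937 master(42);

    // Because the generator is passed by value, each call gets a copy of the
    // same state, so both "tests" consume the identical stream XXXX rather
    // than the second one carrying on from where the first one stopped.
    double a = toyTest(master, 100000);
    double b = toyTest(master, 100000);
    std::printf("%f %f\n", a, b); // identical outputs
}
```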

malharjajoo commented 6 years ago

In response to the answer to @giuliojiang 's question: doesn't this mean that all tests will only test a single contiguous output sequence of the RNG? What about the remaining sequence (would we not ideally want to test over a larger/longer RNG output)?

m8pple commented 6 years ago

So this is due to the intrinsically statistical nature of the problem. Putting it in signal processing terms: if I have an infinitely long sequence of Gaussian random samples, then I could take a window of n samples anywhere in the stream and calculate the local spectrum. If I slide that window up and down the stream, the spectrum will slowly change as some samples enter the window and other samples leave it. If I jump that window around to completely different areas, then the local spectrum might change a lot; however, wherever we put that window we would expect the spectrum to be flat, apart from minor local variations.

So we can rely on the fact that noise should be wide-sense stationary (remember 2nd year?). So it doesn't matter whether we look at $x_1...x_n$ or $x_2...x_{n+1}$: if it is actually random, then every part would be expected to be as random as every other part.

Once we take that to testing, assume we have two tests, $f_1$ and $f_2$. Each of these looks at a different aspect of randomness, so for example one looks at covariance, and the other looks at hamming weight. Given a particular time-budget, we can test $n$ random samples, so we could choose to do either of the following (sketched in code at the end of this comment):

  • give $f_1$ and $f_2$ each their own stretch of the stream, as the reference battery does; or
  • give $f_1$ and $f_2$ copies of the generator, so that they both analyse the same stretch.

If the wide-sense stationary property applies, and the tests are largely orthogonal in what they test for, then there should be no real difference between the two.
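A sketch of those two options, again with std::mt19937 standing in for the framework's generator and with the actual tests $f_1$/$f_2$ left out (only the sample streams are shown):

```c++
#include <cstddef>
#include <random>
#include <vector>

// Draw n uniform samples from the given generator (advancing its state).
std::vector<double> draw(std::mt19937 &rng, std::size_t n)
{
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<double> xs(n);
    for (auto &x : xs)
        x = u(rng);
    return xs;
}

int main()
{
    const std::size_t n = 1000000;

    // Option 1: each test gets its own stretch of the stream.
    {
        std::mt19937 rng(1);
        auto forF1 = draw(rng, n); // f1 would analyse x_1 .. x_n
        auto forF2 = draw(rng, n); // f2 would analyse x_{n+1} .. x_{2n}
        (void)forF1; (void)forF2;
    }

    // Option 2: both tests share the same stretch via a copied generator.
    {
        std::mt19937 rng(1);
        std::mt19937 copy = rng;    // identical state
        auto forF1 = draw(rng, n);  // f1 would analyse x_1 .. x_n
        auto forF2 = draw(copy, n); // f2 would analyse the same x_1 .. x_n
        (void)forF1; (void)forF2;
    }
    // If the stream really is wide-sense stationary, and f1 and f2 probe
    // different properties, neither schedule should change the conclusions.
}
```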