On a similar topic, can independent tests in a battery use copies of the generator? I.e. can two different tests use copies of a single generator, so that they are looking at the same stream of random numbers, instead of test 1 getting a stream of randoms from the generator and test 2 then analyzing the stream from where test 1 finished?
To illustrate: if the random numbers are
XXXXYYYY
Reference test 1 analyzes XXXX
Reference test 2 analyzes YYYY
Can we transform it so that both test 1 and test 2 use XXXX?
Sorry, I missed this one.
For @BaronKhan: In a statistical sense p-values are a bit odd, as both 0.1 and 0.9 could be correct for the same test and the same RNG, because for some tests it is legitimate to do things in a different order without affecting the hypothesis-testing power of the test. However, if the test originally gave a p-value of 0.1, but after modification gave 1e-6, then you would probably consider the test broken: previously there was a 1 in 10 chance of seeing a statistic that was that non-random, which is vastly different from a 1 in a million chance of seeing something that bad. Similarly, if the p-value was originally 1e-7, but after modification it jumps to 1e-2, then again you might question whether the statistical power of the test has been broken.
So in answer to the question:
> For instance, if a p value for a test differs by 0.01 compared to the original, is it still considered correct?

Yes, as long as the original p-value wasn't 0.01 and the new p-value 0. So I would modify it to something more like:
- p : the original p-value
- p' : the new p-value
- delta = |p - p'| / min(p, 1-p)
- If delta < 0.1 then you are almost certainly still calculating the right statistic. The min operation is to consider p-values which are very close to 0, and also those that are very close to 1.
e.g. for p=0.4001, p'=0.4002, you would get delta=2.5e-4, which would be fine. But for p=0.01, p'=0.005, you would get delta=0.5, so I would be inclined to run the test again with a different seed; similarly, for p=0.999 and p'=0.9999 you get delta=0.9, so you might want to double-check by running again with a different RNG seed.
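To make the arithmetic concrete, here is a throwaway Python snippet (purely illustrative, not part of any library) that reproduces the three numbers above:

```python
def p_value_delta(p, p_new):
    # Relative shift in the p-value; the min(p, 1-p) term handles
    # p-values that are very close to 0 or very close to 1.
    return abs(p - p_new) / min(p, 1 - p)

print(p_value_delta(0.4001, 0.4002))   # ~2.5e-4 -> almost certainly still fine
print(p_value_delta(0.01, 0.005))      # 0.5     -> rerun with a different seed
print(p_value_delta(0.999, 0.9999))    # 0.9     -> double-check with another seed
```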
Note: this is not how I assess it; it is more to explain how you might interpret the statistics. It mainly restates explicitly things that were implicit in your previous statistics modules regarding hypothesis testing.
Ultimately, what should happen from a statistical point of view is that for a given random number generator, starting at any seed, the original and the modified test should produce p-values drawn from the same distribution.
Doing deep transformations on the tests themselves requires more thought (though it can have a lot of benefit), so you don't want to be doing it to more than one or two tests that really matter.
Regarding @giuliojiang's question: yes, that is fine.
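To make that concrete, here is a minimal Python sketch (just an illustration of the idea, using `random.Random.getstate`/`setstate` to stand in for whatever copy/clone mechanism the actual generator wrapper exposes) showing two tests seeing the same stream rather than consecutive chunks of it:

```python
import random

gen = random.Random(12345)

state = gen.getstate()                         # remember where the stream starts
xxxx = [gen.random() for _ in range(4)]        # test 1 analyzes "XXXX"
yyyy = [gen.random() for _ in range(4)]        # reference behaviour: test 2 gets "YYYY"

gen.setstate(state)                            # rewind: equivalent to handing test 2 a copy
xxxx_again = [gen.random() for _ in range(4)]  # test 2 now sees the same "XXXX"

assert xxxx_again == xxxx
```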
In response to the answer to @giuliojiang's question: doesn't this mean that all tests will only test a single contiguous output sequence of the RNG? What about the remaining sequence (would we not ideally want to test over a larger/longer RNG output)?
So this is due to the intrinsically statistical nature of the problem. Putting it in signal-processing terms, if I have an infinitely long sequence of Gaussian random samples, then I could take a window of $n$ samples anywhere in the stream and calculate the local spectrum. If I slide that window up and down the stream, the spectrum will slowly change, as some samples enter the window and other samples leave it. If I jump that window around to completely different areas, then the local spectrum might change a lot; however, wherever we put that window we would expect the spectrum to be flat, with minor deviations due to local variation.
So we can rely on the fact that noise should be wide-sense stationary (remember 2nd year?). It doesn't matter whether we look at $x_1...x_n$ or $x_2...x_{n+1}$: if the sequence is actually random, then every part would be expected to be as random as every other part.
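As a small illustration of that stationarity argument (a numpy sketch, not taken from the battery): wherever we place the window on a stream of white Gaussian noise, the estimated spectrum averages out to the same flat level.

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.standard_normal(1 << 16)                 # a long run of Gaussian samples

def local_spectrum(window):
    # Periodogram estimate of the local spectrum: |FFT|^2 / n.
    return np.abs(np.fft.rfft(window)) ** 2 / len(window)

n = 4096
spec_here = local_spectrum(stream[:n])                # window at the start of the stream
spec_there = local_spectrum(stream[50000:50000 + n])  # window somewhere else entirely

# Individual bins fluctuate, but the average level is roughly the same
# (about 1.0, the sample variance) wherever the window is placed.
print(spec_here.mean(), spec_there.mean())
```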
Once we take that to testing, assume we have two tests, $f_1$ and $f_2$. Each of these looks at a different aspect of randomness, so for example one looks at covariance and the other looks at Hamming weight. Given a particular time-budget, each test can look at $n$ random samples, so we could choose either:
- $f_1$ analyzes $x_1...x_n$ and $f_2$ analyzes $x_{n+1}...x_{2n}$ (each test gets its own part of the stream, as in the reference battery); or
- $f_1$ and $f_2$ both analyze the same samples $x_1...x_n$.
If the wide-sense stationary property applies, and the tests are largely orthogonal in what they test for, then there should be no real difference between the two.
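For instance (a rough Python sketch with two toy tests standing in for $f_1$ and $f_2$, a mean test and a runs test, neither taken from the actual battery): whether each test takes its own half of the stream or both share the same half, a good generator should give unremarkable p-values either way.

```python
import math
import random

def p_value_mean(xs):
    # f1: two-sided p-value that the mean of U(0,1) draws is 0.5 (normal approximation).
    n = len(xs)
    z = (sum(xs) / n - 0.5) / math.sqrt(1.0 / (12.0 * n))
    return math.erfc(abs(z) / math.sqrt(2))

def p_value_runs(xs):
    # f2: crude runs-above/below-0.5 test (normal approximation).
    signs = [x > 0.5 for x in xs]
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n1 = sum(signs)
    n2 = len(signs) - n1
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (mu - 1.0) * (mu - 2.0) / (n1 + n2 - 1.0)
    z = (runs - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

gen = random.Random(42)
stream = [gen.random() for _ in range(100_000)]
half = len(stream) // 2

# Option 1: each test analyzes its own part of the stream (XXXX / YYYY).
p1_a, p2_a = p_value_mean(stream[:half]), p_value_runs(stream[half:])
# Option 2: both tests analyze the same part of the stream (XXXX / XXXX).
p1_b, p2_b = p_value_mean(stream[:half]), p_value_runs(stream[:half])

print(p1_a, p2_a)   # all four p-values should look unremarkable for a good RNG
print(p1_b, p2_b)
```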
I was just wondering what the tolerance is for how different our p values can be for a test compared to the original library implementation.
For instance, if a p value for a test differs by 0.01 compared to the original, is it still considered correct?