HPCE / hpce-2017-cw6


Failure of Test if p-value is high ( ~1 ) #20

Closed malharjajoo closed 4 years ago

malharjajoo commented 6 years ago

Hi,

If I understand the basics of hypothesis testing correctly, the tests assume that an RNG produces random output (the null hypothesis H0), and if a very small p-value (close to 0) is obtained, then the hypothesis is rejected and the RNG fails the test.

I am not certain why a high p-value (close to 1) also leads to rejection of the RNG.

On a related note, bin/stress_std 1000 causes one test (AutoCor) to fail, with a p-value of 0.9993. Here is the console output from running the above command:

========= Summary results of Rabbit =========

 Version:          TestU01 1.2.3
 Generator:        mt19937_1
 Number of bits:   100000000
 Number of statistics:  40
 Total CPU time:   00:00:44.01
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-015):

       Test                          p-value
 ----------------------------------------------
 19  AutoCor                         0.9993
 ----------------------------------------------
 All other tests were passed
m8pple commented 6 years ago

p-values very close to 1.0 are as suspicious as ones very close to 0.0, because we can consider it a two-tailed test:

In the case of auto-correlation, one possibility is that the auto-correlation is too high, which might correspond to the tail at the 0 end. The other possibility is that the auto-correlation is too low - in a random sequence there is always some amount of auto-correlation just by chance, so if there is none, then we should be suspicious as well.

This is a slightly simplistic interpretation, but it reflects some of the stranger aspects of randomness, whereby a stream shouldn't look too non-random, nor should it look too perfectly random.

The p-value of 0.9993 is a little troubling, but not too worrying by itself. There are 40 tests, so with 40 uniform random numbers we would expect one of them to be relatively high, and another to be relatively low. If you ran the tests a few times and always saw high p-values for that test, then it might start to become an issue.
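
As a quick sanity check on that intuition, here is a minimal standalone sketch (nothing to do with TestU01 itself, just 40 simulated uniform p-values per run) showing that under the null hypothesis the largest of 40 p-values typically sits around 40/41 ≈ 0.976, and exceeds 0.999 in roughly 4% of runs:

```cpp
// Illustrative sketch only: simulate the maximum of 40 independent U(0,1)
// p-values, to see how often a value as extreme as 0.9993 appears purely
// by chance when the null hypothesis is true.
#include <algorithm>
#include <cstdio>
#include <random>

int main()
{
    std::mt19937 rng(12345); // arbitrary seed, just for illustration
    std::uniform_real_distribution<double> u(0.0, 1.0);

    const int trials = 100000;  // number of simulated "Rabbit" runs
    const int tests  = 40;      // statistics per run, as in the output above
    int atLeastOneExtreme = 0;
    double sumMax = 0.0;

    for (int t = 0; t < trials; ++t) {
        double maxP = 0.0;
        for (int i = 0; i < tests; ++i) {
            maxP = std::max(maxP, u(rng));
        }
        sumMax += maxP;
        if (maxP > 0.9990) {
            ++atLeastOneExtreme;
        }
    }

    std::printf("mean of max p-value over %d tests : %.4f\n", tests, sumMax / trials);
    std::printf("fraction of runs with max > 0.999 : %.4f\n", double(atLeastOneExtreme) / trials);
    return 0;
}
```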

davidpasztor commented 6 years ago

I can reproduce the same issue. However, it seems to be a weird edge case around Number of bits: 100000000.

========= Summary results of Rabbit =========

 Version:          TestU01 1.2.3
 Generator:        mt19937_1
 Number of bits:   100000000
 Number of statistics:  39
 Total CPU time:   00:01:08.92
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 19  AutoCor                         0.9993
 ----------------------------------------------
 All other tests were passed

If I run the test with Number of bits: 101000000, all tests pass.

========= Summary results of Rabbit =========

 Version:          TestU01 1.2.3
 Generator:        mt19937_1
 Number of bits:   101000000
 Number of statistics:  39
 Total CPU time:   00:01:04.35

 All tests were passed

The tests also pass for a smaller bit count, Number of bits: 99000000.

========= Summary results of Rabbit =========

 Version:          TestU01 1.2.3
 Generator:        mt19937_1
 Number of bits:   99000000
 Number of statistics:  39
 Total CPU time:   00:01:03.67

 All tests were passed

@m8pple after looking at the above results, I am less sure than before about how you will evaluate the correctness of our implementation. Does it not matter if some of the test cases fail, as long as the overall distribution of the p-values is uniform? Does this mean that each test will be run a large enough number of times to enable you to decide whether the distribution is U(0,1) or not?

m8pple commented 6 years ago

Does it not matter if some of the test cases fail, as long as the overall distribution of the p-values is uniform?

Yes, this is correct. Indeed, this is something people often misunderstand about hypothesis testing in general: if you apply $n$ different tests, then you should actually expect to see some fail at the $1/n$ significance level - and if you don't, that is itself somewhat suspicious.
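
For example, with $n$ independent tests each examined at the $1/n$ significance level, the chance that none of them fails is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$, so more often than not at least one nominal "failure" turns up purely by chance.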

This is drifting off-topic a little, but it's interesting in an EE and more general scientific sense: there is a strong tendency for people (including engineers and scientists) not to realise how many hypothesis tests they are actually performing. For example, in this scenario, it is likely that during the search use-case people will perform 10000 or more independent hypothesis tests, so we should expect that quite a few of them will "fail" at the 0.999 level. For long time budgets and very efficient implementations there might be up to 100000 p-values generated, so we would need to revise our expectations and expect quite a few values even at the 0.9999 level.
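
To put rough numbers on that (assuming the tests are independent): the expected number of p-values above $0.9999$ among $k$ of them is $k \times 0.0001$, so about $1$ for $k = 10000$ and about $10$ for $k = 100000$ - a handful of such values is the normal case, not a warning sign.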

I've seen this happen in multiple contexts - a few years ago students were asking about grade distributions across 4th year modules, so as an experiment I did some z-tests for difference in means across the modules and showed the overall p-values to some students. Almost all the students concluded that the ones with a big difference in mean and a high/low p-value were significant (implying those modules were easier/harder), but only 1 or 2 realised that if you do that many hypothesis tests (about 40 modules) you should expect those deviations by chance. There was a similar fallacy with historical module averages, where students interpreted four years of slightly above-mean averages as meaningful, while ignoring the fact that across 40 modules you would expect some modules to be consistently above the mean, some modules to be consistently below the mean, and most of them to be bouncing about above and below (which is what the data looks like). There are lots of other confounding factors like moderation and population size, but the kinds of summary statistics that people think are most useful (usually means and historical means) are often quite misleading when you are looking at lots of them.

This fallacy is also very common in the scientific community, where it's often called p-hacking or fishing. People take one set of data and keep running different hypothesis tests on it until one of them gives a "significant" result. But it is to be expected that if you run 100 different tests on the same data, eventually one of them produces something at the 0.05 level, which used to be enough to publish. Biologists seem to be particularly prone to this, so there has been a big shift away from it in that field. However, I've had to reject papers that I've reviewed where engineers have done the same thing - nothing to do with random numbers, it's just that they didn't understand that their experimental setup was almost guaranteed to produce a "significant" p-value.

Does this mean that each test will be run a large enough times to enable you to decide whether the distribution is U(0,1) or not?

Yes :)
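
To make that concrete, here is a rough sketch (my own illustration, not the actual marking script) of the kind of check involved: collect the p-values from repeated runs of a single test and measure how far their empirical distribution is from U(0,1), for example with a one-sample Kolmogorov-Smirnov distance.

```cpp
// Illustrative sketch: given p-values collected from many repetitions of one
// statistic, compute the one-sample Kolmogorov-Smirnov distance against the
// U(0,1) CDF. A large distance suggests the p-values are not uniform.
#include <algorithm>
#include <cstdio>
#include <vector>

double ksDistanceAgainstUniform(std::vector<double> pvals)
{
    std::sort(pvals.begin(), pvals.end());
    const double n = double(pvals.size());
    double d = 0.0;
    for (size_t i = 0; i < pvals.size(); ++i) {
        // Empirical CDF steps from i/n to (i+1)/n at pvals[i];
        // the U(0,1) CDF at pvals[i] is simply pvals[i].
        d = std::max(d, std::max(pvals[i] - i / n, (i + 1) / n - pvals[i]));
    }
    return d;
}

int main()
{
    // Example: p-values that cluster near 1.0 give a large KS distance.
    std::vector<double> suspicious = {0.97, 0.98, 0.99, 0.995, 0.999};
    std::vector<double> plausible  = {0.12, 0.35, 0.48, 0.71, 0.93};
    std::printf("KS distance (suspicious): %.3f\n", ksDistanceAgainstUniform(suspicious));
    std::printf("KS distance (plausible) : %.3f\n", ksDistanceAgainstUniform(plausible));
    return 0;
}
```

In practice the distance would be compared against the standard KS critical value for the number of repetitions, but the point is simply that p-values clustering near 0 or 1 stand out immediately.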

giuliojiang commented 6 years ago

Following up on:

Does this mean that each test will be run a large enough times to enable you to
decide whether the distribution is U(0,1) or not?

The generator in the workload seems to be deterministic given a specific n. Does this mean you will run stress multiple times with different time budgets? My implementation is not deterministic in choosing n for a given t, but I imagine some implementations might always choose the same n for a fixed t.

m8pple commented 6 years ago

The general principle is that if something looks suspicious (passes when it probably shouldn't, or fails when it probably shouldn't), then I replicate the same experiment under the same conditions but with a different seed (which is something I can force).

As you say, some people will choose different values of n for the same t, so the tests won't have exactly the same statistical power. However, for similar values of n you would tend to expect similar pass or fail behaviour, plus I'll usually be choosing generators where I have a very good idea about whether they should be passing or failing for a particular n.
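
As a purely illustrative version of that idea (nothing to do with TestU01's actual AutoCor statistic), the sketch below reseeds std::mt19937 a few times and recomputes a simple lag-1 autocorrelation estimate each time; a genuine autocorrelation defect would stay extreme across seeds, whereas a chance "failure" would wander back into an unremarkable range.

```cpp
// Illustrative only: re-run a simple lag-1 autocorrelation estimate of
// mt19937 output under several seeds. A genuine defect would show up
// consistently across seeds; a chance "failure" would not.
#include <cstdio>
#include <random>
#include <vector>

double lag1Autocorrelation(const std::vector<double>& x)
{
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= x.size();

    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        den += (x[i] - mean) * (x[i] - mean);
        if (i + 1 < x.size()) {
            num += (x[i] - mean) * (x[i + 1] - mean);
        }
    }
    return num / den;
}

int main()
{
    const size_t n = 1000000; // sample size per run, for roughly comparable power
    for (unsigned seed : {1u, 2u, 3u, 4u, 5u}) {
        std::mt19937 rng(seed);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<double> sample(n);
        for (double& v : sample) v = u(rng);
        std::printf("seed %u : lag-1 autocorrelation = %+.6f\n",
                    seed, lag1Autocorrelation(sample));
    }
    return 0;
}
```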

Ultimately, there is a human at the end of things, so the final recourse is that in the very small number of cases where it's not clear whether a test is correct, I'll just look at the code.