X-DataInitiative / tick

Module for statistical learning, with a particular emphasis on time-dependent modelling
https://x-datainitiative.github.io/tick/
BSD 3-Clause "New" or "Revised" License

Test of random sampling #499

Closed claudio-ICL closed 1 year ago

claudio-ICL commented 1 year ago

This PR is about the tests of sampling functions from lib/cpp/random.

Three issues are addressed:

PhilipDeegan commented 1 year ago

@Mbompr some of the random tests have been failing since we updated things, for reasons unknown to me. Please have a look and tell us what you think.

claudio-ICL commented 1 year ago

Hi @Mbompr, could you please clarify how the tests on the random number generators are meant to work?

My understanding is that we should do the following. Assume we want to test that the samples in the vector x are drawn from the exponential distribution with unit rate. The null hypothesis is that the samples in x are indeed drawn from Expon(1). We then use the Kolmogorov-Smirnov test to compare the empirical cdf of x with the theoretical cdf. The test returns a p-value. If x is indeed drawn from Expon(1), the p-value should indicate that the null hypothesis cannot be rejected: the higher the p-value, the less evidence the sample gives against the null hypothesis.

As a benchmark, one could do the following:

>>> from scipy import stats
>>> stats.kstest('expon', 'expon')
KstestResult(statistic=0.17823022417117995, pvalue=0.49370343902508884)

Or:

>>> rvs = stats.expon.rvs(size=10000)
>>> stats.kstest(rvs, 'expon')
KstestResult(statistic=0.005253711831793573, pvalue=0.9439901815922778)

Therefore, we should assert that the p-value calculated on our sample x is greater than some threshold t.
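To make the proposal concrete, here is a minimal sketch of such a check written with the Python standard library only, so that the computation behind `stats.kstest` is visible. The names `expon_cdf`, `ks_statistic`, `ks_pvalue` and `ks_check` are hypothetical and are not part of tick or of this PR; the p-value uses the asymptotic Kolmogorov distribution, which is an assumption that is reasonable only for large samples.

```python
import math
import random


def expon_cdf(x, rate=1.0):
    """Theoretical CDF of the exponential distribution Expon(rate)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (large n only)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is 1 to many digits.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_check(samples, cdf, threshold=0.01):
    """Return (statistic, p-value, passed), where passed means p > threshold."""
    d = ks_statistic(samples, cdf)
    p = ks_pvalue(d, len(samples))
    return d, p, p > threshold


rng = random.Random(0)  # fixed seed so the run is reproducible
x = [rng.expovariate(1.0) for _ in range(10_000)]
statistic, pvalue, passed = ks_check(x, expon_cdf, threshold=0.01)
print(f"statistic={statistic:.4f} pvalue={pvalue:.4f} passed={passed}")
```

Note that with a threshold t, a correct generator still fails the check with probability about t on a fresh random sample, so a test suite would probably want either a fixed seed or a small threshold.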

Mbompr commented 1 year ago

Sorry, that is a bit old, and I am not very proud of these tests, which simply check that the C++ random number generator is working as expected... The tiny extra robustness they add to the project does not compensate for the extra complexity they bring... From what I remember of statistical tests, your remarks look correct. But we can just as well skip them if that makes things easier.

claudio-ICL commented 1 year ago

Hi @Mbompr, it is good to have those tests. They were, however, a bit confusing and/or had overly stringent requirements.

I suggest that we modify the statistical tests based on Kolmogorov-Smirnov and on Chi-square as in this PR. In addition, I wrote the script tick.random.tests.qqplots, which helps visualise how good the sampling from the uniform, Gaussian, exponential, and Poisson distributions is. I think that overall the sampling is good.
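For readers unfamiliar with Q-Q plots, the idea can be sketched as follows for the exponential case. This is not the content of tick.random.tests.qqplots; the function `expon_qq_points` is a hypothetical helper, and the plotting positions (i + 0.5) / n are one common convention among several. If the sample really follows Expon(rate), the pairs should lie close to the line y = x.

```python
import math
import random


def expon_qq_points(samples, rate=1.0):
    """Pair theoretical Expon(rate) quantiles with the sorted sample values."""
    xs = sorted(samples)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        p = (i + 0.5) / n               # plotting position of the i-th order statistic
        q = -math.log(1.0 - p) / rate   # inverse CDF of Expon(rate) at p
        points.append((q, x))
    return points


rng = random.Random(0)  # fixed seed so the run is reproducible
sample = [rng.expovariate(1.0) for _ in range(10_000)]
points = expon_qq_points(sample)

# Print a few (theoretical, empirical) pairs; a plotting library would
# normally draw all of them against the diagonal.
for idx in (1000, 5000, 9000):
    q, x = points[idx]
    print(f"theoretical={q:.3f} empirical={x:.3f}")
```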