erdogant / distfit

distfit is a Python library for probability density fitting.
https://erdogant.github.io/distfit

KS-test in fitdist #18

Closed marcellobullo closed 1 year ago

marcellobullo commented 2 years ago

Hello everyone,

I noticed in the code erdogant/distfit/distfit.py that whenever you use the KS statistical test (stats=ks), you call scipy.stats.ks_2samp to test your data against the distribution you estimated through MLE (maximum likelihood estimation). Is that true? If so, this is wrong, because then the KS statistic depends on your data and the test is no longer valid. In that case, I would recommend having a look at parametric/non-parametric bootstrapping to solve the issue. This reference could be useful: https://ui.adsabs.harvard.edu/abs/2006ASPC..351..127B/abstract

erdogant commented 2 years ago

I did look into this, but ks_2samp does compare the underlying continuous distributions F(x) and G(x). In distfit, the empirical distribution is compared to the pdfs, so there is no MLE in between. If you have Python code available for the reference, I can look into that.

See ref here.

erdogant commented 2 years ago

If there is anything else regarding this issue, re-open or send me a pm.

marcellobullo commented 2 years ago

Hi Erdogan,

Thank you for looking into this issue.

By the way, the issue is in the pdfs you compare your data with. Such pdfs are in fact a family of distributions (e.g. Gaussian) whose parameters (e.g. mean and variance) are estimated using MLE (actually, the method used to estimate them doesn't matter; the problem is that you estimate those parameters from your sample). Precisely, this occurs in _compute_score_distribution, when distribution.fit(data) (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html) is invoked to estimate those parameters. Roughly speaking, comparing data with a distribution whose parameters are estimated from the same data is like "cheating", and the properties of the KS statistic are no longer preserved.
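
To illustrate the pattern I mean, here is a minimal sketch with made-up data and a Gaussian family (not distfit's actual code):

import numpy as np
from scipy import stats

data = np.random.normal(5, 2, 500)  # empirical user input data (made up for illustration)

# The Gaussian's parameters are estimated from the very same data ...
loc, scale = stats.norm.fit(data)

# ... and the same data are then scored against the fitted distribution.
# The null hypothesis is now composite, so the usual KS null distribution
# (and hence the reported p-value) no longer applies.
statistic, pvalue = stats.kstest(data, stats.norm(loc, scale).cdf)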

Hope this can help you.

Marcello

erdogant commented 2 years ago

I am not entirely following you, I think. The distribution and parameters are not estimated from the same data. On one side there is the empirical user input data with unknown distribution, and on the other side are the pdfs with known functional form. The stats, such as KS, are used to determine the best fit. What am I missing here?

marcellobullo commented 2 years ago

The KS statistic is used to test whether the empirical user input data can be seen as generated by a test distribution (no fitting involved), and here it is computed between two data sequences: one is the empirical user input data and the other is a sequence generated by a test distribution. Let's suppose the test distribution belongs to the Gaussian family, i.e. the set of all Gaussian random variables over varying mean and variance. In order to generate those samples you need a specific Gaussian random variable with a specific mean and variance. How do you choose the mean and variance of your Gaussian pdf to perform the KS test? You use the empirical user input data to "fit" (estimate) the parameters, and that does not preserve the properties of the KS statistic. What you can do instead is use parametric/non-parametric bootstrapping, as I suggested.
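
In code, the two-sample version of this looks roughly like the sketch below (the Gaussian family and the numbers are only assumptions for illustration):

import numpy as np
from scipy import stats

data = np.random.normal(163, 10, 1000)  # empirical user input data (made up for illustration)

# The Gaussian's mean and std are estimated from the same data ...
loc, scale = stats.norm.fit(data)

# ... and ks_2samp then compares the data with samples generated from that fit.
# The second sample depends on the first through (loc, scale), so the statistic
# no longer behaves as the two-sample KS theory assumes.
generated = stats.norm(loc, scale).rvs(1000)
statistic, pvalue = stats.ks_2samp(data, generated)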

I hope it is clear, but feel free to ask.

erdogant commented 2 years ago

Thanks! I will look into this but feel free to send a pull request!

ajbouh commented 2 years ago

Hi @erdogant, have you gotten a chance to look into this bootstrapping technique? I have read the paper that @marcellobullo linked to and the expressed concern makes sense to me.

I checked the notes in scipy for cramervonmises and I see a similar caveat:

It is important to keep in mind that the p-value is only accurate if one tests a simple hypothesis, i.e. the parameters of the reference distribution are known. If the parameters are estimated from the data (composite hypothesis), the computed p-value is not reliable.

... but nothing along these lines for ks_2samp.

marcellobullo commented 2 years ago

Hi @ajbouh @erdogant, you have a similar problem with the KS test when you want to perform a goodness-of-fit test. Basically, if you test a data sample against a target distribution whose parameters are known (prior to any test), then the KS statistic, when the null hypothesis holds, has a known limiting distribution (the Kolmogorov distribution). This is true independently of the target distribution, and it is fundamental for the computation of the p-value, because libraries use tabulated values. In simple words, if your statistic is too far away from the mean, you can conclude that your data sample has not been drawn from the target distribution.

Instead, if you estimate the target parameters from your data sample, then you have no guarantee that the KS statistic follows a Kolmogorov distribution, so the "canonical" computation of the p-value is no longer valid/reliable. One possible solution is to estimate the distribution of the KS statistic using bootstrapping, and to use that estimated distribution instead of the Kolmogorov one. Further details are in the paper I referenced. Hope this can be helpful.
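
To make the effect concrete, here is a small simulation sketch (not part of distfit; the Gaussian family and the sizes are just assumptions): when the null hypothesis holds but the parameters are estimated from the sample, the rejection rate of the standard KS test falls far below the nominal level, so the tabulated p-values are meaningless.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rejections = 0
for _ in range(1000):
    sample = rng.normal(0, 1, 100)       # data truly drawn from the null family
    loc, scale = stats.norm.fit(sample)  # parameters estimated from the same sample
    pvalue = stats.kstest(sample, stats.norm(loc, scale).cdf).pvalue
    rejections += pvalue < 0.05

# With a valid test this fraction would be close to 0.05; with estimated
# parameters it is essentially 0, i.e. the test is far too conservative.
print(rejections / 1000)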

ajbouh commented 2 years ago

Hi @marcellobullo, yes I've gone through the paper. It lays out the problem in a very clear fashion. I was very surprised to see that so few other "distribution fitting" packages in the python ecosystem acknowledge the issue.

I would have thought that the "right" way to do this bootstrapping would be to run ks_2samp once on the newly generated bootstrap dataset and get a single (less biased) value for the test. I would intuitively expect this to be just a few lines of python code in total.

But I found this implementation: https://github.com/swharden/Bootstrapped-KS2/blob/21753285deffd800d30d9a120e2473d633a4b3cb/src/python/ks.py#L135-L154 and I get the sense that their approach is not exactly the same as what's described in the paper you linked. They do resample the dataset, but they seem to aggregate the resulting ksTest values with np.mean and I'm not sure why.

The paper talks about two different bootstrap resampling solutions: Parametric Bootstrap and Nonparametric Bootstrap.

I think I understand the Parametric Bootstrap solution to be very straightforward:

  1. Resample the dataset from the estimated distribution using ~ 1000 samples. Pass these directly to ks_2samp (source)
  2. There is no step 2.

But I'm having trouble turning the text describing the Nonparametric Bootstrap solution into something I can implement. I don't understand the bias correction it's describing or when/where it should happen.

I also am unsure of when the parametric or nonparametric bootstrap solution should be preferred. Do you have any suggestions here?

marcellobullo commented 2 years ago

@ajbouh I quickly sketched the pseudo code for both approaches (better if you double-check). The difference is in how you perform the bootstrap re-sampling.

Parametric Bootstrapping

import numpy as np
from scipy.stats import kstest

data = ...            # empirical user input data
target_distr = ...    # candidate scipy.stats distribution
B = 1000              # bootstrap count
n = len(data)

# MLE fit of the target family on the data
target_params = target_distr.fit(data)

# KS statistic of the data against the fitted distribution
Dn = kstest(data, target_distr.cdf, args=target_params).statistic

Dns = []

# Bootstrapping:
# the goal here is to estimate the distribution of the KS statistic of the target
# distribution when the params are estimated from the data
for i in range(B):

    # Resample from the fitted target distribution
    resamples = target_distr.rvs(*target_params, size=n)

    # Refit on the resamples and recompute the KS statistic
    target_params_i = target_distr.fit(resamples)
    Dn_i = kstest(resamples, target_distr.cdf, args=target_params_i).statistic
    Dns.append(Dn_i)

Dns = np.array(Dns)
alpha = 0.05
Dn_alpha = np.quantile(Dns, 1 - alpha)
outcome = "rejected" if Dn > Dn_alpha else "passed"
pvalue = np.sum(Dns > Dn) / B

Non-parametric Bootstrapping

import numpy as np
from scipy.stats import kstest

data = ...            # empirical user input data
target_distr = ...    # candidate scipy.stats distribution
B = 1000              # bootstrap count
n = len(data)

# MLE fit of the target family on the data
target_params = target_distr.fit(data)

# KS statistic of the data against the fitted distribution
Dn = kstest(data, target_distr.cdf, args=target_params).statistic

Dns = []

# Bootstrapping
for i in range(B):

    # Resample from the data with replacement (this requires a bias correction later)
    resamples = np.random.choice(data, n)

    # Refit on the resamples and recompute the KS statistic
    target_params_i = target_distr.fit(resamples)
    Dn_i = kstest(resamples, target_distr.cdf, args=target_params_i).statistic

    # Bias correction
    corrected_Dn_i = Dn_i - Dn
    Dns.append(corrected_Dn_i)

Dns = np.array(Dns)
alpha = 0.05
Dn_alpha = np.quantile(Dns, 1 - alpha)
outcome = "rejected" if Dn > Dn_alpha else "passed"
pvalue = np.sum(Dns > Dn) / B
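
In both blocks, data and target_distr are left as placeholders; as an example (my assumption, not something fixed by the thread) they could be filled in like this before running either version:

import numpy as np
from scipy import stats

data = np.random.normal(163, 10, 1000)  # example input data (assumed)
target_distr = stats.norm               # candidate family to test against (assumed)
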
DiTo97 commented 1 year ago

Hi @marcellobullo,

Thanks for the pseudocode. The idea behind bootstrapping makes absolute sense: it avoids leaking information into the KS test when the target distribution is estimated from the data.

Has it been implemented in this package, or in other probability density fitting packages?

erdogant commented 1 year ago

Thank you all for the discussion and pseudocode! I finally found the time to implement the bootstrap approach! The summary plot is also updated with the bootstrap information to give an overview of the results. @marcellobullo, the function I created is inspired by your pseudocode. The bootstrap function outputs two results: bootstrap_score, which you define as pvalue, and bootstrap_pass, which you define as outcome.

Running the bootstrap is computationally heavy, so the default setting is None. I tried various implementations (mappings and parallelization) but did not notice a huge difference. The parallel function is still in the code, but commented out.

I also created a first text draft in the documentation pages.

from distfit import distfit
import numpy as np

# Random data
# X = np.random.exponential(0.5, 10000)
# X = np.random.uniform(0, 1000, 10000)
X = np.random.normal(163, 10, 10000)
# Initialize with bootstrapping
dfit = distfit(n_boots=10)
# Fit
results = dfit.fit_transform(X)
# Results
dfit.summary[['name', 'score', 'bootstrap_score', 'bootstrap_pass']]
# Plot
dfit.plot_summary()

(summary plot with bootstrap results)

Update to the latest version of distfit to use it! Should be version >= 1.6.0

pip install -U distfit

erdogant commented 1 year ago

I am closing this issue. Let me know if something is missing.