I did look into this, but the ks_2samp statistic compares the underlying continuous distributions F(x) and G(x). In distfit, the empirical distribution is compared to the pdfs, so there is no MLE in between. If you have Python code available for the reference, I can look into that.
See ref here.
If there is anything else regarding this issue, re-open or send me a pm.
Hi Erdogan,
Thank you for checking this issue.
However, the issue is in the pdfs you compare your data with. Those pdfs are in fact a family of distributions (e.g. Gaussian) whose parameters (e.g. mean and variance) are estimated using MLE (actually, the method you use to estimate them doesn't matter; the problem is that you are estimating those parameters from your sample). Precisely, this occurs in _compute_score_distribution, when the function distribution.fit(data) (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html) is invoked to estimate those parameters. Roughly speaking, comparing data with a distribution whose parameters were estimated from the same data is like "cheating", and the properties of the KS statistic are no longer preserved.
Hope this helps.
Marcello
I am not entirely following you, I think. The distribution and parameters are not estimated from the same data. On one side there is the empirical user input data with unknown distribution, and on the other side are the pdfs with known functional form. The statistics, such as KS, are used to determine the best fit. What am I missing here?
The KS statistic is used to test whether the empirical user input data can be seen as generated by a test distribution (no fitting involved), and it is computed between two data sequences: one is the empirical user input data, and the other is a sequence generated by a test distribution. Suppose the test distribution belongs to the Gaussian family, which is the set of all possible Gaussian random variables over varying mean and variance. In order to generate those samples, you need a specific Gaussian random variable with a specific mean and variance. How do you choose the mean and variance of your Gaussian pdf to perform the KS test? You use the empirical user input data to "fit" (estimate) the parameters, and that does not preserve the properties of the KS statistic. What you can do instead is use parametric/non-parametric bootstrapping, as I suggested.
I hope this is clear, but feel free to ask.
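To make this concrete, here is a minimal sketch of the problematic pattern (my own illustration with scipy.stats.norm, not distfit's actual code):

import numpy as np
from scipy import stats

# Empirical user input data (illustrative)
data = np.random.normal(0, 1, 500)

# The parameters of the candidate (Gaussian) family are estimated
# from the data itself...
mu, sigma = stats.norm.fit(data)

# ...and the very same data is then tested against the fitted pdf.
# The classical p-value assumes the parameters were fixed *before*
# seeing the data, so here it is biased towards acceptance.
statistic, pvalue = stats.kstest(data, 'norm', args=(mu, sigma))
print(statistic, pvalue)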
Thanks! I will look into this but feel free to send a pull request!
Hi @erdogant, have you gotten a chance to look into this bootstrapping technique? I have read the paper that @marcellobullo linked to and the expressed concern makes sense to me.
I checked the notes in scipy for cramervonmises and I see a similar caveat:
"It is important to keep in mind that the p-value is only accurate if one tests a simple hypothesis, i.e. the parameters of the reference distribution are known. If the parameters are estimated from the data (composite hypothesis), the computed p-value is not reliable."
... but nothing along these lines for ks_2samp.
Hi @ajbouh @erdogant, you have a similar problem with the KS test when you want to perform a goodness-of-fit test. Basically, if you test a data sample against a target distribution whose parameters are known (prior to any test), then the KS statistic, when the null hypothesis holds, has a known limiting distribution (the Kolmogorov distribution). This holds independently of the target distribution, and it is fundamental for the computation of the p-value, because every library uses tabulated values. In simple terms, if your statistic is too far out in the tail of that distribution, you can conclude that your data sample was not drawn from the same distribution as the target.
If, instead, you estimate the target parameters from your data sample, then you have no guarantee that the KS statistic follows a Kolmogorov distribution, so the "canonical" computation of the p-value is no longer valid/reliable. One possible solution is to estimate the distribution of the KS statistic via bootstrapping and use that estimated distribution instead of the Kolmogorov one. Further details are in the paper I referenced. Hope this is helpful.
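You can see the effect with a quick simulation (a sketch of mine, not from the paper): even when the null hypothesis is true by construction, the composite-hypothesis p-values are far from uniform:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvalues = []
for _ in range(1000):
    # The null hypothesis holds by construction: the sample IS Gaussian
    sample = rng.normal(5, 2, 200)
    # Composite hypothesis: fit mean/std on the sample, then test the same sample
    mu, sigma = stats.norm.fit(sample)
    pvalues.append(stats.kstest(sample, 'norm', args=(mu, sigma)).pvalue)

# A valid test gives uniform p-values, so ~5% should fall below 0.05.
# With parameters fitted from the same data, almost none do.
print(np.mean(np.array(pvalues) < 0.05))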
Hi @marcellobullo, yes I've gone through the paper. It lays out the problem in a very clear fashion. I was very surprised to see that so few other "distribution fitting" packages in the Python ecosystem acknowledge the issue.
I would have thought that the "right" way to do this bootstrapping would be to run ks_2samp once on the newly generated bootstrap dataset and get a single (less biased) value for the test. I would intuitively expect this to be just a few lines of Python code in total.
But I found this implementation: https://github.com/swharden/Bootstrapped-KS2/blob/21753285deffd800d30d9a120e2473d633a4b3cb/src/python/ks.py#L135-L154 and I get the sense that their approach is not exactly the same as what's described in the paper you linked. They do resample the dataset, but they seem to aggregate the resulting ksTest values with np.mean, and I'm not sure why.
The paper talks about two different bootstrap resampling solutions: Parametric Bootstrap and Nonparametric Bootstrap.
I think I understand the Parametric Bootstrap solution to be very straightforward: run ks_2samp on a sample drawn from the fitted distribution (source). But I'm having trouble turning the text describing the Nonparametric Bootstrap solution into something I can implement. I don't understand the bias correction it's describing or when/where it should happen.
I am also unsure of when the parametric or the nonparametric bootstrap solution should be preferred. Do you have any suggestions here?
@ajbouh I quickly sketched the pseudocode for both approaches (better if you double-check). The difference is in how you perform the bootstrap resampling.
Parametric Bootstrapping
import numpy as np
from scipy.stats import kstest

data = ...          # empirical user input data
target_distr = ...  # a scipy.stats continuous distribution, e.g. scipy.stats.norm
B = 1000            # bootstrap count
n = len(data)

# MLE: estimate the target parameters from the data
target_params = target_distr.fit(data)

# KS statistic of the data against the fitted target distribution
Dn = kstest(data, target_distr.cdf, args=target_params).statistic

# Bootstrapping:
# the goal here is to estimate the distribution of the KS statistic of the
# target distribution when the params are estimated from data
Dns = []
for i in range(B):
    # Resample from the fitted target distribution
    resamples = target_distr.rvs(*target_params, size=n)
    # Re-estimate the parameters from the resample
    target_params_i = target_distr.fit(resamples)
    Dn_i = kstest(resamples, target_distr.cdf, args=target_params_i).statistic
    Dns.append(Dn_i)

alpha = 0.05
Dn_alpha = np.quantile(Dns, 1 - alpha)
outcome = "Rejected" if Dn > Dn_alpha else "Passed"
pvalue = np.sum(np.array(Dns) > Dn) / B
Non-parametric Bootstrapping
import numpy as np
from scipy.stats import kstest

data = ...          # empirical user input data
target_distr = ...  # a scipy.stats continuous distribution, e.g. scipy.stats.norm
B = 1000            # bootstrap count
n = len(data)

# MLE: estimate the target parameters from the data
target_params = target_distr.fit(data)

# KS statistic of the data against the fitted target distribution
Dn = kstest(data, target_distr.cdf, args=target_params).statistic

# Bootstrapping
Dns = []
for i in range(B):
    # Resample with replacement from the data (this requires a bias correction later)
    resamples = np.random.choice(data, size=n, replace=True)
    # Re-estimate the parameters from the resample
    target_params_i = target_distr.fit(resamples)
    Dn_i = kstest(resamples, target_distr.cdf, args=target_params_i).statistic
    # Bias correction
    corrected_Dn_i = Dn_i - Dn
    Dns.append(corrected_Dn_i)

alpha = 0.05
Dn_alpha = np.quantile(Dns, 1 - alpha)
outcome = "Rejected" if Dn > Dn_alpha else "Passed"
pvalue = np.sum(np.array(Dns) > Dn) / B
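If it helps, here is a self-contained runner for both sketches above (my own code; bootstrap_ks is a hypothetical helper name, not part of distfit or scipy, and the scalar bias correction follows the pseudocode above):

import numpy as np
from scipy.stats import kstest, norm

def bootstrap_ks(data, target_distr, B=1000, alpha=0.05, parametric=True):
    # Fit the target parameters from the data (composite hypothesis)
    params = target_distr.fit(data)
    Dn = kstest(data, target_distr.cdf, args=params).statistic
    n = len(data)
    Dns = np.empty(B)
    for i in range(B):
        if parametric:
            # Parametric: resample from the fitted target distribution
            resample = target_distr.rvs(*params, size=n)
        else:
            # Non-parametric: resample with replacement from the data
            resample = np.random.choice(data, size=n, replace=True)
        params_i = target_distr.fit(resample)
        Dn_i = kstest(resample, target_distr.cdf, args=params_i).statistic
        # Scalar bias correction in the non-parametric case, as in the pseudocode
        Dns[i] = Dn_i if parametric else Dn_i - Dn
    pvalue = np.mean(Dns > Dn)
    passed = Dn <= np.quantile(Dns, 1 - alpha)
    return Dn, pvalue, passed

# Gaussian data tested against the Gaussian family should usually pass
data = np.random.normal(163, 10, 500)
print(bootstrap_ks(data, norm, B=200, parametric=True))
print(bootstrap_ks(data, norm, B=200, parametric=False))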
Hi @marcellobullo,
Thanks for the pseudocode. The idea behind bootstrapping makes absolute sense to avoid leaking information into the KS test when the target distribution is estimated from the data.
Has it been implemented in this package, or in other probability density fitting packages?
Thank you all for the discussion and pseudocode! I did find the time to finally implement the bootstrap approach!
The summary plot is also updated with the bootstrap information to give an overview of the results. @marcellobullo the function I created is inspired by your pseudocode. The bootstrap function outputs two results: a bootstrap_score, which you define as pvalue, and a bootstrap_pass, which you define as outcome.
Running the bootstrap is computationally heavy, so the default setting is None. I tried various implementations (mappings and parallelization) but did not notice a huge difference. The parallel function is still in the code, but commented out.
I also created a first text draft in the documentation pages.
import numpy as np
from distfit import distfit

# Random data
# X = np.random.exponential(0.5, 10000)
# X = np.random.uniform(0, 1000, 10000)
X = np.random.normal(163, 10, 10000)
# Initialize with bootstrapping
dfit = distfit(n_boots=10)
# Fit
results = dfit.fit_transform(X)
# Results
dfit.summary[['name', 'score', 'bootstrap_score', 'bootstrap_pass']]
# Plot
dfit.plot_summary()
Update to the latest version of distfit to use it! Should be version >= 1.6.0
pip install -U distfit
I am closing this issue. Let me know if something is missing.
Hello everyone,
I noticed in the code (erdogant/distfit/distfit.py) that whenever you use the KS statistical test (stats=ks), you call scipy.stats.ks_2samp to test your data against the distribution you estimated through MLE (maximum likelihood estimation). Is that true? If so, this is wrong, because the KS statistic then depends on your data and the test is no longer valid. In such a case, I would recommend having a look at parametric/non-parametric bootstrapping to solve the issue. This reference could be useful: https://ui.adsabs.harvard.edu/abs/2006ASPC..351..127B/abstract