ChiSquare error being thrown: can an explanation be provided instead?

javadba commented 3 years ago

I am trying to use this library as either a binary indicator of "Benford or not" or a probability indicator of the same, so it should be possible to pass any distribution into it. If the distribution is weird, it should just say "sorry, nope."

Instead consider:

import numpy as np
from benfordslaw import benfordslaw

bl = benfordslaw(alpha=0.05)
x = np.linspace(0, 1000, 1001)
x = np.append(x, [1, 1, 1, 1, 1, 1])
isben2 = bl.fit(x)
print(f"isben2 {isben2}")

Instead of a "nope" we get:

ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are: 0.0009950248756218905

Note that even just using x without the extra np.append() results in the same error. So what does this mean? Should I add my own code to catch that exception and then say "nope"? The problem with that is that we don't get any probability, and it is also unclear whether the exception was due to some other unexplained data problem.

FYI, the entire stack trace is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-8d38f0d591c2>", line 4, in <module>
    isben2 = bl.fit(x)
  File "/usr/local/lib/python3.9/site-packages/benfordslaw/benfordslaw.py", line 109, in fit
    tstats, Praw = chisquare(counts_emp, f_exp=counts_exp)
  File "/usr/local/lib/python3.9/site-packages/scipy/stats/stats.py", line 6852, in chisquare
    return power_divergence(f_obs, f_exp=f_exp, ddof=ddof, axis=axis,
  File "/usr/local/lib/python3.9/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
    raise ValueError(msg)
ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.0009950248756218905
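
On the question of catching the exception: a minimal sketch of a wrapper that reports "nope" only for this specific SciPy complaint and re-raises anything else. The name safe_fit and the substring match on the message are assumptions made for illustration and are not part of benfordslaw; fit stands for any callable such as bl.fit.

```python
def safe_fit(fit, data):
    """Call fit(data); return None only for the chi-square sum-mismatch error.

    Hypothetical helper: matching on the message text is brittle and is an
    assumption here, since the wording could change between SciPy releases.
    """
    try:
        return fit(data)
    except ValueError as err:
        if "sum of the observed frequencies" in str(err):
            # Caller can print "sorry, nope." but no p-value is available.
            return None
        # Any other data problem still surfaces as an exception.
        raise
```

Usage would look like result = safe_fit(bl.fit, x). A None result still carries no probability, which is the limitation raised above.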

erdogant commented 3 years ago

Thanks for reporting! I will look into it.

erdogant commented 3 years ago

I cannot reproduce the error. Can you tell me which version you are using?

import benfordslaw
print(benfordslaw.__version__)

It should be >= 1.0.3. I also included a boolean output in the latest release (1.0.3) under the key P_significant, so you can do the "sorry, nope." now.

I get the following results when using your code:

import numpy as np
from benfordslaw import benfordslaw
bl = benfordslaw(alpha=0.05)
x = np.linspace(0,1000,1001)
x = np.append(x,[1,1,1,1,1,1,])
isben2 = bl.fit(x)
print(f"isben2 {isben2}")
print(f"P_significant: {isben2['P_significant']}")
if not isben2['P_significant']:
    print("sorry, nope.")

[benfordslaw] >Analyzing digit position: [1]
[benfordslaw] >[chi2] Anomaly detected! P=5.47798e-80, Tstat=393.145
isben2 {'P': 5.4779835775992096e-80, 't': 393.14541596537117, 'percentage_emp': array([[ 1.        , 11.72962227],
       [ 2.        , 11.03379722],
       [ 3.        , 11.03379722],
       [ 4.        , 11.03379722],
       [ 5.        , 11.03379722],
       [ 6.        , 11.03379722],
       [ 7.        , 11.03379722],
       [ 8.        , 11.03379722],
       [ 9.        , 11.03379722]])}
P_significant: True
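
For reference, the Tstat and the percentages above can be reproduced by hand. The sketch below is not the library's code. It assumes that the 0 in the input is dropped (it has no leading nonzero digit) and that the expected counts are the Benford probabilities log10(1 + 1/d) times the sample size, rounded to whole numbers. Under those assumptions the Pearson statistic comes out at roughly 393.145 and digit 1 at roughly 11.73 percent.

```python
import math

# Observed first-digit counts for 1..1000 plus six extra 1s.
# Assumption: the 0 is dropped because it has no leading nonzero digit.
values = list(range(1, 1001)) + [1] * 6
observed = [0] * 9
for v in values:
    observed[int(str(v)[0]) - 1] += 1

n = sum(observed)

# Assumption: expected counts are Benford probabilities times n, rounded.
expected = [round(n * math.log10(1 + 1 / d)) for d in range(1, 10)]

# Pearson chi-square statistic: sum of (O - E)^2 / E over the nine digits.
tstat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
percentages = [100 * o / n for o in observed]

print(observed)
print(round(tstat, 3), round(percentages[0], 4))
```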

javadba commented 3 years ago

import benfordslaw
   ...: print(benfordslaw.__version__)
1.0.2

I installed it from PyPI with pip3 two days ago. Can you update PyPI?

erdogant commented 3 years ago

Update with:

pip install -U benfordslaw

javadba commented 3 years ago

Did the update:

In [7]: import benfordslaw
   ...: print(benfordslaw.__version__)
1.0.3

But I still get the same result as originally:

ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are: 0.0009950248756218905

erdogant commented 3 years ago

Well, this one is tricky, apparently. More information about it can be found here. It is a feature of the chi-square test in SciPy, not a bug.

You can either change your input slightly (remove one of the 1s):

x = np.linspace(0,1000,1001)
x = np.append(x,[1,1,1,1,1])

Or you can use another method:

bl = benfordslaw(alpha=0.05, method='ks')

I created a new update that gives better information about what to do in such a case.

pip install -U benfordslaw

erdogant commented 2 years ago

I made a small modification: the expected counts are no longer rounded, so their values stay exact. As a result, this error is not thrown anymore!

Update with:

pip install -U benfordslaw
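
Why rounding mattered can be seen with a few lines of plain Python. This is a sketch under stated assumptions, not the library's code: n = 1006 is the number of usable first digits in the example input (the 0 is assumed to be dropped), and relative_diff is a hypothetical helper that measures the gap between the two totals against the smaller one, which is assumed to mirror SciPy's check. With rounded expected counts the totals are 1006 versus 1005, a relative difference of 1/1005 (about 0.000995, the figure in the error message). With exact counts the totals agree within the 1e-08 tolerance.

```python
import math

n = 1006  # usable first digits in the example input (assumption: 0 is dropped)

# Expected Benford counts, exact and rounded to whole numbers.
exact = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
rounded = [round(c) for c in exact]


def relative_diff(total_obs, total_exp):
    # Hypothetical helper: gap between the totals relative to the smaller one.
    return abs(total_obs - total_exp) / min(total_obs, total_exp)


print(sum(rounded), relative_diff(n, sum(rounded)))  # rounded totals disagree
print(relative_diff(n, sum(exact)) < 1e-8)           # exact totals agree
```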

erdogant / benfordslaw

ChiSquare error being thrown: can an explanation be provided instead? #7