biolab / orange3

Orange: Interactive data analysis
https://orangedatamining.com

Discretize: uncaught error and strange binning #5253

Closed ajdapretnar closed 3 years ago

ajdapretnar commented 3 years ago

First issue: Naive Bayes fails with a strange error (an AssertionError without context). On investigation, the problem turned out to be in Discretize, which tries to discretize into 4 bins when there are only 3 values. Second issue: Discretize returns strange (likely wrong) bins.

discretize-error.zip

First error:

Traceback (most recent call last):
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangewidget/gui.py", line 607, in eventFilter
    self.__onEditingFinished()
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangewidget/gui.py", line 551, in __onEditingFinished
    self.__commitValue(value)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangewidget/gui.py", line 565, in __commitValue
    self.cfunc()
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangewidget/gui.py", line 1923, in __call__
    self.func(**kwds)
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owdiscretize.py", line 662, in _default_disc_changed
    self._update_points()
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owdiscretize.py", line 599, in _update_points
    points, dvar = induce_cuts(state.method, self.data, var)
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owdiscretize.py", line 585, in induce_cuts
    dvar = _dispatch[type(method)](method, data, var)
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owdiscretize.py", line 48, in <lambda>
    lambda m, data, var: _dispatch[type(m.method)](m.method, data, var),
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owdiscretize.py", line 51, in <lambda>
    EqualFreq: lambda m, data, var: disc.EqualFreq(m.k)(data, var),
  File "/Users/ajda/orange/orange3/Orange/preprocess/discretize.py", line 150, in __call__
    return Discretizer.create_discretized_var(
  File "/Users/ajda/orange/orange3/Orange/preprocess/discretize.py", line 71, in create_discretized_var
    values = [
  File "/Users/ajda/orange/orange3/Orange/preprocess/discretize.py", line 72, in <listcomp>
    cls._fmt_interval(low, high, fmt)
  File "/Users/ajda/orange/orange3/Orange/preprocess/discretize.py", line 53, in _fmt_interval
    assert low is None or high is None or low < high
AssertionError

The problem is that the error is badly formatted in Test and Score and unclear in the Naive Bayes widget.

Second issue: Why is this discretized to 3 bins when there are only 2 distinct values?

Var      Discretized
99.93    < 99.93
99.93    < 99.93
99.93    < 99.93
99.93    < 99.93
99.93    99.93 - 99.94
99.93    99.93 - 99.94
99.93    99.93 - 99.94
99.93    99.93 - 99.94
99.93    99.93 - 99.94
99.93    99.93 - 99.94
99.95    ≥ 99.94
99.95    ≥ 99.94
99.95    ≥ 99.94

janezd commented 3 years ago

> Why is this discretized to 3 bins when there are only 2 distinct values?

Count 'em.

Var                  Category
continuous           Bad Good
                     class
99.93000000000002    Good
99.93000000000002    Good
99.93000000000002    Good
99.93000000000002    Good
99.93000000000004    Bad
99.93000000000004    Bad
99.93000000000004    Good
99.93000000000004    Good
99.93000000000004    Good
99.93000000000004    Good
99.95000000000002    Bad
99.95000000000002    Bad
99.95000000000006    Bad

ajdapretnar commented 3 years ago

I know about the rounding error, but how does the same float get rounded differently?

janezd commented 3 years ago

I don't understand: what do you mean by the same float being "rounded differently"? This data is read from a file in which the variable has four distinct values, hence three (hum, three?) bins. I think that Orange works correctly regarding the second issue.

The first one is more interesting: the differences between thresholds are below precision, so two consecutive thresholds are the same and the assertion fails. I added np.unique to handle this; now I'm trying to write a test.
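
To illustrate the np.unique idea, here is a minimal sketch. It is not the actual patch to Orange: the helper name `dedupe_cut_points`, the choice to round to two decimals, and the sample thresholds are all assumptions made for the example. It shows how two thresholds that differ only below the working precision collapse to the same value (which is what would trip a `low < high` check), and how `np.unique` leaves a strictly increasing set of cut points.

```python
import numpy as np


def dedupe_cut_points(points, decimals):
    """Hypothetical helper: round cut points, then drop repeated values."""
    rounded = np.round(np.asarray(points, dtype=float), decimals)
    # np.unique returns the sorted distinct values.
    return np.unique(rounded)


# Made-up thresholds; the first two differ only far below two decimals.
raw_points = [99.93000000000003, 99.93000000000004, 99.94000000000003]

rounded = np.round(raw_points, 2)
# Consecutive equal thresholds would make a `low < high` check fail.
has_empty_interval = bool(np.any(np.diff(rounded) <= 0))

cuts = dedupe_cut_points(raw_points, decimals=2)
strictly_increasing = bool(np.all(np.diff(cuts) > 0))

print(has_empty_interval, strictly_increasing, len(cuts))
```

Dropping duplicates means fewer bins than requested may come back, which seems preferable to an uncaught assertion when the data cannot support that many intervals.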

ajdapretnar commented 3 years ago

I identified the problem. The data was sent through Pivot Table with mean as the aggregation. Even though all the instances in a group have the same value (99.93), the "rounding error" appears once they are grouped, but it is not visible in the data table (the value is displayed as 99.93 even when it is not exactly that). Hence the misunderstanding. I agree that, for the second issue, Orange indeed works as expected.
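
For anyone hitting the same confusion, a small sketch of why the table looks like it has two values while Discretize sees four. It uses the four values listed earlier in this thread; the two-decimal display format is an assumption about how the data table renders the column.

```python
# The four values from the .tab listing earlier in the thread.
values = [
    99.93000000000002,
    99.93000000000004,
    99.95000000000002,
    99.95000000000006,
]

# As floats they are all different, so a discretizer sees four values.
distinct_floats = sorted(set(values))

# Shown with two decimals (assumed display precision) only two remain.
displayed = sorted({f"{v:.2f}" for v in values})

print(len(distinct_floats))
print(displayed)
```

If the extra digits are only noise from the aggregation, rounding the column to its meaningful precision before discretizing would be one way to make the bins match what the table shows.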