biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

Continuize: multiple attributes with the same name #6877

Open ZanMervic opened 3 weeks ago

ZanMervic commented 3 weeks ago

What's wrong?

This issue is related to the Discretization issue #6876. If the input to the Continuize widget contains an attribute with several values that share the same name (see issue #6876 for a better explanation), one-hot encoding creates multiple attributes with the same name, which results in an exception.

Workflow I used (an extension of the workflow from issue #6876): image

Exception:

Traceback (most recent call last):
  File "C:\Users\zanme\work\orange3\orange3\Orange\widgets\data\owcontinuize.py", line 458, in _on_radio_clicked
    self.commit.deferred()
  File "C:\Users\zanme\miniconda3\envs\orange3\Lib\site-packages\orangewidget\gui.py", line 2006, in conditional_commit
    do_commit()
  File "C:\Users\zanme\miniconda3\envs\orange3\Lib\site-packages\orangewidget\gui.py", line 2014, in do_commit
    commit.call()
  File "C:\Users\zanme\miniconda3\envs\orange3\Lib\site-packages\orangewidget\gui.py", line 1879, in call
    acting_func(instance)
  File "C:\Users\zanme\work\orange3\orange3\Orange\widgets\data\owcontinuize.py", line 517, in commit
    self.Outputs.data.send(self._prepare_output())
                           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\zanme\work\orange3\orange3\Orange\widgets\data\owcontinuize.py", line 534, in _prepare_output
    return self.data.transform(Domain(attrs, class_vars, metas))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\zanme\work\orange3\orange3\Orange\data\domain.py", line 154, in __init__
    raise Exception('All variables in the domain should have'
Exception: All variables in the domain should have unique names.
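
The failure mode can be sketched without Orange. The snippet below is only an illustration: the `name=value` naming scheme, the variable name `PC1`, the value labels, and both helper functions are assumptions made for the example, not Orange's actual code.

```python
from collections import Counter


def one_hot_names(var_name, values):
    """Return one indicator-column name per categorical value (assumed naming scheme)."""
    return [f"{var_name}={value}" for value in values]


def check_unique(names):
    """Hypothetical stand-in for the domain check: reject repeated names."""
    duplicates = sorted(name for name, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate variable names: {duplicates}")
    return names


# Hypothetical discretized variable in which two bins carry the same label.
values = ["< 1", "1 - 2", "1 - 2", "≥ 2"]
names = one_hot_names("PC1", values)

try:
    check_unique(names)
except ValueError as err:
    # The repeated value label produced two identically named columns.
    print(err)
```

Each value becomes one indicator column, so a repeated value label necessarily yields a repeated column name, which a uniqueness check on the output domain then rejects.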

Screenshot of the raised exception and the two attributes with the same name:

image

Note

Because of this issue, a test was failing for the ScoringSheet widget. I have temporarily excluded the widget from the test; it should be included again once the issue is resolved.

Test: Orange.tests.test_classification.LearnerAccessibility.test_all_models_work_after_unpickling_pca

How can we reproduce the problem?

Zip of the workflow: continuize_bug.zip

To reproduce the problem, set the PCA components to 8 in the provided workflow. image

What's your environment?

janezd commented 3 weeks ago

Code may assume that values of categorical variables are unique. The bug is thus in discretization. Adding np.unique, as I suggested in a comment in #6876, resolves it.
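
A minimal sketch of why deduplicating thresholds helps, assuming the duplicated values come from repeated cut points (for example, when many data points are tied). The `bin_labels` helper and its label format are hypothetical and only loosely modeled on interval labels; `np.unique` is the only real library call used.

```python
import numpy as np


def bin_labels(points):
    """Build one label per bin from sorted thresholds (illustrative format only)."""
    pts = [f"{p:g}" for p in points]
    inner = [f"{low} - {high}" for low, high in zip(pts, pts[1:])]
    return [f"< {pts[0]}"] + inner + [f"≥ {pts[-1]}"]


# Hypothetical cut points in which 2.5 is repeated three times.
points = np.array([1.0, 2.5, 2.5, 2.5, 4.0])

labels_before = bin_labels(points)            # contains "2.5 - 2.5" twice
labels_after = bin_labels(np.unique(points))  # thresholds reduced to 1, 2.5, 4

print(labels_before)
print(labels_after)
```

`np.unique` returns the sorted distinct thresholds, so every remaining bin has a distinct pair of bounds and therefore a distinct label.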

I nevertheless made #6878 to prevent construction of variables with duplicated values, so any future bugs that result in duplicated values will be reported earlier, at the appropriate place.
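
The idea of reporting the problem at construction time could look roughly like this. The class below is a hypothetical stand-in written for illustration; it is not Orange's variable class and makes no claim about what #6878 actually changes.

```python
class CategoricalVariable:
    """Hypothetical categorical variable that refuses duplicated values."""

    def __init__(self, name, values):
        values = tuple(values)
        # Fail here, where the bad values are produced, rather than later
        # when some downstream code assumes the values are unique.
        if len(set(values)) != len(values):
            raise ValueError(f"variable {name!r} has duplicated values: {values}")
        self.name = name
        self.values = values


ok = CategoricalVariable("PC1", ["< 1", "1 - 2", "≥ 2"])

try:
    CategoricalVariable("PC1", ["< 1", "1 - 2", "1 - 2", "≥ 2"])
except ValueError as err:
    print(err)
```

With such a check, the error points at the code that built the variable instead of surfacing later in an unrelated widget.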