biolab / orange3

🍊 :bar_chart: :bulb: Orange: Interactive data analysis
https://orangedatamining.com

Venn diagram should allow matching by equality #5336

Closed wibrt closed 3 years ago

wibrt commented 3 years ago

Mathematically, a Venn diagram is qualitative. The widget already offers this as an option for the output (duplicates).

The problem: identical rows are not recognized as such. The warning given is: "Some variables have been renamed to avoid duplicates." The status reads: "Dataset #1 has no suitable identifiers."

A few screenshots to clarify.

The data: image

The Venn diagram with the given warning (and the wrong count): image

Part of the data table behind the Venn diagram (with duplicates, even though this was not selected in the Venn diagram): image

Orange version 3.27.1 on Linux, installed using pipenv.

maybe related to https://github.com/biolab/orange3/issues/4759

ajdapretnar commented 3 years ago

A couple of things here:

  1. Venn diagram is currently not intended for the task of filtering duplicates on a single data set as you intended. This can be done with Unique, where a unique identifier can be used. What is the issue here? Venn looks at "Instance identity", where each row is unique (hash). So it works correctly for the task - it outputs unique rows, where by definition each row is unique (not by values, but by identity). Perhaps we could add other options (variables to Rows, matched by), but this would replicate the functionality of Unique (which is intended to be moved to core Orange).
  2. The first warning is strange and probably wrong?

irgolic commented 3 years ago

> 1. Venn diagram is currently not intended for the task of filtering duplicates on a single data set as you intended. This can be done with Unique, where a unique identifier can be used. What is the issue here? Venn looks at "Instance identity", where each row is unique (hash). So it works correctly for the task - it outputs unique rows, where by definition each row is unique (not by values, but by identity). Perhaps we could add other options (variables to Rows, matched by), but this would replicate the functionality of Unique (which is intended to be moved to core Orange).

I'm down for adding this, even though it may duplicate Unique's functionality.

Sidenote: What would happen if we omitted the Selected column in the Data signal, should there be no selection?

janezd commented 3 years ago

@ajdapretnar, this reminds me that Unique is still in Prototypes. Can you take a look at https://github.com/biolab/orange3-prototypes/pull/241 (I know you already have a lot on your plate today!), merge if OK, and then I'd create a PR in the core repo?

It has an icon, but no docs, obviously. :)

wibrt commented 3 years ago

The reason for adding this would be the meaning of a Venn diagram in mathematics: using the same terminology avoids confusion. This was the case in earlier versions of Orange; the behaviour of this widget changed at some point.

What seems strange is that the hashes for identical rows are different. Maybe the arbitrary row number is included in the hash? (By arbitrary I mean the same dataset, but in a different order.)

janezd commented 3 years ago

And how would you define the meaning of Venn diagram in mathematics? In what sense does Orange misuse mathematical terminology?

I guess your problem is in the definition of sameness. It is neither about hashes nor about row numbers nor about equality.
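
To make the distinction concrete, here is a minimal Python sketch. It is not Orange's actual code: it assumes a simplified model in which every row receives a fresh identifier when it is loaded, and the `Row` class and `load` helper are hypothetical names introduced only for illustration. Under that assumption, two rows with the same values compare equal by value but never share an identity.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count()  # hypothetical id source: one new id per loaded row


@dataclass(frozen=True)
class Row:
    values: tuple
    # Identity is assigned at load time and is independent of the values.
    # compare=False keeps it out of == and hash(), which use values only.
    row_id: int = field(default_factory=lambda: next(_next_id), compare=False)


def load(records):
    """Wrap raw records in Row objects, giving each one a fresh id."""
    return [Row(tuple(r)) for r in records]


a = load([("x", 1), ("y", 2)])
b = load([("y", 2), ("x", 1)])  # same data, different order, loaded separately

# Matching by identity: no overlap, because ids are never shared.
by_identity = {r.row_id for r in a} & {r.row_id for r in b}

# Matching by value: both rows overlap.
by_value = {r.values for r in a} & {r.values for r in b}
```

In this model, a subset taken from one loaded list keeps its row ids and therefore overlaps with its parent, while a separately loaded copy of the same data does not.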

wibrt commented 3 years ago

Generally speaking, if you add an element x to a set A, and then add x to A again, it should appear only once.

Indeed, as you say: the definition of sameness. (One can debate the definition applied in this case, but that is not my intention. The remark above about hashing was only meant as background on how it might be implemented; I haven't checked it.)

For data mining in practice, my opinion is that having several options for defining what counts as the same would help a lot, as in the new title @irgolic gave this topic: allowing matching by equality.
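
A rough sketch of what selectable definitions of sameness could look like, in plain Python. The function `venn_regions` and its `key` parameter are hypothetical and do not correspond to any existing option in the widget; rows are modelled as plain tuples.

```python
def venn_regions(left, right, key=lambda row: row):
    """Split two collections of rows into left-only, shared, and right-only
    groups, treating rows as the same when key(row) is equal."""
    left_keys = {key(r) for r in left}    # a set keeps each key only once
    right_keys = {key(r) for r in right}
    return {
        "left_only": left_keys - right_keys,
        "both": left_keys & right_keys,
        "right_only": right_keys - left_keys,
    }


data_1 = [("ann", 30), ("bob", 25), ("bob", 25)]  # contains a duplicate row
data_2 = [("bob", 25), ("cid", 41)]

# Match on whole-row equality.
by_equality = venn_regions(data_1, data_2)

# Match on a single column (here the name at position 0).
by_name = venn_regions(data_1, data_2, key=lambda row: row[0])
```

Because the regions are built from sets, the duplicated `("bob", 25)` row is counted once, which is the set behaviour described above.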