biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

Box Plot: sparse variables are ranked wrong #5699

Closed ajdapretnar closed 2 years ago

ajdapretnar commented 2 years ago

What's wrong?

When analyzing topic distinctions with Box Plot, one would select a categorical variable to group by and use "Order by relevance to subgroups" to rank the variables by how well they separate the categorical values.

In text mining (and I guess elsewhere too), it can happen that a specified group contains only one document, so Box Plot cannot compute significance, which is fine.

The bug is that such a variable is placed before one where significance can be computed. Is this expected? A bunch of variables where all instances belong to one group are also placed very high. Of course, I expect to see these variables in the list, but perhaps there could be a checkbox (gosh, another one?) that would hide them? Or do we need a separate widget for text mining?

How can we reproduce the problem?

Corpus (book-excerpts) - Preprocess Text (default) - Bag of Words (uncheck Hide bow attributes) - Box Plot. Use Category as Subgroup, then observe the ranked order of variables.

What's your environment?

ajdapretnar commented 2 years ago
[Screenshot: Screen Shot 2021-11-22 at 14 29 17]
ajdapretnar commented 2 years ago

Upon inspection, it turns out there are two underlying issues:

1.) Sparse columns are not treated correctly. The absence of a word is represented as NaN, and Box Plot ignores NaN values when computing averages. NaNs should be treated as 0 in this case. Example: 4 documents out of 5 in category A have the word "orange", while only 1 document out of 3 in category B has this word. Box Plot reports "At least one group has just one instance", even though category B has 3 documents. The averages should be 4/5 for category A and 1/3 for category B.

2.) When one category really does contain just one instance, such variables should still be ranked correctly. Currently they appear in a somewhat random order.
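
To make the first point concrete, here is a minimal sketch in plain Python (not Orange's actual code) that reproduces the "orange" example. The data layout and the `group_stats` helper are hypothetical, made up for illustration; they only assume that a missing word shows up as NaN in the column.

```python
import math

# Hypothetical bag-of-words column for the word "orange":
# 1.0 = word present, NaN = word absent (as described in the issue).
nan = math.nan
values = [1.0, 1.0, 1.0, 1.0, nan, 1.0, nan, nan]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]


def group_stats(values, groups, nan_as_zero):
    """Return {group: (count, mean)} over the values that are kept."""
    kept = {}
    for value, group in zip(values, groups):
        if math.isnan(value):
            if not nan_as_zero:
                continue  # dropping NaN shrinks the group
            value = 0.0  # treat "word absent" as a count of zero
        kept.setdefault(group, []).append(value)
    return {g: (len(v), sum(v) / len(v)) for g, v in kept.items()}


# NaNs dropped: category B is left with a single instance.
dropped = group_stats(values, groups, nan_as_zero=False)
# NaNs treated as 0: B keeps all 3 documents, means are 4/5 and 1/3.
filled = group_stats(values, groups, nan_as_zero=True)

print(dropped)
print(filled)
```

With NaNs dropped, category B ends up with a single value, which would explain the "At least one group has just one instance" message. With NaNs treated as 0, both groups keep their full size and the means match the expected 4/5 and 1/3.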