biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

[ENH] Show IDF results in a Word Cloud #401

Closed benel closed 4 years ago

benel commented 5 years ago
Text version

0.5.2

Orange version

3.16

Expected behavior

My aim is to show TF-IDF in action to my students. When I connect a corpus to a Bag of Words and the Bag of Words to a Data Table (or, even better, to a Word Cloud), I would expect that changing the document frequency parameter in the Bag of Words from "none" to "IDF" would change the result. It should hide words that are common in the language, like "the" (similar to stop-word preprocessing), but also words that are common in the corpus, like "queen" in a corpus of tales.
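To make the expectation concrete, here is a minimal sketch in plain Python, not Orange code. It assumes the textbook formula `idf = log(N / df)` and a made-up three-document "tales" corpus; the names `docs`, `idf`, and `tfidf` are hypothetical, and the formula used by the Bag of Words widget may differ (for example, it may be smoothed).

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized "tales" (illustration only).
docs = [
    "the queen met the king".split(),
    "the queen left the castle".split(),
    "the queen saw the dragon".split(),
]

def idf(term, docs):
    # Textbook IDF (an assumption, not necessarily Orange's formula):
    # log of (number of documents / number of documents containing the term).
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(doc, docs):
    # Raw term count multiplied by the term's IDF.
    return {term: count * idf(term, docs) for term, count in Counter(doc).items()}

counts = Counter(docs[0])       # plain term frequencies for the first tale
weights = tfidf(docs[0], docs)  # the same tale with IDF applied

for term in counts:
    print(f"{term:6s} count={counts[term]} tfidf={weights[term]:.3f}")
```

With plain counts, "the" is the top word of the first tale. With IDF applied, "the" and "queen" drop to zero because they occur in every document, while "king" and "met" keep a positive weight.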

Actual behavior

Changing the parameter doesn't seem to change anything in the result. @ajdapretnar explained the following in a related ticket (biolab/orange3#3426):

(...) for the Data Table, you should definitely see the changes when using the IDF transformation. Word Cloud, however, is currently implemented in a way that it shows frequent tokens, that are a separate property from a table, which is constructed from a bag of words. That said, your idea sounds interesting, since I cannot think of a good way to sort words by IDF frequencies. Could you perhaps open a feature request on our issue tracker?

ajdapretnar commented 5 years ago

Thank you for opening this. I normally show IDF in a Data Table, as seen below. But you make a good point. Having a hidden token attribute is a bit confusing for users, and showing this in a Word Cloud could have nice educational value.

IDF in action, even though in a slightly confusing sparse format:

[Screenshot: screen shot 2018-11-29 at 10 32 17]

PrimozGodec commented 4 years ago

This is fixed in #486. The Word Cloud now shows the bag-of-words weights.