biolab / orange3-text

🍊 Text Mining add-on for Orange3

Add option to remove POS tagging before input to word cloud #991

Open wvdvegte opened 1 year ago

wvdvegte commented 1 year ago

Is your feature request related to a problem? Please describe.

In a workflow where I applied POS tagging (to allow selecting, for instance, just nouns and verbs), followed by Bag of Words, Distances and Hierarchical Clustering, and then visualized the clusters in Word Cloud, the word cloud shows all words with their POS tags, and words that occur with different tags are shown multiple times:

[screenshot: Word Cloud showing words with their POS tags]

Instead, I would like to be able to see each word in Word Cloud only once, without POS tags. Unlike Bag of Words, widgets with similar functionality, such as Document Embedding or Similarity Hashing, do not produce output with POS tags.

Describe the solution you'd like

I think there are different options:

Describe alternatives you've considered

Couldn't find any.

wvdvegte commented 1 year ago

Two small corrections:

PrimozGodec commented 1 year ago

Thank you for the report. I think we should discuss the best solution to this issue internally. Is there any other situation where you would like to have POS tags and then have them removed later, besides the following two:

wvdvegte commented 1 year ago

Yes, I assume the POS tags (if present) make a difference not only in filtering but in any type of analysis (classification, clustering, network analysis, ...), but I'd like to have the choice not to show them in any type of visualization - not only Word Cloud but also, for instance, Annotated Corpus Map and even in Data Table. There, I think it also makes sense to merge different 'versions' of a word, like 'practitioner' in my screenshot above.
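The merging described above can be sketched in a few lines of Python. This is only an illustration, not Orange's API: it assumes tagged tokens are plain strings of the form `word_TAG` (for example `practitioner_NN`) with an upper-case tag, and the helper names `strip_pos_tag` and `merge_tagged_counts` are hypothetical.

```python
from collections import Counter


def strip_pos_tag(token, sep="_"):
    """Return the token without a trailing POS tag (assumed 'word_TAG' format)."""
    word, found, tag = token.rpartition(sep)
    # Treat the suffix as a tag only if it is upper-case (e.g. NN, VBZ, PRP$),
    # so ordinary tokens such as 'e_mail' are left untouched.
    if found and word and tag.isupper():
        return word
    return token


def merge_tagged_counts(counts, sep="_"):
    """Sum the counts of tokens that differ only in their POS tag."""
    merged = Counter()
    for token, n in counts.items():
        merged[strip_pos_tag(token, sep)] += n
    return merged


# Hypothetical counts, as a Bag of Words step might produce for tagged tokens.
tagged = {"practitioner_NN": 3, "practitioner_JJ": 2, "design_VB": 1, "e_mail": 4}
print(merge_tagged_counts(tagged))
```

The analysis steps (clustering, classification) would still see the tagged tokens; only the visualization would consume the merged counts.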

wvdvegte commented 1 year ago

BTW, Annotated Corpus Map is clustering and visualization in one widget. It seems to make sense to consider the POS tags for clustering but not for the visualization.

ajdapretnar commented 2 months ago

This is a bit of a stale issue, but I gave it some thought. Word Cloud no longer shows POS tags. However, it does not merge two words with the same name into one. I propose adding an option to remove POS tags in Preprocess Text. It makes the most sense to me. That said, where in Preprocess Text? As a final option in POS Tagger? As in "POS tag or remove any tags"?

wvdvegte commented 2 months ago

I agree this could best be added to Preprocess Text. However, if you add it to POS Tagger, you have to activate POS Tagger twice: once before and once after Filtering. Perhaps it makes more sense as a final option in Filtering, where the current final option is filtering based on POS tags?
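The ordering argued for here (tag first, filter on the tags, then drop them as the last filtering step) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Preprocess Text implementation: tokens are assumed to arrive as `(word, tag)` pairs with Penn Treebank style tags, and `filter_then_strip` is a hypothetical name.

```python
def filter_then_strip(tagged_tokens, keep_prefixes=("NN", "VB")):
    """Keep tokens whose tag starts with one of keep_prefixes, then drop the tags.

    tagged_tokens: list of (word, tag) pairs, assumed to come from a POS tagger.
    """
    # Step 1: filtering still needs the tags, so it runs while they are present.
    kept = [(word, tag) for word, tag in tagged_tokens if tag.startswith(keep_prefixes)]
    # Step 2: removing the tags is the final step, so the tagger runs only once.
    return [word for word, _ in kept]


# Hypothetical tagger output for a short sentence.
tokens = [("the", "DT"), ("practitioner", "NN"), ("designs", "VBZ"), ("quickly", "RB")]
print(filter_then_strip(tokens))
```

Placing the removal after the POS-based filter avoids having to activate the tagger a second time.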

ajdapretnar commented 2 months ago

Duh, how did this not occur to me? 🤦‍♀️ Filtering it is.