biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

Annotated corpus map: add output option with keywords (cluster labels) per cluster #997

Open wvdvegte opened 1 year ago

wvdvegte commented 1 year ago

Is your feature request related to a problem? Please describe.

After generating clusters with Annotated Corpus Map, I'd like to create a table describing the clusters, e.g. average year of publication, most frequently occurring publisher, etc., and also the words that are typical for the documents in each cluster (the cluster labels in Annotated Corpus Map). I can do most of this with Group By, but there is no way to extract the characteristic words per cluster directly.

Describe the solution you'd like

Add an output option to Annotated Corpus Map with keywords (cluster labels) per cluster, preferably with a user-definable maximum (not just the 5 that are produced when cranking up the 'Cluster labels' slider).

Describe alternatives you've considered

Say I have 10 clusters. I could use Select Rows to select a cluster, then run Extract Keywords on it. I would have to do this 10 times in parallel, then Concatenate and Group By Source ID to get an overview of the words per cluster, which I could merge with the other grouped data per cluster. But this gives me slightly different keywords, and the replication needed to treat each cluster in parallel makes this a cumbersome workaround, especially because it makes it hard to vary the number of clusters (which requires adding or removing parallel branches of Select Rows -> Extract Keywords).
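
For illustration, the per-cluster branches described above could in principle be replaced by a single loop, for example in a Python Script widget or outside Orange. The following is only a minimal sketch under stated assumptions: `keywords_per_cluster`, its parameters, and the toy documents are hypothetical, and the scoring is a plain cluster-level TF-IDF used as a stand-in, not the method used by Annotated Corpus Map or Extract Keywords.

```python
import math
from collections import Counter, defaultdict


def keywords_per_cluster(docs, labels, max_words=10):
    """Return {cluster: [word, ...]} with up to max_words words per cluster.

    Scoring is a plain cluster-level TF-IDF, used here only as a stand-in;
    it is not the method used by Annotated Corpus Map or Extract Keywords.
    """
    # Pool the tokens of all documents that share a cluster label.
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        counts[label].update(text.lower().split())

    # Number of clusters in which each word occurs at least once.
    cluster_freq = Counter()
    for counter in counts.values():
        cluster_freq.update(counter.keys())

    n_clusters = len(counts)
    result = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scored = {
            word: (n / total) * math.log(n_clusters / cluster_freq[word])
            for word, n in counter.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scored, key=lambda w: (-scored[w], w))
        result[label] = ranked[:max_words]
    return result


# Toy data (invented for this example): four documents in two clusters.
docs = [
    "gene protein cell",
    "protein cell membrane",
    "market price trade",
    "trade price tariff",
]
labels = ["C1", "C1", "C2", "C2"]
top = keywords_per_cluster(docs, labels, max_words=2)
```

Because the loop runs over whatever cluster labels are present, changing the number of clusters needs no structural change, and `max_words` plays the role of the user-definable maximum requested above.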

wvdvegte commented 1 year ago

I noticed that the latest version now has a Scores output, which at least gives access to the most typical words per cluster. However, the only way to get them seems to be to show the output in a Data Table and then sort descending by the column Score(Cn). This still has to be repeated manually for each cluster if I want listings per cluster. Also, the order of words by highest score differs from the order shown in the cluster labels in the visualization, which makes me wonder how the order in the cluster label is determined. In my dataset, there are two words with the same score and different p-values, but the one with the higher p-value (less significant) comes first in the cluster label...
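
As a sketch of how the manual sorting could be automated, the snippet below takes the Scores table as a list of row dictionaries and returns the top words for every `Score(Cn)` column in one pass. Only the `Score(Cn)` column pattern comes from the description above; the `Word` column name, the function `top_words_per_cluster`, and the sample values are assumptions made for illustration. Ties are broken alphabetically here, which is an arbitrary choice and not necessarily how the widget orders its cluster labels.

```python
import re


def top_words_per_cluster(rows, max_words=5):
    """Return {cluster: [(word, score), ...]} sorted by descending score."""
    if not rows:
        return {}
    # Find columns named like "Score(C1)" and pull out the cluster name.
    pattern = re.compile(r"^Score\((.+)\)$")
    clusters = {}
    for column in rows[0]:
        match = pattern.match(column)
        if match:
            clusters[match.group(1)] = column

    result = {}
    for cluster, column in clusters.items():
        # Descending score; alphabetical order breaks ties (an assumption).
        ranked = sorted(rows, key=lambda r: (-r[column], r["Word"]))
        result[cluster] = [(r["Word"], r[column]) for r in ranked[:max_words]]
    return result


# Invented sample rows; the "Word" column name is hypothetical.
rows = [
    {"Word": "protein", "Score(C1)": 0.9, "Score(C2)": 0.1},
    {"Word": "price", "Score(C1)": 0.0, "Score(C2)": 0.8},
    {"Word": "cell", "Score(C1)": 0.7, "Score(C2)": 0.2},
]
listing = top_words_per_cluster(rows, max_words=2)
```

The result is one ranked listing per cluster, so no per-cluster sorting in a Data Table is needed.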