biolab / orange3-text

Text Mining add-on for Orange3

Annotated corpus map: provide corrected instead of uncorrected p-value in Scores output #1077

Open wvdvegte opened 3 months ago

wvdvegte commented 3 months ago

Is your feature request related to a problem? Please describe.
This relates to #997. There's no way (at least not in Orange) to get a table with the keywords per cluster. They can only be obtained from the graphical representation in the Annotated Corpus Map. Sometimes you'd like to present the characteristic keywords per cluster in a table, for instance together with other information about the cluster that can easily be obtained with Group By > Cluster (e.g., the number of documents in each cluster). The help information says "FDR Threshold sets the threshold for selecting a keyword as a cluster's keyword", but only the uncorrected p-values per word are available in the Scores output.

Describe the solution you'd like
Replace the uncorrected p-values with the corrected ones. Or, even better, provide some form of output that gives the keywords per cluster right away (e.g., a table with the columns Cluster, Keyword 1, Keyword 2, ..., Keyword 5). Even with the corrected p-values, I don't see a way of getting such a table, especially if the number of clusters isn't kept constant (for example, when varying the number of clusters to see the effects).
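For illustration, the requested table could be sketched in plain Python as below. This is only a minimal sketch: the thread does not show the actual layout of the Scores output, so the `(cluster, word, fdr)` row layout, the function name `keywords_per_cluster`, the `<=` comparison against the threshold, and the sample values are all assumptions rather than anything taken from Orange.

```python
def keywords_per_cluster(rows, fdr_threshold=0.05, n_keywords=5):
    """Build a Cluster / Keyword 1..n table from (cluster, word, fdr) rows.

    The row layout is hypothetical; adapt it to the real Scores output.
    """
    by_cluster = {}
    for cluster, word, fdr in rows:
        # Register every cluster, even if none of its words pass the threshold.
        selected = by_cluster.setdefault(cluster, [])
        if fdr <= fdr_threshold:
            selected.append((fdr, word))

    header = ["Cluster"] + [f"Keyword {i}" for i in range(1, n_keywords + 1)]
    table = [header]
    for cluster in sorted(by_cluster):
        # Lowest FDR first, keep at most n_keywords words.
        words = [word for _, word in sorted(by_cluster[cluster])[:n_keywords]]
        # Pad with empty cells so every row has the same width.
        words += [""] * (n_keywords - len(words))
        table.append([cluster] + words)
    return table


# Made-up sample rows, purely for demonstration.
rows = [
    ("C1", "gene", 0.001),
    ("C1", "protein", 0.02),
    ("C1", "the", 0.9),
    ("C2", "market", 0.004),
    ("C2", "price", 0.03),
    ("C3", "and", 0.7),
]
table = keywords_per_cluster(rows, n_keywords=2)
for row in table:
    print(row)
```

Because the rows are grouped by whatever cluster labels appear in the input, the sketch does not depend on a fixed number of clusters.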

Describe alternatives you've considered
None are known to me.

wvdvegte commented 3 months ago

I have now realized that the p-values in the Scores output are in fact the corrected ones, a.k.a. FDRs. It remains confusing, though: why refer to FDR in the menu where the threshold is set, but simply call them p-values in the Scores output? I suggest using the same term (either 'FDR' or 'corrected p-value') wherever the same number is meant.
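To make the difference between an uncorrected p-value and an FDR-corrected one concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, a common way of computing FDR-adjusted p-values. The thread does not state which correction the widget applies, so treating it as Benjamini-Hochberg is an assumption, and the function name and sample values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so that the
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, purely for demonstration.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"p-value = {p:.3f}  corrected (FDR) = {q:.3f}")
```

Under this procedure a corrected value is never smaller than the raw p-value it comes from, which is why it matters which of the two a column labelled "p-value" actually holds.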