Is your feature request related to a problem? Please describe.
This suggestion is related to #997 and #1077.
The current Scores output of Annotated Corpus Map gives some access to the most typical words per cluster. However, the only way to extract the most typical words per cluster seems to be to show them in a Data Table and then sort descending by the column Score(Cn). If I want listings per cluster, this has to be repeated manually for each cluster.
Describe the solution you'd like
An output like this would be more useful, at least for me:
It would allow to extract the most meaningful words per cluster using common widgets - for instance: Select Rows to filter for highest-ranked scores per cluster, and Group By to get a concatenated list of characteristic words per cluster and averages of Score and p-value:
The first table screenshot is already filtered with Select Rows to show only the top-10 per cluster, but it would make sense to keep everything above a certain Score (fixed or user-definable) and p-value (which is the FDR threshold that is user-definable already)
Describe alternatives you've considered
The attached workflow contains a Python Script that transforms the current Scores output to the suggested output, with the above additional steps added downstream.
Note: It seems not to be uncommon that multiple words in a cluster have the same score. If that is the case, it would make sense to rank them from lowest to highest p-value (as in the first screenshot). However, I noticed that in the cluster labels of the Annotated Corpus Map visualization, words with the same score are not ranked based on p-value.
Is your feature request related to a problem? Please describe. This suggestion is related to #997 and #1077. The current Scores output of Annotated Corpus Map gives some access to the most typical words per cluster. However, the only way to extract the most typical words per cluster seems to be to show them in a Data Table and then sort descending by the column Score(Cn). If I want listings per cluster, this has to be repeated manually for each cluster.
Describe the solution you'd like An output like this would be more useful, at least for me:
It would allow to extract the most meaningful words per cluster using common widgets - for instance: Select Rows to filter for highest-ranked scores per cluster, and Group By to get a concatenated list of characteristic words per cluster and averages of Score and p-value:
The first table screenshot is already filtered with Select Rows to show only the top-10 per cluster, but it would make sense to keep everything above a certain Score (fixed or user-definable) and p-value (which is the FDR threshold that is user-definable already)
Describe alternatives you've considered The attached workflow contains a Python Script that transforms the current Scores output to the suggested output, with the above additional steps added downstream.
Sorted output from Annotated Corpus Map 2.ows.zip
Note: It seems not to be uncommon that multiple words in a cluster have the same score. If that is the case, it would make sense to rank them from lowest to highest p-value (as in the first screenshot). However, I noticed that in the cluster labels of the Annotated Corpus Map visualization, words with the same score are not ranked based on p-value.