melvinwevers opened this issue 8 years ago
For my understanding: are you trying to solve the same problem as @PimHuijnen in #69? If not, what is the difference?
I think @PimHuijnen wants a different way to generate wordclouds.
I would like to have some linguistic information on the saved queries. So just a number that says how many words there are within a query.
Still, I think both require the same calculation, namely how many words there are in a query.
Second part of this reply extracted to #79 by @jgonggrijp
When you say "how many words there are within a query", do you mean the number of terms in the query itself, or the total number of words in the documents that the query returns?
And when you say "normalized frequency of a word within a collection", I presume that you divide one number by another. What would be the numerator and what would be the denominator?
The total number of words (raw word count) found within the documents belonging to a saved query.
I would like to know the relative occurrence of a word within a collection, based on a query.
So, if I query Vietnam AND Soviet and this yields 300 documents, I would like to know how many words there are in these 300 documents; let's say 3,000.
Then I would like to know how often America appears in this subset of 300 documents; let's say 15 times.
The frequency would then be 15/3000 = 0.005.
This allows me to compare the relative frequency of words within particular corpora.
Ok, clear!
ElasticSearch can provide the number of words (i.e. tokens) per document, I think: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#token_count (see this issue as well).
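For reference, a minimal sketch of how such a token_count sub-field might be declared, assuming the ElasticSearch 1.7/2.x mapping API via elasticsearch-py; the index name "times", type "article" and field "text" are hypothetical placeholders, not the project's actual mapping.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Add a token_count sub-field so that each document stores its own word count.
mapping = {
    "properties": {
        "text": {
            "type": "string",  # "string" in ES 1.7/2.x; "text" in ES 5+
            "fields": {
                "word_count": {
                    "type": "token_count",
                    "analyzer": "standard"
                }
            }
        }
    }
}

# Existing documents need to be reindexed before text.word_count is populated.
es.indices.put_mapping(index="times", doc_type="article", body=mapping)
```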
Total number of occurrences of a term can be found in the word cloud, or am I missing something?
It gives the occurrences per term within the sub-collection/saved query, but only for words that also appear in the word cloud; you cannot query for a particular word.
Note to self: the link above is broken; in ElasticSearch 2.0 this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-count.html.
In the 1.7 branch this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html#token_count
It would be useful to provide the user with some additional linguistic information.
This would enable the user to calculate (or have the computer calculate) the normalized frequency of a word within the sub-collection (or the entire collection).
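To make the calculation concrete, here is a minimal sketch of how it might be wired up against ElasticSearch, assuming the token_count sub-field from the mapping sketch above and the elasticsearch-py client; the index and field names ("times", "text.word_count") are illustrative assumptions, and step 2 approximates the per-term count with a document count, since ElasticSearch does not directly expose total term occurrences for an arbitrary word.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# The saved query from the example: Vietnam AND Soviet.
saved_query = {"query_string": {"query": "Vietnam AND Soviet", "fields": ["text"]}}

# 1. Total number of words in all matching documents, via a sum aggregation
#    over the (hypothetical) token_count sub-field text.word_count.
response = es.search(index="times", body={
    "query": saved_query,
    "size": 0,
    "aggs": {"total_words": {"sum": {"field": "text.word_count"}}},
})
total_words = response["aggregations"]["total_words"]["value"]  # e.g. 3000

# 2. Occurrences of a particular word in the subset. Approximated here by the
#    number of matching documents, which undercounts words occurring more than
#    once in a single document; exact counts would need term vectors or the
#    word-cloud data mentioned above.
occurrences = es.count(index="times", body={
    "query": {"bool": {"must": [saved_query, {"match": {"text": "America"}}]}},
})["count"]  # e.g. 15

# 3. Normalized frequency, as in the 15 / 3000 example.
print(occurrences / float(total_words))  # 0.005 with the example numbers
```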