Add additional linguistic information to saved queries

UUDigitalHumanitieslab / texcavator

Text mining on the Royal Library newspaper corpus

http://texcavator.surfsaralabs.nl

Apache License 2.0

Add additional linguistic information to saved queries #74

Open melvinwevers opened 8 years ago

melvinwevers commented 8 years ago

It would be useful to provide the user with some additional linguistic information.

The number of words (tokens) in a query (corpus)
The option to count the instances of one particular keyword within the corpus (this would perhaps require an additional search window)

This enables the user to calculate (or have the computer calculate) the normalized frequency of a word within the sub-collection (or entire collection)

jgonggrijp commented 8 years ago

For my understanding: are you trying to solve the same problem as @PimHuijnen in #69? If not, what is the difference?

melvinwevers commented 8 years ago

I think @PimHuijnen wants a different way to generate wordclouds.

I would like to have some linguistic information on the saved queries. So just a number that says how many words there are within a query.

Still, I think both require the same calculation, namely how many words there are in a query.

Second part of this reply extracted to #79 by @jgonggrijp

jgonggrijp commented 8 years ago

When you say "how many words there are within a query", do you mean

the total length (raw word count) of all matches combined, or
the total number of unique words (repetitions not counted) across all the matches, or
the total number of occurrences of the search terms in the matches, or
something else?
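To make the difference between the first three readings concrete, here is a minimal Python sketch. The function `query_counts`, the simple `\w+` tokenizer, and the two sample texts are all made up for illustration; Texcavator's own tokenization may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def query_counts(matches, search_terms):
    """Compute the three candidate counts for a list of matching texts."""
    tokens = [token for text in matches for token in tokenize(text)]
    terms = {term.lower() for term in search_terms}
    return {
        # 1. total length of all matches combined
        "raw_word_count": len(tokens),
        # 2. unique words, repetitions not counted
        "unique_words": len(set(tokens)),
        # 3. occurrences of the search terms themselves
        "search_term_occurrences": sum(1 for token in tokens if token in terms),
    }


# Hypothetical matches for the query "Vietnam AND Soviet"
matches = ["Vietnam and the Soviet Union", "The Soviet Union and Vietnam"]
counts = query_counts(matches, ["Vietnam", "Soviet"])
print(counts)
```

For these two sample texts the three readings give different numbers (10, 5 and 4), which is why the distinction matters.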

And when you say "normalized frequency of a word within a collection", I presume that you divide one number by another. What would be the numerator and what would be the denominator?

melvinwevers commented 8 years ago

The total number of words (raw word count) found within the documents belonging to a saved query.

I would like to know the relative occurrence of a word within a collection based on a query.

So, if I query Vietnam AND Soviet and this yields 300 documents, I would like to know how many words there are in these 300 documents, let's say 3000.

Then I would like to know how often America appears in this subset of 300 documents, let's say 15 times.

Then this frequency would be: 15/3000

This allows me to compare the relative frequency of words within particular corpora.
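As a rough illustration of this calculation, here is a minimal Python sketch. The helper `relative_frequency`, the naive `\w+` tokenizer, and the two sample sentences are assumptions for the example only; they are not part of Texcavator.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return TOKEN_RE.findall(text.lower())


def relative_frequency(documents, keyword):
    """Return (keyword count, total tokens, count / total) for a list of texts."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    total = sum(counts.values())
    hits = counts[keyword.lower()]
    return hits, total, (hits / total if total else 0.0)


# Hypothetical stand-ins for the documents returned by a saved query
docs = [
    "America and the Soviet Union discussed Vietnam",
    "The Soviet position on Vietnam angered America",
]
hits, total, freq = relative_frequency(docs, "America")
print(hits, total, freq)

# The numbers from the example above: 15 occurrences in 3000 words
example_freq = 15 / 3000
print(example_freq)
```

With the numbers from the example, 15/3000 comes out at 0.005, i.e. 5 occurrences per 1000 words, which can then be compared across sub-collections of different sizes.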

jgonggrijp commented 8 years ago

Ok, clear!

mhkuu commented 8 years ago

ElasticSearch can provide the number of words (i.e. tokens) per document, I think: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#token_count (see this issue as well).
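A sketch of what that could look like, written as plain Python dictionaries. The field name `text`, the sub-field name `length`, and the helper `total_words_body` are hypothetical, and this has not been tried against the Texcavator index; it only assumes the `token_count` field type and the `sum` aggregation described in the ElasticSearch documentation.

```python
# Mapping fragment: add a token_count sub-field to the (hypothetical) "text" field.
mapping = {
    "properties": {
        "text": {
            "type": "string",  # called "text" in ElasticSearch 5.0 and later
            "fields": {
                "length": {
                    "type": "token_count",
                    "analyzer": "standard",
                }
            },
        }
    }
}


def total_words_body(query):
    """Build a search body that sums the per-document token counts for `query`."""
    return {
        "query": query,
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {"total_words": {"sum": {"field": "text.length"}}},
    }


body = total_words_body({"query_string": {"query": "Vietnam AND Soviet"}})
print(body)
```

If this works as assumed, the `total_words` value in the response would be the raw word count for the documents matching the saved query. Existing documents would need to be reindexed before the new sub-field is populated.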

Total number of occurrences of a term can be found in the word cloud, or am I missing something?

melvinwevers commented 8 years ago

It gives the occurrences per term within the sub-collection/saved query, but only if the word also appears in the word cloud. You cannot query for a particular word.

mhkuu commented 8 years ago

Note to self: the link above is broken; in ElasticSearch 2.0 this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-count.html.

In the 1.7 branch this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html#token_count