UUDigitalHumanitieslab / texcavator

Text mining on the Royal Library newspaper corpus
http://texcavator.surfsaralabs.nl
Apache License 2.0

Add additional linguistic information to saved queries #74

Open melvinwevers opened 8 years ago

melvinwevers commented 8 years ago

It would be useful to provide the user with some additional linguistic information.

This would let the user calculate (or have the computer calculate) the normalized frequency of a word within the sub-collection (or the entire collection).

jgonggrijp commented 8 years ago

For my understanding: are you trying to solve the same problem as @PimHuijnen in #69? If not, what is the difference?

melvinwevers commented 8 years ago

I think @PimHuijnen wants a different way to generate wordclouds.

I would like to have some linguistic information on the saved queries. So just a number that says how many words there are within a query.

Still, I think both require the same calculation, namely how many words there are in a query.

Second part of this reply extracted to #79 by @jgonggrijp

jgonggrijp commented 8 years ago

When you say "how many words there are within a query", do you mean

And when you say "normalized frequency of a word within a collection", I presume that you divide one number by another. What would be the numerator and what would be the denominator?

melvinwevers commented 8 years ago

The total amount of words (raw word count) found within the documents belonging to a saved query.

I would like to know the relative occurrence of a word within a collection defined by a query.

So, if I query Vietnam AND Soviet and this yields 300 documents, I would like to know how many words there are in these 300 documents. Let's say 3000.

Then I would like to know how often America appears in this subset of 300 documents. Let's say 15 times.

Then the relative frequency would be 15/3000 = 0.005.

This allows me to compare the relative frequency of words within particular corpora.
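The calculation described above can be sketched in a few lines of Python. This is only an illustration: the `tokenize` and `relative_frequency` helpers and the sample documents are hypothetical and not part of Texcavator, and splitting on non-word characters is a simplifying assumption about what counts as a word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-word characters (an assumption)."""
    return re.findall(r"\w+", text.lower())


def relative_frequency(term, documents):
    """Return (occurrences, total_words, occurrences / total_words)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())      # denominator: raw word count of the subset
    hits = counts[term.lower()]       # numerator: occurrences of the term
    return hits, total, (hits / total if total else 0.0)


# Two made-up documents standing in for the results of a saved query.
docs = ["America and the Soviet Union", "Vietnam, America, Vietnam"]
hits, total, freq = relative_frequency("America", docs)
print(hits, total, freq)
```

With real data the denominator would be the word count of all documents matched by the saved query, so the same ratio can be compared across sub-collections of different sizes.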

jgonggrijp commented 8 years ago

Ok, clear!

mhkuu commented 8 years ago

ElasticSearch can provide the number of words (i.e. tokens) per document, I think: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#token_count (see this issue as well).
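A rough sketch of how that could look, written as Python dictionaries that would be sent to ElasticSearch as JSON. The field names `text` and `text.word_count` are hypothetical (the actual index may use other names), and the `string` type assumes the 1.x/2.x versions discussed here. The idea is to index a `token_count` sub-field and then sum it over the documents matching a query.

```python
import json

# Hypothetical mapping: "text" holds the article body, and the
# "word_count" sub-field stores the number of tokens in it.
MAPPING = {
    "properties": {
        "text": {
            "type": "string",  # ES 1.x/2.x type name; an assumption about the index
            "fields": {
                "word_count": {"type": "token_count", "analyzer": "standard"}
            },
        }
    }
}


def total_words_request(query_string):
    """Build a search body that sums the token counts of all matching documents."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {
            "query_string": {"query": query_string, "default_field": "text"}
        },
        "aggs": {"total_words": {"sum": {"field": "text.word_count"}}},
    }


request_body = total_words_request("Vietnam AND Soviet")
print(json.dumps(request_body, indent=2))
```

The sum returned under `total_words` would then serve as the denominator for the normalized frequency. Existing documents would need to be reindexed before the new sub-field is populated.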

Total number of occurrences of a term can be found in the word cloud, or am I missing something?

melvinwevers commented 8 years ago

It gives the occurrences per term within the sub-collection/saved query, but only if the word also appears in the word cloud. You cannot query for a particular word.

mhkuu commented 8 years ago

Note to self: the link above is broken; in ElasticSearch 2.0 this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-count.html.

In the 1.7 branch this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html#token_count