Setting preferences for rake keyword extraction

bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

https://bnosac.github.io/udpipe/en

Mozilla Public License 2.0


Setting preferences for rake keyword extraction #39

Closed fahadshery closed 5 years ago

fahadshery commented 5 years ago

deg(w)/freq(w) favours longer keywords and therefore results in extracted keywords that occur in fewer documents. I wish to extract keywords that are also referenced more often across the set of documents, which is more relevant to the business problem when analysing customer feedback. So how can we set RAKE to score by deg(w) in order to favour shorter keywords that occur across more feedbacks, i.e. that more people are talking about? Ideally, I want to capture all references to the extracted keywords. For example, can we get the referenced document frequency rdf(k), the number of feedbacks in which the keyword occurred as a candidate keyword, and the extracted document frequency edf(k), the number of feedbacks from which the keyword was extracted? We can then use edf(k) / rdf(k) to find out whether a keyword is exclusive or essential for that set of feedbacks, to inform the business to take action. Is there a way to get this included?

fahadshery commented 5 years ago

So can we get edf(w) and rdf(w) as separate columns as well? This will help in calculating the essentiality and generality of a keyword. Keywords that are both essential and general within a set of documents are essential within that corpus of feedbacks and will be more helpful for businesses to take action on.

jwijffels commented 5 years ago

The easiest way to achieve this with the current functions is to add bigrams/trigrams/n-grams to the data.frame and then check whether they are part of the keywords coming out of keywords_rake:

library(udpipe)
library(data.table)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
## RAKE keywords built from the lemmas of nouns and adjectives
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"), sep = "-")

## Add the bigram/trigram starting at each token, computed within each sentence
x <- setDT(x)
x[, bigram := txt_nextgram(lemma, n = 2, sep = "-"), by = list(doc_id, paragraph_id, sentence_id)]
x[, trigram := txt_nextgram(lemma, n = 3, sep = "-"), by = list(doc_id, paragraph_id, sentence_id)]
## Flag the n-grams which RAKE returned as a keyword
x$bigram_in_rakeset <- x$bigram %in% keywords$keyword
x$trigram_in_rakeset <- x$trigram %in% keywords$keyword

And next do whichever aggregation you like. Or use txt_recode_ngram to recode your terms to n-grams and next do whichever aggregation you like.

If you want to make the keywords_rake function return these 2 fields, you will need to intervene in the function keywords_rake itself (code at https://github.com/bnosac/udpipe/blob/master/R/nlp_rake.R#L48). Feel free to make a pull request.
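As an illustration of what "whichever aggregation you like" could mean here, the sketch below counts in how many documents each bigram keyword occurs and puts that count next to its RAKE score. It reuses the objects from the code above; the column names n_docs and n_occurrences are illustrative choices, not part of udpipe, and only two-word keywords are covered.

```r
library(udpipe)
library(data.table)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"), sep = "-")

x <- setDT(x)
x[, bigram := txt_nextgram(lemma, n = 2, sep = "-"),
  by = list(doc_id, paragraph_id, sentence_id)]

## Per bigram keyword: number of distinct documents and total occurrences
## (n_docs and n_occurrences are illustrative names)
doc_freq <- x[bigram %in% keywords$keyword,
              list(n_docs = uniqueN(doc_id), n_occurrences = .N),
              by = list(keyword = bigram)]

## Put the document counts next to the RAKE output and sort
doc_freq <- merge(doc_freq, keywords[, c("keyword", "freq", "rake")], by = "keyword")
doc_freq <- doc_freq[order(-n_docs, -rake)]
head(doc_freq)
```

Sorting on n_docs first and rake second is one way to surface keywords that many reviews mention; the same pattern should work for the trigram column.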

fahadshery commented 5 years ago

the easiest to achieve this with the current functions is to add bigrams/trigrams/n-grams to the data.frame and then look if they are part of the keywords coming out of keywords_rake

Sorry for being thick, but how does it help me calculate deg(w)? As I said, I am more interested in finding keywords which are high in frequency and high in RAKE score. This helps with assigning a priority score to focus on when creating a business strategy.

I am also unsure what you mean by:

and next do whichever aggregation you like.

after using txt_recode_ngram, i.e. what purpose does it serve, or what exactly are we trying to achieve?

Sorry if this sounds dumb

rdatasculptor commented 5 years ago

Doesn't the rake function give back the frequency (N) next to the rake score?

jwijffels commented 5 years ago

Good point, @rdatasculptor. freq is in the output as well as the rake score, so you can combine both as you wish.
Degree is calculated at the word level, not at the keyword level. If you want degree at the word level, just look at the code (https://github.com/bnosac/udpipe/blob/master/R/nlp_rake.R#L48); you can have it with 3 lines of code.
txt_recode_ngram recodes words to compound terms in your data.frame (see ?txt_recode_ngram) and next you can use that compound term to do any kind of aggregations (with dplyr/data.table) like frequency overall/frequency by doc/...
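To make the word-level degree concrete, here is a minimal base R sketch on made-up candidate keywords. It follows the definitions from the RAKE paper (freq(w) is the number of candidates containing the word, deg(w) is the summed length of those candidates) and is not a copy of the udpipe internals; the toy data and all variable names are assumptions for illustration.

```r
## Toy candidate keywords, each a vector of words (illustrative data only)
candidates <- list(
  c("customer", "service"),
  c("service"),
  c("slow", "customer", "service"),
  c("delivery")
)

words <- unlist(candidates)
## Each word carries the length of the candidate it appears in
lens <- rep(lengths(candidates), times = lengths(candidates))

freq   <- tapply(lens, words, length)  # freq(w): number of occurrences
degree <- tapply(lens, words, sum)     # deg(w): summed candidate lengths
ratio  <- setNames(as.numeric(degree / freq), names(freq))

word_scores <- data.frame(word = names(freq),
                          freq = as.integer(freq),
                          degree = as.integer(degree),
                          ratio = unname(ratio))

## Keyword score = sum of deg(w)/freq(w) over the words in the keyword
keyword_scores <- sapply(candidates, function(k) sum(ratio[k]))
names(keyword_scores) <- sapply(candidates, paste, collapse = "-")
word_scores
keyword_scores
```

In this toy set the three-word candidate gets the highest score, which shows why deg(w)/freq(w) favours longer keywords. Summing degree[k] or freq[k] instead of ratio[k] would be one way to try the alternative word metrics described in the paper.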

fahadshery commented 5 years ago

@jwijffels thank you so much for the detailed explanation and support. Ideally, I would like to calculate edf & rdf as explained earlier:

For example, can we get the referenced document frequency rdf(k), the number of feedbacks in which the keyword occurred as a candidate keyword, and the extracted document frequency edf(k), the number of feedbacks from which the keyword was extracted? We can then use edf(k) / rdf(k) to find out whether a keyword is exclusive or essential for that set of feedbacks, to inform the business to take action.

edf(k) / rdf(k) = 1, will indicate that they were extracted from every feedback in which they were referenced.

My last question is about the Rake score itself. I am trying to get my head around how to explain a rake score difference. For example, how to explain the difference between rake score let's say 2 and 3 or what's the significance/importance of a keyword whose rake score is 3 as compared to a keyword whose rake score is 2?
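The edf(k) / rdf(k) ratio described in this comment can be sketched in base R once there is a table with one row per feedback and candidate keyword, plus a flag saying whether the keyword was extracted from that feedback. keywords_rake does not return such a table; the cand data.frame below and its extracted column are hypothetical and would have to be built separately, for example by running RAKE per document.

```r
## Hypothetical input: one row per (feedback, candidate keyword)
cand <- data.frame(
  doc_id    = c(1, 1, 2, 2, 3, 3),
  keyword   = c("customer-service", "delivery", "customer-service",
                "price", "delivery", "price"),
  extracted = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

## rdf(k): number of feedbacks in which k occurred as a candidate
rdf <- tapply(cand$doc_id, cand$keyword, function(d) length(unique(d)))

## edf(k): number of feedbacks from which k was extracted
ext <- cand[cand$extracted, ]
edf <- tapply(ext$doc_id, factor(ext$keyword, levels = names(rdf)),
              function(d) length(unique(d)))
edf[is.na(edf)] <- 0  # keywords that were never extracted

essentiality <- data.frame(keyword = names(rdf),
                           rdf = as.integer(rdf),
                           edf = as.integer(edf),
                           ratio = as.numeric(edf / rdf))
essentiality
```

A ratio of 1 marks keywords that were extracted from every feedback in which they were referenced, as with "customer-service" in this toy data.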

jwijffels commented 5 years ago

@fahadshery

Feed your rake keywords into txt_recode_ngram to recode words to compound terms in your data.frame (see ?txt_recode_ngram) and next you can use that compound term to do any kind of aggregations (with dplyr/data.table) like frequency overall/frequency by doc/...
RAKE is defined in the paper https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents. If a keyword's RAKE score is high, it consists of words which occur more with other words. I would never subtract RAKE scores like 3-2.

Closing, as answers were given on how to reach your goal.