Thanks for reporting, I will try to investigate tomorrow.
Off-topic: I've noticed you use stemming with topic models. Wouldn't it be better to use lemmatization? In my experience it produces more interpretable topics. Another trick is to use only particular parts of speech like "NOUN", "ADJ", "ADV".
Take a look at the udpipe package, which annotates text nicely and supports many languages. Here is an example of a tokenizer which keeps only particular POS tags:
# requires the data.table and udpipe packages
library(data.table)
library(udpipe)
pos_lemma_tokenizer = function(x, udp_model, pos_keep = c("NOUN", "ADJ", "ADV")) {
  # annotate the texts and convert the result to a data.table
  res = as.data.frame(udpipe_annotate(udp_model, x = x, doc_id = seq_along(x), parser = "none"))
  setDT(res)
  temp = data.table(doc_id = seq_along(x))
  # keep only the requested parts of speech
  if (!is.null(pos_keep))
    res = res[upos %in% pos_keep, ]
  res = res[, .(doc_id = as.integer(doc_id), lemma = tolower(lemma))]
  # collapse lemmas into one token vector per document
  res = res[, .(tokens = list(lemma)), keyby = doc_id]
  # join back so that documents without any kept tokens are not dropped
  res = res[temp, on = .(doc_id = doc_id)]
  # replace missing entries with empty character vectors
  res[, tokens := lapply(tokens, function(t) if (is.null(t) || anyNA(t)) character(0) else t)]
  res$tokens
}
Or you can annotate the texts once (since it is time consuming) and then play with different POS combinations, for example as in the sketch below.
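A minimal sketch of the "annotate once" variant (assuming the same udp_model object and a character vector docs; the helper name tokens_for_pos is just for illustration): do the expensive annotation a single time, keep the full annotation table, and derive token lists for different POS subsets from it without re-annotating.
library(data.table)
library(udpipe)
# annotate once (the expensive step) and keep the full annotation table
annotation = as.data.frame(udpipe_annotate(udp_model, x = docs, doc_id = seq_along(docs), parser = "none"))
setDT(annotation)
# cheap step: filter the stored annotation for any POS combination
tokens_for_pos = function(annotation, pos_keep) {
  res = annotation[upos %in% pos_keep,
                   .(tokens = list(tolower(lemma))),
                   keyby = .(doc_id = as.integer(doc_id))]
  # note: documents without any kept tokens are dropped here
  # (see the join against `temp` in the tokenizer above for how to keep them)
  res$tokens
}
tokens_noun_adj     = tokens_for_pos(annotation, c("NOUN", "ADJ"))
tokens_noun_adj_adv = tokens_for_pos(annotation, c("NOUN", "ADJ", "ADV"))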
If it helps, please consider adding an article to text2vec.org via a PR - all articles are here :-)
Thank you for your quick reply. Regarding the first issue: I will try to reproduce the behaviour with movie_review tomorrow, so that the issue is easier to investigate/follow (I just thought you might have an idea right away).
What follows is off-topic: Thank you very much for your tips and the code snippet, I really appreciate this. I have also just realized that stemming might be too aggressive and that an approach via POS tagging might be more fruitful (I am learning step by step on my own, so I sometimes head in the "wrong" direction). I will play around with a POS tagging approach and hope my machine does not let me down.
I am definitely going to share my study/article to the extent possible. Since it's for my PhD, I have to select a suitable journal. Professors might prefer journals with a higher impact factor, which may, however, come with copyright restrictions. Maybe I can get funding for open-access publication. I am also currently discussing with Elsevier whether I can share the data set (a Scopus download). Maybe it works out. I will keep you updated and share as much as possible.
As promised I have reproduced the behaviour with movie_review - see example below.
When searching for the phrases "SECRET OF" / "OF KELLS" in the docs via negative lookahead/lookbehind, it appears that both phrases only occur as part of the 3-gram "SECRET OF KELLS" (doc_count = 2, term_count = 8), never individually. However, the vocabulary includes "SECRET_OF_KELLS" with doc_count = 2 and term_count = 7, and in addition "SECRET_OF" with doc_count = 1 and term_count = 1.
I hope there is no error in the logic of my check. I have just started trying to understand the collocation algorithm, so maybe the results are fine from the statistical perspective of hierarchical learning. Intuitively, though, I would have expected only the 3-gram collocation in the vocabulary.
library(text2vec)
packageVersion("text2vec")
#[1] '0.5.0.10'
docs <- movie_review$review
cc_model <- Collocations$new(collocation_count_min = 5
,pmi_min = 1
,gensim_min = 0
,llr_min = 0
,lfmd_min = -25
)
iterator <- itoken(docs, progressbar = F)
cc_model$fit(iterator, n_iter = 5)
# INFO [2018-01-09 08:43:27] iteration 1 - found 2358 collocations
# INFO [2018-01-09 08:43:29] iteration 2 - found 3396 collocations
# INFO [2018-01-09 08:43:31] iteration 3 - found 3601 collocations
# INFO [2018-01-09 08:43:33] iteration 4 - found 3615 collocations
# INFO [2018-01-09 08:43:35] iteration 5 - found 3615 collocations
cc_stat <- cc_model$collocation_stat
iterator <- cc_model$transform(iterator)
vocabulary <- create_vocabulary(iterator)
vocabulary_subset_cc <- vocabulary[grep("_", vocabulary$term),]
# term term_count doc_count
# 1: _tried 1 1
# 2: Eyre_ 1 1
# 3: SECRET_OF 1 1
# 4: G_d 1 1
# 5: Bug's_Life 1 1
# ---
# 3573: is_a 1221 985
# 3574: on_the 1419 1062
# 3575: in_a 1516 1167
# 3576: of_the 3748 2061
# 3577: in_the 3793 2230
cc_stat[unique(c(grep("^SECRET$|^SECRET_OF", cc_stat$prefix),
grep("^OF_KELLS|^KELLS", cc_stat$suffix)))
,]
# prefix suffix n_i n_j n_ij pmi lfmd gensim llr rank_pmi rank_lfmd rank_gensim rank_llr
# 1: SECRET_OF KELLS 8 10 7 16.23807 -17.65250 22083.975 160.1924 40 97 29 1971
# 2: SECRET OF 8 76 8 13.77194 -20.26780 5245.648 153.6090 274 543 166 2072
# 3: OF KELLS 76 10 7 13.25737 -21.16766 2797.679 122.0923 334 766 235 2547
vocabulary[grep("SECRET|KELLS", vocabulary$term),]
# term term_count doc_count
# 1: SECRET_OF 1 1
# 2: KELLS. 1 1
# 3: KELLS 3 1
# 4: SECRET_OF_KELLS 7 2
#results of "manual" term_count via stri_count and doc_count via sum(grepl()) (see detailed commands below)
# term term_count doc_count
# SECRET_OF_KELLS 8 2
# SECRET_OF 8 2
# OF_KELLS 8 2
# SECRET 8 2
# KELLS 11 2
# OF 76 52
#additional check via negative lookbehind
#shows that there is no occurrence of "OF KELLS" not preceded by "SECRET"
sum(stringi::stri_count_regex(docs, "(?<!\\bSECRET )OF KELLS\\b"))
#[1] 0
sum(stringi::stri_count_regex(docs, "\\bSECRET OF (?!KELLS)"))
#[1] 0
#-------------------------------detailed commands to get term/doc counts manually
#TERM_COUNT of individual terms/collocations
#3-gram
sum(stringi::stri_count_regex(docs, "\\bSECRET OF KELLS\\b"))
#[1] 8
#2-gram
sum(stringi::stri_count_regex(docs, "\\bSECRET OF\\b"))
#[1] 8
sum(stringi::stri_count_regex(docs, "\\bOF KELLS\\b"))
#[1] 8
#1-gram
sum(stringi::stri_count_regex(docs, "\\bSECRET\\b"))
#[1] 8
sum(stringi::stri_count_regex(docs, "\\bKELLS\\b"))
#[1] 11
sum(stringi::stri_count_regex(docs, "\\bOF\\b"))
# [1] 76
#DOC_COUNT of individual terms/collocations
#3-gram
sum(grepl("\\bSECRET OF KELLS\\b", docs, perl = T))
#[1] 2
#2-gram
sum(grepl("\\bSECRET OF\\b", docs, perl = T))
#[1] 2
sum(grepl("\\bOF KELLS\\b", docs, perl = T))
#[1] 2
#1-gram
sum(grepl("\\bSECRET\\b", docs, perl = T))
#[1] 2
sum(grepl("\\bKELLS\\b", docs, perl = T))
#[1] 2
sum(grepl("\\bOF\\b", docs, perl = T))
#[1] 52
I think the issue is that you use regexes with word-boundary patterns when you count collocations manually, whereas the Collocations model uses a simple whitespace tokenizer. In the last example the problem is in document number 1195 with id = "11664_9".
library(text2vec)
packageVersion("text2vec")
#[1] '0.5.0.10'
docs <- movie_review$review
cc_model <- Collocations$new(collocation_count_min = 5
,pmi_min = 1
,gensim_min = 0
,llr_min = 0
,lfmd_min = -25
)
iterator <- itoken(docs, progressbar = F)
cc_model$fit(iterator, n_iter = 5)
cc_stat <- cc_model$collocation_stat
iterator <- cc_model$transform(iterator)
vocabulary <- create_vocabulary(iterator)
vocabulary_subset_cc <- vocabulary[grep("_", vocabulary$term), ]
cc_stat[unique(c(grep("^SECRET$|^SECRET_OF", cc_stat$prefix),
grep("^OF_KELLS|^KELLS", cc_stat$suffix)))
,]
i = grep("SECRET|KELLS", movie_review$review)
it2 = itoken(movie_review$review[i], n_chunks = 1, ids = movie_review$id[i])
it2 = cc_model$transform(it2)
temp = it2$nextElem()
j = which(startsWith(temp$tokens[[2]], "SECRET_OF"))
j
#2 308 343 383 456 524 575
temp$tokens[[2]][j]
#"SECRET_OF_KELLS" "SECRET_OF" "SECRET_OF_KELLS" "SECRET_OF_KELLS" "SECRET_OF_KELLS" "SECRET_OF_KELLS" "SECRET_OF_KELLS"
temp$tokens[[2]][j + 1]
#"is_an" "KELLS." "convey" "aren't" "is_what" "expresses" "is_a"
As you can see, the token following token 308 ("SECRET_OF") is "KELLS." (note the dot at the end), not "KELLS", so it is not collapsed into the phrase. Try using word_tokenizer to fix this behaviour.
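For example, a minimal sketch of the suggested fix (same model parameters as in the example above; word_tokenizer splits on word boundaries and drops punctuation-only tokens, so "KELLS." becomes "KELLS" before the collocation model sees it):
library(text2vec)
docs <- movie_review$review
iterator <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
cc_model <- Collocations$new(collocation_count_min = 5, pmi_min = 1,
                             gensim_min = 0, llr_min = 0, lfmd_min = -25)
cc_model$fit(iterator, n_iter = 5)
vocabulary <- create_vocabulary(cc_model$transform(iterator))
vocabulary[grep("SECRET|KELLS", vocabulary$term), ]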
I will leave the issue open - please try and report back if you still have issues.
On the other side, there are some cases when the vocabulary will contain terms with counts lower than collocation_count_min.
Suppose collocation_count_min = 10 and consider the phrases "I have" and "I have been". Now suppose the prefix-suffix pairs ("I", "have") and ("I", "have_been") appeared in collocation_stat 18 and 17 times respectively. Then in the end the vocabulary will contain "I_have_been" 17 times and "I_have" only 1 time. This is because multi-word phrases are collapsed iteratively and we can't throw away "I_have" without affecting "I_have_been".
I will think whether this can be improved...
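As a quick check of this on the movie_review example above (5 being the collocation_count_min used there), one can look at which collapsed phrases end up below the threshold:
low_count_cc = vocabulary_subset_cc[vocabulary_subset_cc$term_count < 5, ]
head(low_count_cc)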
Thank you very much for your detailed explanation, which helped me to understand the output.
My initial conclusion is that, after applying a collocation model, one may consider an offset relative to collocation_count_min when pruning the vocabulary, depending on whether collocations such as "I_have" with a low term count are judged reasonable and should be kept. I assume that keeping such collocations is still better for downstream analysis than pruning them. Another option might be to "unlearn" collocations below the threshold from the cc_model; whether that is reasonable probably depends on the use case.
Anyway, I guess the difference in output is probably subtle regardless of the approach (not tested, I might do that some time if relevant).
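One rough sketch of the offset idea (not part of the text2vec API, just an illustration using the vocabulary from the movie_review example and its collocation_count_min = 5): prune as usual, then inspect which collapsed collocations a plain term_count threshold would throw away before deciding whether to keep them.
pruned = prune_vocabulary(vocabulary, term_count_min = 5)
# collocations that plain pruning would drop - inspect and decide case by case
lost_collocations = vocabulary[grepl("_", vocabulary$term) & !(vocabulary$term %in% pruned$term), ]
head(lost_collocations)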
Hi Dmitriy, sorry for another detail question about collocations. I have set collocation_count_min = 8; however, the vocabulary includes collocations with term_count = 1 or 2. I would like to better understand the logic of the output.
Below I have posted an example of the output for the collocation "electr_arc_furnac" in collocation_stat, in the vocabulary, and the doc_counts/term_counts from a manual check via stri_count() and sum(grepl()). The manual check shows that "electr_arc_furnac" occurs 42 times; the vocabulary, however, says 41. It seems that one combination of "electr_arc" with "furnac" was not created; instead, two separate parts of this collocation with counts lower than the threshold are kept in the final vocabulary. I would have expected no collocations with counts lower than collocation_count_min in the vocabulary.
Could you give me a hint, so that I better understand the final output with respect to collocations? My intention was to prune terms with low term_count (e.g. term_count = 1), but in my case I would then falsely prune some collocation occurrences from the vocabulary.