dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Logic of collocation output in vocabulary with respect to collocation_count_min? #230

Closed manuelbickel closed 5 years ago

manuelbickel commented 6 years ago

Hi Dmitriy, sorry for another detail question about collocations. I have set collocation_count_min = 8; however, the vocabulary includes collocations with term_count = 1 or 2. I would like to better understand the logic of the output.

Below I have posted an example of the output for the collocation "electr_arc_furnac" in collocation_stat, vocabulary, and the doc_counts/term_counts from a manual check via stri_count() and sum(grepl()). The manual check shows that "electr_arc_furnac" occurs 42 times. However, vocabulary says 41.

It seems that one combination of "electr_arc" with "furnac" was not created; instead, two separate parts of this collocation, with counts lower than the threshold, are kept in the final vocabulary. I would have expected that there are no collocations with a count lower than collocation_count_min in the vocabulary.

Could you give me a hint so that I can better understand the final output with respect to collocations? My intention was to prune rare terms (e.g., term_count = 1), but in my case I would falsely prune some collocation occurrences from the vocabulary.

#package version 0.5.0.10
cc_model = Collocations$new(collocation_count_min = 8
                             ,pmi_min = 1.38
                             ,gensim_min = 0
                             ,lfmd_min = -28.571 
                             ,llr_min = 0
                             , sep = "_"
)
cc_model$fit(iterator_docs, n_iter = 6)
# INFO [2018-01-05 15:16:46] iteration 1 - found 6046 collocations
# INFO [2018-01-05 15:16:51] iteration 2 - found 8359 collocations
# INFO [2018-01-05 15:16:57] iteration 3 - found 8631 collocations
# INFO [2018-01-05 15:17:02] iteration 4 - found 8658 collocations
# INFO [2018-01-05 15:17:08] iteration 5 - found 8663 collocations
# INFO [2018-01-05 15:17:14] iteration 6 - found 8663 collocations
check <- cc_model$collocation_stat
check[c(grep("^arc$|^electr_arc$", check$prefix),
         grep("^arc$", check$suffix))
       , ]
#        prefix suffix   n_i n_j n_ij       pmi      lfmd     gensim      llr rank_pmi rank_lfmd rank_gensim rank_llr
# 1: electr_arc furnac    43 165   37 13.565511 -18.31287 9502.15349 681.1788      374       233         199      854
# 2:        arc furnac   105 289   42 11.891696 -20.10077 3076.00448 634.4562      794       575         432      923
# 3:     plasma    arc   326 105   13 10.026016 -25.35020  401.01285 156.9139     1742      3426        1538     4950
# 4:     electr    arc 12517 105   44  6.522135 -25.33610   75.19832 332.2835     5162      3407        3676     1980
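As a side note on reading this table: the PMI column can in principle be reproduced from the count columns. A minimal sketch, assuming N is the total number of tokens in the corpus and a base-2 log (text2vec's exact normalization may differ slightly, so treat this as illustrative only):

```r
# Sketch only: pointwise mutual information from raw counts, where n_i and n_j
# are the prefix/suffix counts, n_ij the pair count, and N the (hypothetical)
# total number of tokens in the corpus.
pmi <- function(n_i, n_j, n_ij, N) {
  log2((n_ij / N) / ((n_i / N) * (n_j / N)))
}
pmi(n_i = 10, n_j = 10, n_ij = 10, N = 100)  # log2(10) ~ 3.32
```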

iterator_docs_cc <- cc_model$transform(iterator_docs)
vocabulary <- create_vocabulary(iterator_docs_cc)
vocabulary[grep("arc_furnac|electr_arc", vocabulary$term),]
# term term_count doc_count
# 1:        arc_furnac          1         1
# 2:        electr_arc          2         2
# 3: electr_arc_furnac         41        20

#results of term_count via stri_count() and doc_count via length(grep()) (see detailed commands below)
# term              term_count  doc_count
# electr arc furnac         42         20 
# electr arc                44         21
# arc furnac                42         20
# arc                      105         61
# furnac                   289        153
# electr                 12517       5793

#-------------------------------detailed commands to get term/doc counts manually
#TERM_COUNT of individual terms/collocations
#3-GRAM
sum(stringi::stri_count_regex(docs_ids[,doc], "\\belectr arc furnac\\b"))
#[1] 42

#2-GRAM
sum(stringi::stri_count_regex(docs_ids[,doc], "\\belectr arc\\b"))
#[1] 44
sum(stringi::stri_count_regex(docs_ids[,doc], "\\barc furnac\\b"))
#[1] 42

#1-GRAM
sum(stringi::stri_count_regex(docs_ids[,doc], "\\barc\\b"))
#[1] 105
sum(stringi::stri_count_regex(docs_ids[,doc], "\\bfurnac\\b"))
#[1] 289
sum(stringi::stri_count_regex(docs_ids[,doc], "\\belectr\\b"))
#[1] 12517

#DOC_COUNT of individual terms/collocations
#3-GRAM
length(docs_ids[grep("\\belectr arc furnac\\b", doc, perl =T), doc])
#[1] 20

#2-GRAM
length(docs_ids[grep("\\belectr arc\\b", doc, perl =T), doc])
#[1] 21
length(docs_ids[grep("\\barc furnac\\b", doc, perl =T), doc])
#[1] 20

#1-GRAM
length(docs_ids[grep("\\barc\\b", doc, perl =T), doc])
#[1] 61
length(docs_ids[grep("\\bfurnac\\b", doc, perl =T), doc])
#[1] 153
length(docs_ids[grep("\\belectr\\b", doc, perl =T), doc])
#[1] 5793
dselivanov commented 6 years ago

Thanks for reporting, I will try to investigate tomorrow.

Off-topic. I've noticed you use stemming with topic models. Won't it be better to use lemmatization? In my experience it produces more interpretable topics. Another trick is to use only particular parts of speech like "NOUN", "ADJ", "ADV".

Take a look at the udpipe package, which annotates text nicely and supports many languages. Here is an example of a tokenizer which keeps only particular POS tags:

library(udpipe)
library(data.table)

pos_lemma_tokenizer = function(x, udp_model, pos_keep = c("NOUN", "ADJ", "ADV")) {
  # annotate without the (slow) dependency parser
  res = as.data.table(udpipe_annotate(udp_model, x = x, doc_id = seq_along(x), parser = "none"))
  # keep only the requested parts of speech
  if (!is.null(pos_keep))
    res = res[upos %in% pos_keep, ]
  res = res[, .(doc_id = as.integer(doc_id), lemma = tolower(lemma))]
  # collapse lemmas into one token list per document
  res = res[, .(tokens = list(lemma)), keyby = doc_id]
  # re-attach documents that lost all their tokens during filtering...
  temp = data.table(doc_id = seq_along(x))
  res = res[temp, on = .(doc_id = doc_id)]
  # ...and give them an empty token vector instead of NULL
  res[vapply(tokens, is.null, logical(1L)), tokens := list(list(character(0)))]
  res$tokens
}

Or you can annotate the texts once (because annotation is time-consuming) and then play with different POS combinations.
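To illustrate the annotate-once idea: the POS filtering step itself does not need udpipe at all once the annotation table exists. A minimal base-R sketch, where `ann` is a hypothetical stand-in for an already-computed udpipe_annotate() result (only the doc_id, upos and lemma columns matter here):

```r
# hypothetical pre-computed annotation table
ann <- data.frame(
  doc_id = c(1L, 1L, 1L, 2L, 2L),
  upos   = c("NOUN", "VERB", "ADJ", "NOUN", "ADV"),
  lemma  = c("Topic", "be", "latent", "model", "well"),
  stringsAsFactors = FALSE
)
# rebuild token lists for any POS subset without re-annotating
tokens_for_pos <- function(ann, pos_keep) {
  keep <- ann[ann$upos %in% pos_keep, ]
  split(tolower(keep$lemma), keep$doc_id)
}
tokens_for_pos(ann, c("NOUN", "ADJ"))  # doc 1: "topic" "latent"; doc 2: "model"
```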

If it helps, please consider adding an article to text2vec.org via PR - all articles are here :-)

manuelbickel commented 6 years ago

Thank you for your quick reply. Regarding the first issue: I will try to reproduce the behaviour with movie_review tomorrow, so the issue is easier to investigate/follow (just thought you might have an idea directly).

What follows is off-topic: Thank you very much for your tips and the code snippet, I really appreciate this. I have also just realized that stemming might be too aggressive and an approach via POS tagging might be more fruitful (I am learning step by step by myself, and therefore sometimes go in the "wrong" direction). I will play around with a POS tagging approach and hope my machine does not let me down.

I am definitely going to share my study/article to the extent possible. Since it's for my PhD, I have to select a suitable journal. Professors might prefer journals with a higher impact factor, which might, however, have some copyright restrictions. Maybe I can get funding for open access. I am currently also discussing with Elsevier whether I can share the data set (Scopus download). Maybe it works. I will keep you updated and will share as much as possible.

manuelbickel commented 6 years ago

As promised I have reproduced the behaviour with movie_review - see example below.

When searching for the phrases "SECRET OF"/"OF KELLS" in the docs via negative lookahead/lookbehind, it appears that both phrases only occur as part of the 3-gram "SECRET OF KELLS", with doc_count=2 and term_count=8, never individually. However, the vocabulary includes "SECRET_OF_KELLS" with doc_count=2 and term_count=7, and in addition "SECRET_OF" with doc_count=1 and term_count=1.

I hope there is no error in the logic of my check. I have just started trying to understand the collocation algorithm, so maybe from the statistical perspective of hierarchical learning the results are fine. Just intuitively, I would have expected only the 3-gram collocation in the vocabulary.

library(text2vec)
packageVersion("text2vec")
#[1] '0.5.0.10'

docs <- movie_review$review
cc_model <- Collocations$new(collocation_count_min = 5
                             ,pmi_min = 1
                             ,gensim_min = 0
                             ,llr_min = 0
                             ,lfmd_min = -25
)
iterator <- itoken(docs, progressbar = F)
cc_model$fit(iterator, n_iter = 5)
# INFO [2018-01-09 08:43:27] iteration 1 - found 2358 collocations
# INFO [2018-01-09 08:43:29] iteration 2 - found 3396 collocations
# INFO [2018-01-09 08:43:31] iteration 3 - found 3601 collocations
# INFO [2018-01-09 08:43:33] iteration 4 - found 3615 collocations
# INFO [2018-01-09 08:43:35] iteration 5 - found 3615 collocations
cc_stat <- cc_model$collocation_stat
iterator <- cc_model$transform(iterator)
vocabulary <- create_vocabulary(iterator)

vocabulary_subset_cc <- vocabulary[grep("_", vocabulary$term),]
# term term_count doc_count
# 1:     _tried          1         1
# 2:      Eyre_          1         1
# 3:  SECRET_OF          1         1
# 4:        G_d          1         1
# 5: Bug's_Life          1         1
# ---                                
# 3573:       is_a       1221       985
# 3574:     on_the       1419      1062
# 3575:       in_a       1516      1167
# 3576:     of_the       3748      2061
# 3577:     in_the       3793      2230

cc_stat[unique(c(grep("^SECRET$|^SECRET_OF", cc_stat$prefix),
          grep("^OF_KELLS|^KELLS", cc_stat$suffix)))
        ,]
#       prefix suffix n_i n_j n_ij      pmi      lfmd    gensim      llr rank_pmi rank_lfmd rank_gensim rank_llr
# 1: SECRET_OF  KELLS   8  10    7 16.23807 -17.65250 22083.975 160.1924       40        97          29     1971
# 2:    SECRET     OF   8  76    8 13.77194 -20.26780  5245.648 153.6090      274       543         166     2072
# 3:        OF  KELLS  76  10    7 13.25737 -21.16766  2797.679 122.0923      334       766         235     2547

vocabulary[grep("SECRET|KELLS", vocabulary$term),]
# term term_count doc_count
# 1:       SECRET_OF          1         1
# 2:          KELLS.          1         1
# 3:           KELLS          3         1
# 4: SECRET_OF_KELLS          7         2

#results of "manual" term_count via stri_count and doc_count via sum(grepl()) (detailed commands see below)
# term              term_count  doc_count
# SECRET_OF_KELLS           8         2
# SECRET_OF                 8         2
# OF_KELLS                  8         2
# SECRET                    8         2
# KELLS                    11         2
# OF                       76        52

#additional check via negative lookbehind
#shows that there is no occurrence of "OF KELLS" not preceded by "SECRET"
sum(stringi::stri_count_regex(docs, "(?<!SECRET )OF KELLS\\b"))
#[1] 0
sum(stringi::stri_count_regex(docs, "\\bSECRET OF (?!KELLS)"))
#[1] 0

#-------------------------------detailed commands to get term/doc counts manually
#TERM_COUNT of individual terms/collocations
#3-gram
sum(stringi::stri_count_regex(docs, "\\bSECRET OF KELLS\\b"))
#[1] 8
#2-gram
sum(stringi::stri_count_regex(docs, "\\bSECRET OF\\b"))
#[1] 8
sum(stringi::stri_count_regex(docs, "\\bOF KELLS\\b"))
#[1] 8

#1-gram
sum(stringi::stri_count_regex(docs, "\\bSECRET\\b"))
#[1] 8
sum(stringi::stri_count_regex(docs, "\\bKELLS\\b"))
#[1] 11
sum(stringi::stri_count_regex(docs, "\\bOF\\b"))
# [1] 76

#DOC_COUNT of individual terms/collocations
#3-GRAM
sum(grepl("\\bSECRET OF KELLS\\b", docs, perl = T))
#[1] 2

#2-gram
sum(grepl("\\bSECRET OF\\b", docs, perl = T))
#[1] 2
sum(grepl("\\bOF KELLS\\b", docs, perl = T))
#[1] 2

#1-gram
sum(grepl("\\bSECRET\\b", docs, perl = T))
#[1] 2
sum(grepl("\\bKELLS\\b", docs, perl = T))
#[1] 2
sum(grepl("\\bOF\\b", docs, perl = T))
#[1] 52
dselivanov commented 6 years ago

I think the issue is that you use regex with word-boundary patterns when you count collocations manually, as opposed to the Collocations model, where you used a simple whitespace tokenizer. In the last example the problem is in document number 1195 with id = "11664_9".

library(text2vec)
packageVersion("text2vec")
#[1] '0.5.0.10'

docs <- movie_review$review
cc_model <- Collocations$new(collocation_count_min = 5
                             ,pmi_min = 1
                             ,gensim_min = 0
                             ,llr_min = 0
                             ,lfmd_min = -25
)
iterator <- itoken(docs, progressbar = F)
cc_model$fit(iterator, n_iter = 5)

cc_stat <- cc_model$collocation_stat
iterator <- cc_model$transform(iterator)
vocabulary <- create_vocabulary(iterator)

vocabulary_subset_cc <- vocabulary[grep("_", vocabulary$term), ]

cc_stat[unique(c(grep("^SECRET$|^SECRET_OF", cc_stat$prefix),
                 grep("^OF_KELLS|^KELLS", cc_stat$suffix)))
        ,]

i = grep("SECRET|KELLS", movie_review$review)
it2 = itoken(movie_review$review[i], n_chunks = 1, ids = movie_review$id[i])
it2 = cc_model$transform(it2)

temp = it2$nextElem()
j = which(startsWith(temp$tokens[[2]], "SECRET_OF"))

j
#2 308 343 383 456 524 575

temp$tokens[[2]][j]
#"SECRET_OF_KELLS" "SECRET_OF"       "SECRET_OF_KELLS" "SECRET_OF_KELLS" "SECRET_OF_KELLS" "SECRET_OF_KELLS" "SECRET_OF_KELLS"
temp$tokens[[2]][j + 1]
#"is_an"     "KELLS."    "convey"    "aren't"    "is_what"   "expresses" "is_a"

As you can see, the next token after token 308 ("SECRET_OF") is "KELLS." (note the dot at the end), not "KELLS", so it is not collapsed into the phrase. Try using word_tokenizer to fix this behaviour.
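The tokenizer difference can be seen without text2vec at all. A base-R sketch, using a split on non-word characters as a rough stand-in for what word_tokenizer does:

```r
s <- "THE SECRET OF KELLS. A film."
# whitespace split keeps trailing punctuation attached to the token
strsplit(s, " ", fixed = TRUE)[[1]][4]  # "KELLS."
# splitting on non-word characters (roughly what word_tokenizer does) strips it
strsplit(s, "\\W+")[[1]][4]             # "KELLS"
```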

I will leave the issue open - please try and report back if you still have issues.

dselivanov commented 6 years ago

On the other hand, there are cases where the vocabulary will contain collocations with counts lower than collocation_count_min. Suppose collocation_count_min = 10 and you have the phrases "I have" and "I have been". Now suppose the prefix-suffix pairs ("I", "have") and ("I", "have_been") appeared in collocation_stat 18 and 17 times respectively. At the end, the vocabulary will contain "I_have_been" 17 times and "I_have" only 1 time. This is because multi-word phrases are collapsed iteratively, and we can't throw away "I_have" without affecting "I_have_been". I will think about whether this can be improved...
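A toy sketch of that iterative collapsing (not text2vec's actual implementation), with the counts from the example, may help. The helper collapses one prefix/suffix pair per pass:

```r
# collapse every adjacent (prefix, suffix) pair into a single joined token
collapse_pair <- function(tokens, prefix, suffix, sep = "_") {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens) && tokens[i] == prefix && tokens[i + 1] == suffix) {
      out <- c(out, paste(prefix, suffix, sep = sep))
      i <- i + 2
    } else {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

# a toy corpus where "I have" occurs 18 times, 17 of them as "I have been"
tokens <- c(rep(c("I", "have", "been"), 17), "I", "have", "left")
pass1 <- collapse_pair(tokens, "I", "have")      # 18 "I_have" tokens
pass2 <- collapse_pair(pass1, "I_have", "been")  # 17 collapse further
sum(pass2 == "I_have_been")  # 17
sum(pass2 == "I_have")       # 1 -- below collocation_count_min = 10
```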

manuelbickel commented 6 years ago

Thank you very much for your detailed explanation, which helped me to understand the output.

My initial conclusion is that after applying a collocation model, one may consider an offset for collocations (collocation_count_min) when pruning the vocabulary, depending on whether collocations such as "I_have" with a low term count are judged reasonable and should be kept. I assume that keeping such collocations is still better for downstream analysis than pruning them. Another option might be to "unlearn" collocations below the threshold from the cc_model. Whether this is reasonable probably depends on the use case.
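The offset idea can be sketched with a hypothetical helper (plain data.frame instead of text2vec's vocabulary object): apply term_count_min only to plain terms and keep all collapsed collocations, identified by the separator:

```r
# keep every collocation (term containing the separator) regardless of count,
# and prune only plain terms below term_count_min
prune_keep_collocations <- function(vocab, term_count_min, sep = "_") {
  is_collocation <- grepl(sep, vocab$term, fixed = TRUE)
  vocab[is_collocation | vocab$term_count >= term_count_min, ]
}

# toy vocabulary resembling the output above
vocab <- data.frame(
  term       = c("arc_furnac", "electr_arc_furnac", "arc", "typo"),
  term_count = c(1L, 41L, 105L, 2L),
  stringsAsFactors = FALSE
)
prune_keep_collocations(vocab, term_count_min = 5)$term
# "arc_furnac" survives despite term_count = 1; "typo" is pruned
```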

Anyway, I guess the difference in the output is probably subtle regardless of the approach (not tested; I might do that some time if relevant).