PolMine closed this issue 4 years ago
The problem of processing zero values for the left and/or right context was addressed a while ago. This unit test checks whether everything works as expected.
test_that(
  "identity of phrase detection of decode-workflow and Cooccurrences workflow",
  {
    # Workflow A: decode the token stream, build bigrams, score them with pmi()
    a <- corpus("GERMAPARLMINI") %>%
      decode(p_attribute = "word", s_attribute = character(), to = "data.table", verbose = FALSE) %>%
      ngrams(n = 2L, p_attribute = "word") %>%
      pmi(observed = count("GERMAPARLMINI", p_attribute = "word"))

    # Workflow B: cooccurrences with no left context and one token of right context
    b <- Cooccurrences("GERMAPARLMINI", p_attribute = "word", left = 0L, right = 1L, verbose = FALSE) %>%
      decode() %>%
      pmi()

    # Compare the pairs that occur exactly five times, sorted by both words
    a_min <- subset(a, ngram_count == 5L) %>%
      slot("stat") %>%
      data.table::setorderv(cols = c("word_1", "word_2"))
    b_min <- subset(b, ab_count == 5L) %>%
      slot("stat") %>%
      data.table::setorderv(cols = c("a_word", "b_word"))

    expect_identical(nrow(a_min), nrow(b_min))
    expect_identical(a_min[["word_1"]], b_min[["a_word"]])
    expect_identical(a_min[["word_2"]], b_min[["b_word"]])
  }
)
Zero values for the left and/or right context are not needed if you just want to prepare term-cooccurrence matrices, but they are necessary for phrase detection (bigram detection).
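
To illustrate why a zero-width left context matters here, the following is a minimal base R sketch on a made-up token vector. It does not use polmineR and is not how `Cooccurrences()` or `pmi()` are implemented: `context_pairs()` and `pmi_table()` are hypothetical helpers, and the PMI formula is the textbook definition, which may differ in detail from what the package computes.

```r
# Toy token stream (made-up data, not taken from GERMAPARLMINI).
tokens <- c("new", "york", "is", "in", "new", "york", "state")

# Hypothetical helper: collect (a, b) pairs where b is at most `left`
# tokens before or `right` tokens after a.
context_pairs <- function(tokens, left, right) {
  n <- length(tokens)
  offsets <- setdiff(seq(-left, right), 0L)
  pairs <- lapply(offsets, function(k) {
    i <- seq_len(n)
    keep <- (i + k) >= 1L & (i + k) <= n
    data.frame(a = tokens[i[keep]], b = tokens[i[keep] + k], stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

key <- function(p) paste(p$a, p$b)

bigram_pairs <- context_pairs(tokens, left = 0L, right = 1L)  # ordered, adjacent
window_pairs <- context_pairs(tokens, left = 1L, right = 1L)  # symmetric window

# With left = 0 and right = 1 the pairs are exactly the bigrams of the stream.
direct_bigrams <- paste(head(tokens, -1L), tail(tokens, -1L))
stopifnot(identical(key(bigram_pairs), direct_bigrams))

# A symmetric window also yields reversed pairs that are not bigrams.
stopifnot("york new" %in% key(window_pairs))
stopifnot(!("york new" %in% key(bigram_pairs)))

# Hypothetical helper: textbook PMI, log2(p(a, b) / (p(a) * p(b))).
pmi_table <- function(pairs, tokens) {
  ab <- as.data.frame(table(a = pairs$a, b = pairs$b), stringsAsFactors = FALSE)
  ab <- ab[ab$Freq > 0L, ]
  unigram <- table(tokens)
  p_ab <- ab$Freq / nrow(pairs)
  p_a <- as.numeric(unigram[ab$a]) / length(tokens)
  p_b <- as.numeric(unigram[ab$b]) / length(tokens)
  ab$pmi <- log2(p_ab / (p_a * p_b))
  ab[order(-ab$pmi), ]
}

print(pmi_table(bigram_pairs, tokens))
```

For a term-cooccurrence matrix the symmetric window is usually what you want, since word order within the window does not matter. For phrase detection the order does matter, which is why the pairs have to come from a window that looks only to the right.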