PolMine / polmineR

R-package for text mining with the Corpus Workbench (CWB) as backend

Cooccurrences()-method cannot process zero left and/or right context #117

Closed PolMine closed 4 years ago

PolMine commented 4 years ago

This is not a problem if you only want to prepare term-cooccurrence matrices, but phrase detection (bigram detection) requires that the left and/or right context can be set to zero.
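To illustrate why a zero-width context matters here, the following base R sketch shows that a window with no left context and one token of right context yields exactly the adjacent-token pairs, i.e. the bigrams. The token vector and the helpers `window_pairs()` and `toy_pmi()` are hypothetical and written only for this illustration; this is not how polmineR implements `Cooccurrences()` or `pmi()`, and the PMI formula is just one common convention.

```r
# Toy token stream (made-up data, not taken from any corpus).
tokens <- c("the", "federal", "government", "and", "the", "federal", "states")

# Hypothetical helper: pair each token a with every token b that lies
# at most `left` positions before or `right` positions after it.
window_pairs <- function(tokens, left, right) {
  n <- length(tokens)
  offsets <- setdiff(seq(-left, right), 0L)  # relative positions, excluding the node itself
  pairs <- lapply(offsets, function(k) {
    i <- seq_len(n)
    j <- i + k
    keep <- j >= 1L & j <= n                 # drop positions outside the token stream
    data.frame(a = tokens[i[keep]], b = tokens[j[keep]], stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# Zero left context, one token of right context: the pairs are the bigrams.
pairs <- window_pairs(tokens, left = 0L, right = 1L)
bigrams <- data.frame(
  a = head(tokens, -1L),
  b = tail(tokens, -1L),
  stringsAsFactors = FALSE
)
stopifnot(identical(pairs$a, bigrams$a), identical(pairs$b, bigrams$b))

# Assumed PMI convention: log2(p(a, b) / (p(a) * p(b))), with all
# probabilities estimated as counts divided by the number of tokens.
toy_pmi <- function(a, b) {
  n_tok <- length(tokens)
  p_ab <- sum(pairs$a == a & pairs$b == b) / n_tok
  p_a <- sum(tokens == a) / n_tok
  p_b <- sum(tokens == b) / n_tok
  log2(p_ab / (p_a * p_b))
}

toy_pmi("the", "federal")
```

With a symmetric window (for example `left = 1L, right = 1L`), each adjacent pair would be counted in both directions, which is fine for a cooccurrence matrix but loses the word order that phrase detection depends on.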

PolMine commented 4 years ago

The problem of processing zero values for the left and/or right context was addressed a while ago. This unit test checks whether everything works as expected.

# Packages needed to run the test outside the package's own test suite
library(testthat)
library(polmineR)
library(magrittr)  # provides %>%

test_that(
  "identity of phrase detection of decode-workflow and Cooccurrences workflow",
  {
    # Workflow a: decode the token stream, build bigrams, compute PMI
    a <- corpus("GERMAPARLMINI") %>%
      decode(p_attribute = "word", s_attribute = character(), to = "data.table", verbose = FALSE) %>%
      ngrams(n = 2L, p_attribute = "word") %>%
      pmi(observed = count("GERMAPARLMINI", p_attribute = "word"))

    # Workflow b: cooccurrences with zero left context and one token of right context
    b <- Cooccurrences("GERMAPARLMINI", p_attribute = "word", left = 0L, right = 1L, verbose = FALSE) %>%
      decode() %>%
      pmi()

    # Compare the pairs that occur exactly five times, sorted the same way
    a_min <- subset(a, ngram_count == 5L) %>% slot("stat") %>% data.table::setorderv(cols = c("word_1", "word_2"))
    b_min <- subset(b, ab_count == 5L) %>% slot("stat") %>% data.table::setorderv(cols = c("a_word", "b_word"))

    expect_identical(nrow(a_min), nrow(b_min))
    expect_identical(a_min[["word_1"]], b_min[["a_word"]])
    expect_identical(a_min[["word_2"]], b_min[["b_word"]])
  }
)