PolMine / polmineR

R-package for text mining with the Corpus Workbench (CWB) as backend

argument "right" gets ignored in cooccurrences(), kwic() (and other functions?) #101

Closed mxi-hug closed 4 years ago

mxi-hug commented 5 years ago

There is unexpected behavior in the cooccurrences() and kwic() functions. When the left/right arguments are used to adjust the window for calculation and display, only the value of "left" is used. It is applied symmetrically to the left and the right, as with the deprecated "window" argument, and the value of "right" is ignored.

kwic("GERMAPARL", "Experte", left = 1, right = 0)

cooc <- cooccurrences("GERMAPARL", "Experte", left = 1, right = 0)
sum(cooc@stat$count_coi) # should be 242, is 484

ablaette commented 5 years ago

Great issue - you found two issues at the same time!

The first one that prevented you from setting the left and the right context independently was a fairly obvious error in the kwic()-method for character vectors.

When moving to introduce the corpus/subcorpus-class, this method became a wrapper for the kwic()-method for corpus class objects, and there was an outright error that handed over the value for the argument left to the argument right. This is easy to fix.
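To illustrate the kind of error described here, the following is a minimal sketch in base R. The functions window_worker(), window_wrapper_buggy() and window_wrapper_fixed() are hypothetical stand-ins invented for this example; they are not the polmineR methods, and the real code may differ in its details.

```r
# Hypothetical stand-in for the worker method (NOT polmineR code):
# it simply reports the window sizes it received.
window_worker <- function(left, right) {
  c(left = left, right = right)
}

# Buggy wrapper: hands the value of `left` to the argument `right`,
# so whatever the caller passes as `right` is silently ignored.
window_wrapper_buggy <- function(left, right) {
  window_worker(left = left, right = left)
}

# Fixed wrapper: each argument is forwarded to its own counterpart.
window_wrapper_fixed <- function(left, right) {
  window_worker(left = left, right = right)
}

buggy <- window_wrapper_buggy(left = 1, right = 0)
fixed <- window_wrapper_fixed(left = 1, right = 0)
print(buggy)
print(fixed)
```

With left = 1 and right = 0, the buggy wrapper reports a window of one token on both sides, which would be consistent with the doubled count in the original report.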

The second issue is that a left/right context of 0 tokens does not yet work correctly. Here, the cause of the buggy behaviour is the context()-method that is the worker for the kwic() and the cooccurrences()-methods. The context class includes a data.table reporting which tokens occur at the individual corpus positions in the left and right context. But the procedure does not work if the value is 0. See the following example.

library(polmineR)
library(data.table)
use("GermaParl")

ex <- context("GERMAPARL", query = "Experte", left = 5, right = 0)
setorderv(x = ex@cpos, cols = c("match_id", "position"))
ex@cpos[match_id == 1]

So what needs to be done is to systematically implement checks for the special case when either the left or the right context is 0. I have started to do so, please give me a bit of time to ensure that I do not provoke unwanted side-effects.
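As a rough sketch of what handling a context of 0 can look like in R: the helper context_positions() below is hypothetical and not part of polmineR, and whether the actual bug in context() has the same cause is an assumption. It relies on a well-known R pitfall: 1:0 expands to c(1, 0), whereas seq_len(0) is an empty vector.

```r
# Hypothetical helper (NOT polmineR code): corpus positions of the left and
# right context around a single match at corpus position `cpos`.
# Corpus boundaries are deliberately ignored in this sketch.
context_positions <- function(cpos, left, right) {
  stopifnot(left >= 0, right >= 0)
  # seq_len(0) is integer(0), so a window of 0 yields no positions at all.
  # Writing 1:left instead would give c(1, 0) for left = 0.
  left_cpos <- cpos - rev(seq_len(left))
  right_cpos <- cpos + seq_len(right)
  list(left = left_cpos, right = right_cpos)
}

res <- context_positions(cpos = 100L, left = 5, right = 0)
print(res$left)
print(length(res$right))

# The pitfall that seq_len() avoids: the colon operator counts downwards.
pitfall <- 1:0
print(pitfall)
```

Using seq_len() means the zero case needs no separate branch, though explicit checks may still be preferable where the surrounding code assumes a non-empty context.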

ablaette commented 5 years ago

I think the development version of polmineR (v0.7.11.9036) solves this issue. Left and right context can differ now, and the left or the right context can be 0 (zero). These are the examples I used.

library(polmineR)
library(data.table)
use("GermaParl")

cooc <- cooccurrences("GERMAPARL", "Experte", left = 1, right = 0)
sum(cooc@stat$count_coi) # 242, as expected

corpus("GERMAPARL") %>%
  subset(year == "2005") %>%
  context(query = "Experte", left = 3, right = 0) %>%
  kwic()

corpus("GERMAPARL") %>%
  context(query = "Experte", left = 5, right = 0) %>%
  kwic()

kwic("GERMAPARL", query = "Experte", left = 5, right = 0)

corpus("GERMAPARL") %>% kwic(query = "Experte", left = 0, right = 5)

Maybe you have additional checks in mind? Would be great to get your feedback. Before closing the issue, I should write unit tests to prevent anything like this from happening again.

mxi-hug commented 5 years ago

Thanks for the fix, this solves the issue. Your checks look pretty comprehensive to me.