PolMine / dbpedia

R Wrapper for Corpus Annotation with DBpedia Spotlight

get_dbpedia_uris() aborts: Error: protect(): protection stack overflow #52

Closed ablaette closed 6 months ago

ablaette commented 7 months ago

Running this ...

library(polmineR)
library(RcppCWB)
library(dbpedia)
library(dplyr)

p_size <- cl_attribute_size(corpus = "GERMAPARL2", attribute = "p", attribute_type = "s")

p_strucs <- s_attr("GERMAPARL2", s_attribute = "ne", registry = Sys.getenv("CORPUS_REGISTRY")) %>%
  s_attr_size()  %>%
  (`-`)(1) %>%
  seq(from = 0L, to = .) %>%
  get_region_matrix(corpus = "GERMAPARL2", s_attribute = "ne", strucs = .) %>%
  .[, 1L] %>%
  cl_cpos2struc(corpus = "GERMAPARL2", s_attribute = "p", cpos = .) %>%
  unique()

# Keep only paragraphs of type "speech"; p_strucs_speech is used below
p_types <- cl_struc2str(corpus = "GERMAPARL2", s_attribute = "p_type", struc = p_strucs)
p_strucs_speech <- p_strucs[which(p_types == "speech")]

logfile <- tempfile()
message("Using logfile: ", logfile)

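# NOTE: `decade` is not defined in this snippet; it has to be set beforehand
# (the first three digits of the year, judging from the regex).
decade_regex <- sprintf("^%d\\d-\\d{2}-\\d{2}", decade)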

paras <- corpus("GERMAPARL2") %>%
  subset(p %in% !!p_strucs_speech) %>%
  subset(grepl(!!decade_regex, protocol_date)) %>%
  split(s_attribute = "p", values = FALSE)

uritab_paragraphs <- get_dbpedia_uris(
  x = paras,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20, 
  api = getOption("dbpedia.endpoint"),
  logfile = logfile,
  retry = 3,
  verbose = FALSE,
  expand_to_token = TRUE,
  progress = TRUE,
  s_attribute = "ne_type"
)

Results in this error: Error: protect(): protection stack overflow

See this Stack Overflow question for a potential solution: https://stackoverflow.com/questions/32826906/how-to-solve-protection-stack-overflow-issue-in-r-studio

So should I include something such as

options(expressions = 5e5)

before this expression?

ablaette commented 6 months ago

The error does not occur when I break up the entire corpus into smaller pieces (legislative periods), but I do see it for the 17th legislative period of GERMAPARL2. Observations:

To be tested experimentally:

options(expressions = 5e5)

We might also look at Cstack_info()
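
A minimal sketch of how these two settings could be inspected and changed for an experiment, using base R only. Whether raising `expressions` helps here is an open question: as far as I know, the size of the pointer protection stack itself is fixed at startup via the `--max-ppsize` command-line option, so the sketch only covers what can be checked from within a running session.

```r
# Inspect the current limits (base R only).
old_expr <- getOption("expressions")  # nesting limit for expressions
cstack <- Cstack_info()               # named vector: size, current, direction, eval_depth
print(old_expr)
print(cstack)

# Raise the expression limit for the duration of an experiment.
old <- options(expressions = 5e5)     # 5e5 is the documented maximum
# ... run the failing get_dbpedia_uris() call here ...
options(old)                          # restore the previous setting
```

Restoring the old value afterwards keeps the experiment from affecting the rest of the session.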

This is a minimal version of the code I used that resulted in the error:

library(RcppCWB)
library(polmineR)
library(dplyr)
library(dbpedia)

logfile <- tempfile()

p_strucs <- s_attr("GERMAPARL2", s_attribute = "ne", registry = Sys.getenv("CORPUS_REGISTRY")) %>%
  s_attr_size()  %>%
  (`-`)(1) %>%
  seq(from = 0L, to = .) %>%
  get_region_matrix(corpus = "GERMAPARL2", s_attribute = "ne", strucs = .) %>%
  .[, 1L] %>%
  cl_cpos2struc(corpus = "GERMAPARL2", s_attribute = "p", cpos = .) %>%
  unique()

p_types <- cl_struc2str(corpus = "GERMAPARL2", s_attribute = "p_type", struc = p_strucs)
p_strucs_speech <- p_strucs[which(p_types == "speech")]

paras <- corpus("GERMAPARL2") %>%
  subset(p %in% !!p_strucs_speech) %>%
  subset(protocol_lp == "17") %>%
  split(s_attribute = "p", values = FALSE)

uritab <- get_dbpedia_uris(
  x = paras,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20, 
  api = getOption("dbpedia.endpoint"),
  logfile = logfile,
  retry = 3,
  verbose = FALSE,
  expand_to_token = TRUE,
  progress = TRUE,
  s_attribute = "ne_type"
)

ablaette commented 6 months ago

To get a better understanding of the issue, I tried to provoke it with the code below. Unfortunately, it runs without a problem. How can we provoke the error?

library(data.table)
dt <- data.table(
  A = 1:100,
  B = 1:100,
  C = 1:100,
  D = 1:100,
  E = 1:100,
  F = 1:100,
  G = 1:100,
  H = 1:100,
  I = 1:100,
  J = rep(list(a = "asdf", b = "asdf", c = "sdf"), times = 100)
)
dts <- lapply(1:500000, function(i) copy(dt))
foo <- rbindlist(dts)

ablaette commented 6 months ago

Confirmed: The error does not occur when we drop the column "types", which holds list values. Dropping the column is currently implemented only in get_dbpedia_uris() for subcorpus_bundle objects. A consistent implementation across methods remains a to-do.

ChristophLeonhardt commented 6 months ago

I think I can second that.

With the nested lists in types, calling rbindlist() results in the error you described when the list of data.tables returned within get_dbpedia_uris() gets large. Dropping the types column seems to be a good solution. Information on types can be stored in other ways given the mechanism around types_src.
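
To illustrate the idea, here is a small sketch of dropping a nested list column before combining tables. The helper `make_result()`, the column names `cpos_left` and `dbpedia_uri`, and the example URIs are hypothetical stand-ins; the tables actually built inside get_dbpedia_uris() may look different.

```r
library(data.table)

# Hypothetical stand-in for one per-paragraph result table.
make_result <- function(i) {
  data.table(
    cpos_left = i,
    dbpedia_uri = sprintf("http://dbpedia.org/resource/Example_%d", i),
    # One nested list per row, mimicking the 'types' column
    types = list(list(DBpedia = c("Place", "Location"), Schema = "Place"))
  )
}

results <- lapply(1:3, make_result)

# Drop the nested list column from each table before combining.
results_flat <- lapply(
  results,
  function(dt) dt[, setdiff(names(dt), "types"), with = FALSE]
)
uritab <- rbindlist(results_flat)
print(uritab)
```

Selecting the remaining columns with `with = FALSE` returns a new table and leaves the originals untouched, so the type information is still available if it needs to be stored elsewhere.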

ablaette commented 6 months ago

We now have the argument types_drop to remove the 'types' column, and the protect() issue disappears when the column is dropped. So it is now a matter of documentation to convey this point.

ablaette commented 6 months ago

I added a paragraph explaining this issue in the documentation of the get_dbpedia_uris()-method.