Closed ablaette closed 6 months ago
The error does not occur when I break up the entire corpus into smaller pieces (legislative periods), but I do see it for the 17th legislative period of GERMAPARL2. Observations:
get_dbpedia_uris()
are successful. data.table
objects that is passed into rbindlist()
To be tested experimentally:
options(expressions = 5e5)
We might also look at Cstack_info()
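A minimal sketch of how both settings could be inspected from a running session, using base R only. One caveat: according to the R documentation, `expressions` limits the depth of nested expressions, while the pointer protection stack named in the error message is sized at startup through the `--max-ppsize` command-line option. Raising `expressions` may therefore not affect this error; the snippet is exploratory, not a confirmed fix.

```r
# Remember the current setting so it can be restored afterwards.
old <- options(expressions = 5e5)

# Limit on nested expressions (valid range per ?options: 25 to 500000).
expr_limit <- getOption("expressions")
print(expr_limit)

# C stack information for this session: size, current usage,
# direction of growth and current evaluation depth.
cstack <- Cstack_info()
print(cstack)

# Restore the previous setting.
options(old)
```

If the protection stack itself turns out to be the bottleneck, starting R with a larger `--max-ppsize` would be the setting to experiment with.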
This is a minimal version of the code I used that resulted in the error:
library(RcppCWB)
library(polmineR)
library(dplyr)
library(dbpedia)
logfile <- tempfile()
p_strucs <- s_attr("GERMAPARL2", s_attribute = "ne", registry = Sys.getenv("CORPUS_REGISTRY")) %>%
  s_attr_size() %>%
  (`-`)(1) %>%
  seq(from = 0L, to = .) %>%
  get_region_matrix(corpus = "GERMAPARL2", s_attribute = "ne", strucs = .) %>%
  .[, 1L] %>%
  cl_cpos2struc(corpus = "GERMAPARL2", s_attribute = "p", cpos = .) %>%
  unique()
p_types <- cl_struc2str(corpus = "GERMAPARL2", s_attribute = "p_type", struc = p_strucs)
p_strucs_speech <- p_strucs[which(p_types == "speech")]
paras <- corpus("GERMAPARL2") %>%
  subset(p %in% !!p_strucs_speech) %>%
  subset(protocol_lp == "17") %>%
  split(s_attribute = "p", values = FALSE)
uritab <- get_dbpedia_uris(
  x = paras,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"),
  logfile = logfile,
  retry = 3,
  verbose = FALSE,
  expand_to_token = TRUE,
  progress = TRUE,
  s_attribute = "ne_type"
)
To get a better understanding of the issue, I tried to provoke it with the code below. Unfortunately, it runs without a problem. How can we provoke the error?
library(data.table)
dt <- data.table(
  A = 1:100,
  B = 1:100,
  C = 1:100,
  D = 1:100,
  E = 1:100,
  F = 1:100,
  G = 1:100,
  H = 1:100,
  I = 1:100,
  J = rep(list(a = "asdf", b = "asdf", c = "sdf"), times = 100)
)
dts <- lapply(1:500000, function(i) copy(dt))
foo <- rbindlist(dts)
Confirmed: The error does not occur when we drop the column "types", which holds list values. Dropping the column is currently implemented only in the get_dbpedia_uris() method for subcorpus_bundle objects. A consistent implementation is still to be done.
I think I can second that.
With the nested lists in types, calling rbindlist() results in the error you described when the list of data.tables returned within get_dbpedia_uris() gets large. Dropping the types column seems to be a good solution. Information on types can be stored in other ways given the mechanism around types_src.
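A minimal sketch of the idea, assuming the per-paragraph results are data.tables with a list column named `types`. The helper `make_chunk()` and the columns `id` and `text` are hypothetical stand-ins for illustration. The snippet does not call the dbpedia package and is not its implementation.

```r
library(data.table)

# Hypothetical stand-in for one per-paragraph result: two plain columns
# plus a list column "types" whose cells are nested lists.
make_chunk <- function(i) {
  data.table(
    id = i,
    text = c("Berlin", "Bundestag"),
    types = list(list(DBpedia = "Place"), list(DBpedia = "Organisation"))
  )
}

chunks <- lapply(1:1000, make_chunk)

# Keep every column except the list column before combining the tables.
chunks_flat <- lapply(
  chunks,
  function(x) x[, setdiff(names(x), "types"), with = FALSE]
)

combined <- rbindlist(chunks_flat)
print(dim(combined))
```

Dropping the column before rbindlist() means the combined table contains atomic columns only; any type information would have to be kept elsewhere.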
We now have the argument types_drop to remove the 'types' column, and the protect issue disappears when dropping the column. So it is now a matter of documentation to convey this point.
I added a paragraph explaining this issue to the documentation of the get_dbpedia_uris() method.
Running this ...
Results in this error: Error: protect(): protection stack overflow
See this Stack Overflow question for a potential solution: https://stackoverflow.com/questions/32826906/how-to-solve-protection-stack-overflow-issue-in-r-studio
So should I include something such as
before this expression?