Closed by ablaette 8 months ago
This is a minimal example I use in another context, but it demonstrates that using the newly created corpus fails unless you call RcppCWB::cqp_reset_registry().
library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)
regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")
dir.create(regdir)
dir.create(datadir, recursive = TRUE)
tokenstream <- c("Das müssen Sie einfach hören!") %>%
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE)
p_attribute_encode(
  token_stream = tokenstream[[1]],
  p_attribute = "word",
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  compress = FALSE,
  quietly = TRUE
)
# Without this call, using the newly created corpus fails
RcppCWB::cqp_reset_registry(registry = regdir)
corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")
We have introduced a reload argument with CorpusData$encode(), but p_attribute_encode() should have an argument reload too, because having to run RcppCWB::cl_delete_corpus() and RcppCWB::cqp_reset_registry() manually is not intuitive at all.
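Until such an argument exists, the two calls could be bundled in a small helper. This is only a sketch: reload_corpus() is a hypothetical name, not part of cwbtools or RcppCWB. It assumes that RcppCWB::cl_delete_corpus() accepts corpus and registry arguments and that CQP has already been initialized, for instance by loading polmineR.

```r
# Hypothetical helper (not part of cwbtools): refresh a corpus after encoding.
reload_corpus <- function(corpus, registry_dir) {
  # Drop the C-level representation of the corpus so stale data is not reused.
  # Assumption: cl_delete_corpus() takes `corpus` and `registry` arguments.
  RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir)
  # Let CQP re-read the registry so that the corpus becomes available.
  RcppCWB::cqp_reset_registry(registry = registry_dir)
  invisible(corpus)
}

# Usage with the objects from the minimal example:
# reload_corpus("LATIN1", registry_dir = regdir)
```

A reload argument in p_attribute_encode() could presumably do the same internally, so that users never need to call the RcppCWB functions themselves.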