PolMine / cwbtools

Tools to create and manage CWB-indexed corpora

p_attribute_encode() should also have argument `reload` #68

Closed: ablaette closed this issue 8 months ago

ablaette commented 8 months ago

We introduced it with CorpusData$encode(), but p_attribute_encode() should have a `reload` argument too, because having to run RcppCWB::cl_delete_corpus() and RcppCWB::cqp_reset_registry() by hand after encoding is not intuitive at all.
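To illustrate what such an argument could do, here is a minimal sketch of a wrapper. The name `p_attribute_encode_reload()` and its `reload` default are hypothetical and not part of cwbtools. The wrapper simply chains the two RcppCWB calls mentioned above after encoding. It assumes that `cl_delete_corpus()` can be called safely for the corpus in question and that CQP has already been initialized in the session.

```r
# Hypothetical wrapper, not part of cwbtools: forwards all arguments to
# cwbtools::p_attribute_encode() and optionally refreshes the session
# so that the newly encoded corpus can be used right away.
p_attribute_encode_reload <- function(..., corpus, registry_dir, reload = TRUE) {
  cwbtools::p_attribute_encode(
    ...,
    corpus = corpus,
    registry_dir = registry_dir
  )
  if (isTRUE(reload)) {
    # Assumption: drop a possibly stale in-memory representation of the corpus
    RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir)
    # Make CQP re-read the registry directory so it sees the new corpus
    RcppCWB::cqp_reset_registry(registry = registry_dir)
  }
  invisible(NULL)
}
```

Called with the same arguments as p_attribute_encode(), the wrapper would make the manual reset step unnecessary. Passing `reload = FALSE` keeps the current behaviour.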

ablaette commented 8 months ago

This is a minimal example I use in another context, but it demonstrates that using the newly created corpus fails unless you call RcppCWB::cqp_reset_registry() first.

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE)

p_attribute_encode(
  token_stream = tokenstream[[1]],
  p_attribute = "word",
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  compress = FALSE,
  quietly = TRUE
)

# Without this call, the newly encoded corpus cannot be used in this session
RcppCWB::cqp_reset_registry(registry = regdir)

corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")