bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

external pointer is not valid - newly installed udpipe #48

Closed. cspenn closed this issue 5 years ago.

cspenn commented 5 years ago

R 3.6.0, RStudio, udpipe 0.8.2

Getting "external pointer is not valid" when running udpipe. Model file path is absolute, verified model file exists and is the correct size, 16.7MB.

modelfile <- "/users/cspenn/code/textminingdicts/english-ewt-ud-2.3-181115.udpipe"

## load english model
if (!exists("udmodel_en")) {
  udmodel_en <- udpipe_load_model(modelfile)
}

## first tokenize with udpipe

textdf$doc_id <- seq.int(nrow(textdf))

## split annotation function

# returns a data.table
annotate_splits <- function(x) {
  x <- as.data.table(udpipe_annotate(udmodel_en,
                                     x = x$content,
                                     doc_id = x$doc_id))
  return(x)
}

## run the splits
corpus_splitted <- split(textdf, seq(1, nrow(textdf), by = 100))

## run the multicore
annotation <- future_lapply(corpus_splitted, annotate_splits)

Error text is:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,  : 
  external pointer is not valid

Had no issues before installing udpipe fresh from CRAN after upgrading to R 3.6.0. Verified that the model exists in the environment:

[Screenshot, 2019-05-16]

jwijffels commented 5 years ago

This happens because a udpipe model is an Rcpp pointer to a model loaded from a file on disk. future_lapply starts parallel R processes, and the pointer is lost in each of these processes. To parallelise udpipe annotation, use the function udpipe with the argument model_dir, the path to the directory where you downloaded the models, so that each process loads the model itself. For you the model will be english-ewt instead of french-gsd.

library(udpipe)
library(data.table)
library(future.apply)
data(brussels_reviews, package = "udpipe")
textdf <- subset(brussels_reviews, language %in% "fr")
textdf <- data.frame(doc_id = textdf$id, text = textdf$feedback, stringsAsFactors = FALSE)

## Run in multicore
ncores <- 2L
## 'multiprocess' is deprecated in recent versions of future; multisession works on all platforms
plan(multisession, workers = ncores)

anno <- split(textdf, seq(1, nrow(textdf), by = 100))
anno <- future_lapply(anno, FUN=function(x) udpipe(x, "french-gsd", model_dir = "C:/Users/Jan/Dropbox/Work/Courses"))
anno <- rbindlist(anno)
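
Applied to the setup in the original report, the same idea can be sketched as follows: load the model inside the function that runs on each worker, so the pointer is created in the R process that uses it. This is a minimal sketch, not the package's recommended recipe. The helper names annotate_chunk and annotate_parallel are hypothetical, and the input is assumed to have the columns doc_id and content, as in the report.

```r
# Hypothetical helper: annotate one chunk, loading the model in the worker
# itself so the Rcpp pointer belongs to the process that uses it.
annotate_chunk <- function(chunk, model_path) {
  model <- udpipe::udpipe_load_model(file = model_path)
  anno <- udpipe::udpipe_annotate(model,
                                  x = chunk$content,
                                  doc_id = as.character(chunk$doc_id))
  as.data.frame(anno)
}

# Hypothetical wrapper: cut the data into consecutive chunks and annotate
# them in parallel R sessions. Assumes columns doc_id and content.
annotate_parallel <- function(textdf, model_path, chunk_size = 100, workers = 2L) {
  # Switch to a multisession plan and restore the previous plan on exit.
  old_plan <- future::plan(future::multisession, workers = workers)
  on.exit(future::plan(old_plan), add = TRUE)

  chunks <- split(textdf, ceiling(seq_len(nrow(textdf)) / chunk_size))
  results <- future.apply::future_lapply(chunks, annotate_chunk,
                                         model_path = model_path)
  do.call(rbind, results)
}

# Usage, with the model path from the report (requires the model file on disk):
# annotation <- annotate_parallel(
#   textdf,
#   "/users/cspenn/code/textminingdicts/english-ewt-ud-2.3-181115.udpipe"
# )
```

Only the file path, a plain character string, is sent to the workers, so nothing that depends on a pointer crosses the process boundary.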

Note that the model is loaded again for each split (the function udpipe calls udpipe_load_model), which takes some time. So do not set the by argument very low, as that creates many small splits and many model loads. Use a reasonable value, which depends on the size of your text. In the example above, 100 was used.
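
To see how the split size translates into the number of model loads, the small base R sketch below counts the chunks for two sizes. The helper chunk_rows and the toy data are hypothetical and only illustrate the trade-off. Unlike the seq(1, nrow(textdf), by = 100) factor, which split recycles across rows, it cuts the data into consecutive blocks.

```r
# Hypothetical helper: split a data frame into consecutive chunks of at most
# `chunk_size` rows each.
chunk_rows <- function(df, chunk_size) {
  stopifnot(chunk_size >= 1)
  split(df, ceiling(seq_len(nrow(df)) / chunk_size))
}

# Toy data standing in for a real corpus.
docs <- data.frame(doc_id = as.character(1:250),
                   text = rep("Un petit texte.", 250),
                   stringsAsFactors = FALSE)

sizes_100 <- sapply(chunk_rows(docs, 100), nrow)  # 3 chunks: 100, 100, 50 rows
sizes_10  <- sapply(chunk_rows(docs, 10), nrow)   # 25 chunks of 10 rows each

# Each chunk triggers one model load, so the chunk count is the load count.
length(sizes_100)
length(sizes_10)
```

With 250 documents, a chunk size of 100 means 3 model loads, while a chunk size of 10 means 25.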

cspenn commented 5 years ago

Thank you!