bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

udpipe_annotate in parallel #52

Closed kanishkamisra closed 5 years ago

kanishkamisra commented 5 years ago

Hi!

I am trying to get udpipe_annotate to work in parallel with the following code:

library(udpipe)
library(data.table)
library(future.apply)
library(janeaustenr)

ud_english <- udpipe_load_model("~/udpipe/english-ewt-ud-2.3-181115.udpipe")

plan(multiprocess, workers = 4L)
x <- janeaustenr::emma[1:1000]
anno <- split(x, seq(1, length(x), by = 50))
anno <- future_lapply(anno, FUN=function(z) udpipe_annotate(object = ud_english, z))

However, it gives me the following error:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,  : 
  external pointer is not valid 

If I do it with lapply, which uses only one core, there is no error and it returns all the CoNLL-U formatted parses. Does the udp_tokenise_tag_parse() function not work when called in parallel?

jwijffels commented 5 years ago

The udpipe R package, version 0.8.2, supports parallelisation when you use the udpipe function. See the vignette: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html

For your simple example that would be:

udpipe(x, "~/udpipe/english-ewt-ud-2.3-181115.udpipe", parallel.cores = 8)

kanishkamisra commented 5 years ago

Hi! Thanks for your reply. I was wondering if I could get udpipe_annotate() to run in parallel instead of udpipe(), since I don't want to store the sentences, which udpipe() repeats for every token in the sentence column. In fact, I only want the governor-dependent pairs and the dependency relation, which udpipe_annotate() provides :(

jwijffels commented 5 years ago

Nothing stops you from providing your own function, as shown in the last part of the vignette https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html. Do the annotation and keep only the columns that you like if you want to save space.
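A minimal sketch of the "keep only the columns that you like" step, assuming the annotation has already been converted with as.data.frame(). The helper name keep_dependency_columns is hypothetical, and the toy data frame only stands in for real udpipe output so that the snippet runs without a model; its column names follow those that as.data.frame() returns for a udpipe_annotate() result.

```r
# Hypothetical helper (not part of udpipe): drop the repeated `sentence`
# text and keep only what is needed for governor-dependent pairs.
keep_dependency_columns <- function(anno) {
  wanted <- c("doc_id", "sentence_id", "token_id", "token",
              "head_token_id", "dep_rel")
  # intersect() keeps the order of `wanted` and ignores absent columns
  anno[, intersect(wanted, colnames(anno)), drop = FALSE]
}

# Toy stand-in for as.data.frame(udpipe_annotate(...)); values are illustrative.
toy <- data.frame(
  doc_id        = "doc1",
  sentence_id   = 1L,
  sentence      = "Emma smiled.",
  token_id      = c("1", "2", "3"),
  token         = c("Emma", "smiled", "."),
  head_token_id = c("2", "0", "2"),
  dep_rel       = c("nsubj", "root", "punct"),
  stringsAsFactors = FALSE
)

deps <- keep_dependency_columns(toy)
```

Here head_token_id points at the governor's token_id within the same sentence, so the retained columns should be enough to rebuild the governor-dependent pairs.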

kanishkamisra commented 5 years ago

Yep so I switched udpipe to udpipe_annotate() as shown in my example with the future_lapply call and got the error:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,  : 
  external pointer is not valid 

I believe it might have something to do with udp_tokenise_tag_parse() not working in parallel, or maybe my call is wrong, but if I swap in any other function, future_lapply still works.

For now I can try not including the unwanted columns from udpipe() but I'd prefer if the sentences were not stored in the first place for large corpora (in my use case only)

jwijffels commented 5 years ago

Please use the example shown in the vignette https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html

jwijffels commented 5 years ago

@kanishkamisra have you understood from the vignette that udpipe_load_model needs to be called inside your FUN?
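A rough sketch of what that could look like with future_lapply, not the vignette's own code. A loaded model holds an external pointer, which is only valid in the R process that created it, so each worker loads the model itself. The names annotate_chunk, model_file and chunks are hypothetical, the example sentences are made up, and plan(multisession) is swapped in for the plan(multiprocess) of the question, which newer releases of the future package deprecate. The parallel part only runs if the model file from the question exists, and it needs the udpipe, future and future.apply packages installed.

```r
# Path taken from the question; adjust to wherever your model file lives.
model_path <- path.expand("~/udpipe/english-ewt-ud-2.3-181115.udpipe")

# Hypothetical worker: loads the model inside the worker process,
# annotates one chunk and returns only the dependency-related columns.
annotate_chunk <- function(chunk, model_file) {
  model <- udpipe::udpipe_load_model(model_file)
  anno  <- udpipe::udpipe_annotate(model, x = chunk, doc_id = names(chunk))
  anno  <- as.data.frame(anno)
  anno[, c("doc_id", "sentence_id", "token_id", "token",
           "head_token_id", "dep_rel")]
}

# Made-up input: 120 short texts, named so doc_id stays unique across chunks.
x <- rep(c("Emma was happy.", "She lived with her father."), times = 60)
names(x) <- paste0("doc", seq_along(x))

# Consecutive chunks of at most 50 texts each.
chunks <- split(x, ceiling(seq_along(x) / 50))

if (file.exists(model_path)) {
  future::plan(future::multisession, workers = 2L)
  parts <- future.apply::future_lapply(chunks, annotate_chunk,
                                       model_file = model_path)
  deps <- do.call(rbind, parts)
  future::plan(future::sequential)  # return to sequential processing
}
```

Because the model is loaded once per chunk in this sketch, fewer and larger chunks keep the loading overhead down.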

kanishkamisra commented 5 years ago

yep! thanks