PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Parallel processing using AnnotationPipeline$annotate() #19

Closed: ablaette closed this issue 3 years ago

ablaette commented 3 years ago

This is just a quick proof of concept of how calling the annotate() method on iterable objects might work:

library(rJava)
library(bignlp)

S <- StanfordCoreNLP$new(
  properties = corenlp_get_properties_file(lang = "de"),
  output_format = "json"
)

anno1 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java.lang.String", "Das ist ein Satz.")
)
anno2 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java.lang.String", "Das ist anderer Satz.")
)
anno3 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java.lang.String", "Das ist anderer Satz. Und noch ein Satz")
)

arr <- .jarray(list(anno1, anno2, anno3))
a <- .jnew("java.util.Arrays")$asList(arr)

# it <- .jnew("edu.stanford.nlp.util.Iterables")
# i <- it$chain(a)

props <- bignlp::properties(corenlp_get_properties_file(lang = "en", fast = TRUE))
s <- .jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)
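# `i` is only defined in the commented-out lines above, so pass the list `a` (an Iterable) directly
s$annotate(a)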

json_outputter <- .jnew("edu.stanford.nlp.pipeline.JSONOutputter")
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", anno3))

# or more low-level

pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.TokenizerAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$annotate(a)
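
Since the point of this issue is parallel processing, here is a minimal sketch of how the number of threads might be set explicitly. It assumes that the installed CoreNLP version exposes an `annotate(Iterable<Annotation>, int numThreads)` overload on `AnnotationPipeline`, and that loading bignlp puts the CoreNLP jars on the classpath. The names `texts`, `annos`, `annos_list` and `n_threads` are illustrative only.

```r
library(rJava)
library(bignlp) # assumed to start the JVM with the CoreNLP jars on the classpath

# Illustrative input; any character vector of documents would do.
texts <- c("Das ist ein Satz.", "Das ist ein anderer Satz.", "Und noch ein Satz.")

# Wrap every text in an Annotation object.
annos <- lapply(
  texts,
  function(txt) {
    .jnew("edu/stanford/nlp/pipeline/Annotation", .jnew("java.lang.String", txt))
  }
)

# Turn the R list into a java.util.List, which is an Iterable.
annos_list <- .jnew("java.util.Arrays")$asList(.jarray(annos))

# Tokenization and sentence splitting only, to keep the sketch light.
pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.TokenizerAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))

# Assumption: annotate(Iterable<Annotation>, int) exists in this CoreNLP version.
n_threads <- 2L
pl$annotate(annos_list, n_threads)

# The Annotation objects should have been annotated in place;
# print the string representation of the first one.
cat(annos[[1]]$toString(), "\n")
```

Building the list with `lapply()` rather than one variable per sentence should scale to a whole corpus, although all Annotation objects are then held in memory at once.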

ablaette commented 3 years ago

The package now includes the class AnnotationPipeline: https://github.com/PolMine/bignlp/blob/master/R/AnnotationPipeline.R

We should still get a better understanding of when memory limitations occur, but I think it is the fastest way to process a corpus so far.
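
As a starting point for understanding when memory limits are hit, the JVM heap can be inspected from R. This is a minimal sketch that assumes only rJava and the standard `java.lang.Runtime` methods. `jvm_heap_mb()` is a hypothetical helper, not part of bignlp, and the 4 GB heap size is an arbitrary illustrative value.

```r
# The heap size has to be set before the JVM starts, i.e. before rJava is loaded.
# If a JVM is already running in this session, this option has no effect.
options(java.parameters = "-Xmx4g") # illustrative value, adjust to the machine

library(rJava)
.jinit() # starts the JVM if it is not running yet

# Hypothetical helper: report used and maximum JVM heap in megabytes.
jvm_heap_mb <- function() {
  rt <- J("java.lang.Runtime")$getRuntime()
  mb <- 1024^2
  c(
    used = (rt$totalMemory() - rt$freeMemory()) / mb,
    max  = rt$maxMemory() / mb
  )
}

print(jvm_heap_mb())
```

Calling `jvm_heap_mb()` before and after `annotate()` on batches of increasing size might show roughly where the limit lies.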