The `processFiles()` method might be another approach worth pursuing, see this sketch ...
```r
options(java.parameters = "-Xmx4g")

library(polmineR)
library(rJava)

tagdir <- "/Users/andreasblaette/Lab/tmp/corenlp"

jvm_status <- rJava::.jinit(force.init = TRUE) # does it harm when called again?
stanford_path <- Sys.glob("/opt/stanford-corenlp/stanford-corenlp-4.2.0/*.jar")
rJava::.jaddClassPath(stanford_path)

# Write one plain text file per speaker as input for CoreNLP
speakers <- corpus("GERMAPARLMINI") %>%
  subset(interjection == "speech") %>%
  split(s_attribute = "speaker") %>%
  get_token_stream(collapse = " ", beautify = TRUE)

files <- lapply(
  seq_along(speakers),
  function(i){
    f <- file.path(tagdir, sprintf("%d.txt", i))
    writeLines(speakers[[i]], con = f)
    f
  }
)

# Collect all *.txt files in tagdir as a FileSequentialCollection
file_collection <- .jnew(
  "edu/stanford/nlp/io/FileSequentialCollection",
  .jnew("java/io/File", tagdir),
  .jnew("java/lang/String", "txt"),
  FALSE
)

properties <- list(
  "threads" = "6",
  "annotators" = "tokenize, ssplit, pos, lemma, ner",
  "tokenize.language" = "de",
  "tokenize.postProcessor" = "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor",
  "pos.model" = "edu/stanford/nlp/models/pos-tagger/german-ud.tagger",
  "ner.model" = "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz",
  "ner.applyNumericClassifiers" = "false",
  "ner.applyFineGrained" = "false",
  "ner.useSUTime" = "false",
  "ner.nthreads" = "6",
  # output settings
  "outputFormat" = "json",
  "outputDirectory" = "/Users/andreasblaette/Lab/tmp/corenlp/json"
)

props <- rJava::.jnew("java.util.Properties")
lapply(names(properties), function(property) props$put(property, properties[[property]]))

tagger <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)

system.time(
  tagger$processFiles(
    .jnew("java/lang/String", tagdir),
    file_collection,
    1L,                               # numThreads (see note below)
    FALSE,
    J("java/util/Optional")$empty()
  )
)
```
I see the code above as a proof of concept that another approach to parallelizing StanfordCoreNLP from R is possible.
A note to myself: to use CoreNLP's internal multithreading, it is crucial to set the `threads` property to the number of cores to be used. Both processing time and the output of the `top` command-line utility (column `#TH`) confirm that this property is what triggers Java multithreading; a minimal sketch follows.
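For illustration, a minimal sketch of setting the property programmatically, assuming rJava is loaded and the JVM has been initialised; deriving the thread count via `parallel::detectCores()` is an illustrative choice, not taken from the setup above:

```r
library(rJava)
rJava::.jinit()

# Derive the thread count from the machine rather than hard-coding it
# (illustrative choice, leaving one core free for the R session).
n_threads <- max(parallel::detectCores() - 1L, 1L)

props <- rJava::.jnew("java.util.Properties")
props$put("threads", as.character(n_threads))       # pipeline-level multithreading
props$put("ner.nthreads", as.character(n_threads))  # threads for the NER annotator
```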
The confusing part is that the `processFiles()` method accepts a `numThreads` argument, but I cannot see any effect of it. Apparently, the general setting in the properties is what matters. Good to know ...
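To make the observation concrete, a hedged sketch of the same call with the argument varied, reusing the `tagger`, `tagdir` and `file_collection` objects from the code above:

```r
# Same call as above, but with the numThreads argument raised; in the
# experiments described here this made no observable difference, whereas
# changing the "threads" entry in the Properties object did.
system.time(
  tagger$processFiles(
    .jnew("java/lang/String", tagdir),
    file_collection,
    6L,                               # numThreads: apparently without effect
    FALSE,
    J("java/util/Optional")$empty()
  )
)
```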
I continue working on this on the 'javamultithreading' branch: https://github.com/PolMine/bignlp/blob/javamultithreading/vignettes/multithreading_v2.Rmd
Once I have consolidated my understanding of how to use the multithreading capabilities of the StanfordCoreNLP class as effectively as possible, I will turn the experiments that are currently in the vignette into functions in the package.
The big remaining question is whether JSON output is the most efficient solution. CoNLL might be a real alternative; see the sketch below.
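As a rough sketch of what the CoNLL variant would look like, only the output-related properties change, assuming the same Properties-based setup as above; the output directory is a hypothetical path:

```r
library(rJava)
rJava::.jinit()

props <- rJava::.jnew("java.util.Properties")
props$put("annotators", "tokenize, ssplit, pos, lemma, ner")
props$put("outputFormat", "conll")  # CoreNLP's built-in outputters include "json", "conll", "xml" and "text"
props$put("outputDirectory", "/Users/andreasblaette/Lab/tmp/corenlp/conll")  # hypothetical target directory
```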
The latest version of bignlp on the javamultithreading branch now uses Java multithreading throughout. There are a few very specific issues not yet well understood, but as the general switch to Java multithreading seems to be a great step forward, I close this issue.