PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Use CoreNLP parallelization #14

Closed. ablaette closed this issue 3 years ago.

ablaette commented 3 years ago
# Build a German CoreNLP pipeline and annotate a sample sentence;
# the "threads" property tells CoreNLP how many worker threads to use.
props <- rJava::.jnew("java.util.Properties")
props$setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
props$setProperty("tokenize.language", "de")
props$setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/german-ud.tagger")
props$setProperty("ner.model", "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz")
props$setProperty("threads", "6")

tagger <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)
jsonifier <- rJava::.jnew("edu.stanford.nlp.pipeline.JSONOutputter")

system.time(
  anno <- rJava::.jcall(tagger, "Ledu/stanford/nlp/pipeline/Annotation;", "process", "Das ist ein Satz.")
)
json_string <- rJava::.jcall(jsonifier, "Ljava/lang/String;", "print", anno)
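
To inspect the result in R, the JSON string can be parsed directly. A minimal sketch, assuming the jsonlite package is available; the field names are those produced by CoreNLP's JSON outputter:

library(jsonlite)

# Parse the JSON string returned by the JSONOutputter
anno_list <- fromJSON(json_string)

# With jsonlite's simplification, each sentence carries a data.frame of tokens
tokens <- anno_list$sentences$tokens[[1]]
tokens[, c("word", "lemma", "pos", "ner")]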
ablaette commented 3 years ago

The `processFiles()` method might be another avenue worth pursuing, see this sketch ...

options(java.parameters = "-Xmx4g")

library(polmineR)
library(rJava)

tagdir <- "/Users/andreasblaette/Lab/tmp/corenlp"

jvm_status <- rJava::.jinit(force.init = TRUE) # does it harm when called again?
stanford_path <- Sys.glob("/opt/stanford-corenlp/stanford-corenlp-4.2.0/*.jar")
rJava::.jaddClassPath(stanford_path)

speakers <- corpus("GERMAPARLMINI") %>%
  subset(interjection == "speech") %>%
  split(s_attribute = "speaker") %>% 
  get_token_stream(collapse = " ", beautify = TRUE)

# Write one plain text file per speaker into the tagging directory
files <- lapply(
  seq_along(speakers),
  function(i){
    f <- file.path(tagdir, sprintf("%d.txt", i))
    writeLines(speakers[[i]], con = f)
    f
  }
)

# Collect all files with the "txt" suffix from the tagging directory
file_collection <- .jnew(
  "edu/stanford/nlp/io/FileSequentialCollection",
  .jnew("java/io/File", tagdir),
  .jnew("java/lang/String", "txt"),
  FALSE
)

properties <- list(
  "threads" = "6",
  "annotators" = "tokenize, ssplit, pos, lemma, ner",
  "tokenize.language" = "de",
  "tokenize.postProcessor" = "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor",
  "pos.model" = "edu/stanford/nlp/models/pos-tagger/german-ud.tagger",
  "ner.model" = "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz",
  "ner.applyNumericClassifiers" = "false",
  "ner.applyFineGrained" = "false",
  "ner.useSUTime" = "false",
  "ner.nthreads" = "6",

  # output settings

  "outputFormat" = "json",
  "outputDirectory" = "/Users/andreasblaette/Lab/tmp/corenlp/json"
)

# Transfer the R list into a java.util.Properties object and instantiate the pipeline
props <- rJava::.jnew("java.util.Properties")
lapply(names(properties), function(property) props$put(property, properties[[property]]))

tagger <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)

system.time(
  tagger$processFiles(
    .jnew("java/lang/String", tagdir),
    file_collection,
    1L,
    FALSE,
    J("java/util/Optional")$empty()
  )
)
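
After the run, one annotated file per input file should appear in the output directory configured above. A small sketch for reading the results back into R (again assuming the jsonlite package; the exact output file names depend on CoreNLP's defaults):

json_dir <- "/Users/andreasblaette/Lab/tmp/corenlp/json"
json_files <- list.files(json_dir, pattern = "\\.json$", full.names = TRUE)

# Parse one of the result files back into R
anno <- jsonlite::fromJSON(json_files[1])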
ablaette commented 3 years ago

I see the code above as a proof of concept that another approach to parallelizing StanfordCoreNLP from R is possible.

ablaette commented 3 years ago

A note to myself: To use CoreNLP's internal multithreading, it is crucial to set the "threads" property to the number of cores to be used. Both the processing time and the output of the top command-line utility (column #TH) confirm that this property is what triggers Java multithreading.

The confusing part is that there is a method `processFiles()` that accepts a `numThreads` argument, but I cannot see any effect of the `numThreads` argument. Apparently, the general setting in the properties is what matters. Good to know ...
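
Reduced to the essentials, this is the property-based setup that switches on the internal multithreading (a sketch, assuming the CoreNLP jars are already on the class path):

n_threads <- 6L  # number of cores CoreNLP may use

props <- rJava::.jnew("java.util.Properties")
props$setProperty("annotators", "tokenize, ssplit, pos")
props$setProperty("threads", as.character(n_threads))  # this property is the trigger for Java multithreading

pipeline <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)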

I continue work on this on the 'javamultithreading' branch: https://github.com/PolMine/bignlp/blob/javamultithreading/vignettes/multithreading_v2.Rmd

Once I have consolidated my understanding of how to use the multithreading capabilities of the StanfordCoreNLP class as effectively as possible, I will turn the experiments that are now in the vignette into functions in the package.

The big remaining question is whether JSON output is the most effective solution. CoNLL might be a real alternative.
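
Switching the output format would only require changing the respective properties; a sketch, assuming the installed CoreNLP version accepts "conll" as outputFormat (the target directory is hypothetical):

# Reuse the Properties object configured above, but request CoNLL output
props$setProperty("outputFormat", "conll")
props$setProperty("outputDirectory", "/Users/andreasblaette/Lab/tmp/corenlp/conll")  # hypothetical output directory
tagger_conll <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)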

ablaette commented 3 years ago

The latest version of bignlp on the javamultithreading branch now uses Java multithreading throughout. There are a few very specific issues not yet well understood, but as the general switch to Java multithreading seems to be a great step forward, I close this issue.