The `processFiles()` method might be another approach worth pursuing, see this sketch ...
```r
options(java.parameters = "-Xmx4g")

library(polmineR)
library(rJava)

tagdir <- "/Users/andreasblaette/Lab/tmp/corenlp"

jvm_status <- rJava::.jinit(force.init = TRUE) # does it harm when called again?
stanford_path <- Sys.glob("/opt/stanford-corenlp/stanford-corenlp-4.2.0/*.jar")
rJava::.jaddClassPath(stanford_path)

# Write one plain text file per speaker as input for CoreNLP
speakers <- corpus("GERMAPARLMINI") %>%
  subset(interjection == "speech") %>%
  split(s_attribute = "speaker") %>%
  get_token_stream(collapse = " ", beautify = TRUE)

files <- lapply(
  seq_along(speakers),
  function(i){
    f <- file.path(tagdir, sprintf("%d.txt", i))
    writeLines(speakers[[i]], con = f)
    f
  }
)

# Collect all *.txt files in tagdir as a FileSequentialCollection
file_collection <- .jnew(
  "edu/stanford/nlp/io/FileSequentialCollection",
  .jnew("java/io/File", tagdir),
  .jnew("java/lang/String", "txt"),
  FALSE
)

properties <- list(
  "threads" = "6",
  "annotators" = "tokenize, ssplit, pos, lemma, ner",
  "tokenize.language" = "de",
  "tokenize.postProcessor" = "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor",
  "pos.model" = "edu/stanford/nlp/models/pos-tagger/german-ud.tagger",
  "ner.model" = "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz",
  "ner.applyNumericClassifiers" = "false",
  "ner.applyFineGrained" = "false",
  "ner.useSUTime" = "false",
  "ner.nthreads" = "6",
  # output settings
  "outputFormat" = "json",
  "outputDirectory" = "/Users/andreasblaette/Lab/tmp/corenlp/json"
)

props <- rJava::.jnew("java.util.Properties")
lapply(names(properties), function(property) props$put(property, properties[[property]]))

tagger <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)

system.time(
  tagger$processFiles(
    .jnew("java/lang/String", tagdir),
    file_collection,
    1L,                               # numThreads (see note below)
    FALSE,
    J("java/util/Optional")$empty()
  )
)
```
I see the code above as a proof of concept that another approach to parallelizing StanfordCoreNLP from R is possible.
A note to myself: to use CoreNLP's internal multithreading, it is crucial to set the `threads` property to the number of cores to be used. Both processing time and the output of the `top` command-line utility (column `#TH`) confirm that this property is what triggers Java multithreading; a minimal sketch follows.
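For illustration, a minimal sketch of setting the property programmatically, assuming rJava is loaded and the JVM has been initialised; deriving the thread count via `parallel::detectCores()` is an illustrative choice, not taken from the setup above:

```r
library(rJava)
rJava::.jinit()

# Derive the thread count from the machine rather than hard-coding it
# (illustrative choice, leaving one core free for the R session).
n_threads <- max(parallel::detectCores() - 1L, 1L)

props <- rJava::.jnew("java.util.Properties")
props$put("threads", as.character(n_threads))       # pipeline-level multithreading
props$put("ner.nthreads", as.character(n_threads))  # threads for the NER annotator
```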
The confusing part is that the `processFiles()` method accepts a `numThreads` argument, but I cannot see any effect of it. Apparently, the general setting in the properties is what matters. Good to know ...
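To make the observation concrete, a hedged sketch of the same call with the argument varied, reusing the `tagger`, `tagdir` and `file_collection` objects from the code above:

```r
# Same call as above, but with the numThreads argument raised; in the
# experiments described here this made no observable difference, whereas
# changing the "threads" entry in the Properties object did.
system.time(
  tagger$processFiles(
    .jnew("java/lang/String", tagdir),
    file_collection,
    6L,                               # numThreads: apparently without effect
    FALSE,
    J("java/util/Optional")$empty()
  )
)
```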
I continue working on this on the 'javamultithreading' branch: https://github.com/PolMine/bignlp/blob/javamultithreading/vignettes/multithreading_v2.Rmd
Once I have consolidated my understanding of how to use the multithreading capabilities of the StanfordCoreNLP class as effectively as possible, I will turn the experiments that are currently in the vignette into functions in the package.
The big remaining question is whether JSON output is the most efficient solution. CoNLL might be a real alternative; see the sketch below.
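As a rough sketch of what the CoNLL variant would look like, only the output-related properties change, assuming the same Properties-based setup as above; the output directory is a hypothetical path:

```r
library(rJava)
rJava::.jinit()

props <- rJava::.jnew("java.util.Properties")
props$put("annotators", "tokenize, ssplit, pos, lemma, ner")
props$put("outputFormat", "conll")  # CoreNLP's built-in outputters include "json", "conll", "xml" and "text"
props$put("outputDirectory", "/Users/andreasblaette/Lab/tmp/corenlp/conll")  # hypothetical target directory
```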
The latest version of bignlp on the javamultithreading branch now uses Java multithreading throughout. There are a few very specific issues not yet well understood, but as the general switch to Java multithreading seems to be a great step forward, I close this issue.