ablaette opened 3 years ago
This is not yet a proof of concept that it will be possible to add annotation layers to text that has already been processed, but it conveys a sense that not much is missing:
library(rJava)
library(bignlp)

# remains unused - just ensures that CoreNLP is on the classpath
foo <- StanfordCoreNLP$new(
  properties = corenlp_get_properties_file(lang = "de"),
  output_format = "json"
)
# One CoreLabel per token of "Sehr geehrte Damen und Herren"
# (word, STTS part-of-speech tag, lemma)
tokens <- list(
  c("Sehr", "ADV", "sehr"),
  c("geehrte", "ADJA", "geehrt"),
  c("Damen", "NN", "Dame"),
  c("und", "KON", "und"),
  c("Herren", "NN", "Herr")
)
corelabel_list <- lapply(tokens, function(token) {
  .jnew(
    "edu/stanford/nlp/ling/CoreLabel",
    .jarray(c("word", "pos", "lemma")),
    .jarray(token)
  )
})
al <- .jnew("java.util.ArrayList")
lapply(corelabel_list, function(cl) al$add(cl))

# Nested classes are not accessible as fields via rJava, so the annotation
# key CoreAnnotations$TokensAnnotation is looked up with Class.forName()
anno <- .jnew("edu/stanford/nlp/pipeline/Annotation")
tokens_key <- .jcall(
  "java/lang/Class", "Ljava/lang/Class;", "forName",
  "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation"
)
anno$set(tokens_key, al)

# Inspect the JSON representation of the Annotation
json_outputter <- .jnew("edu.stanford.nlp.pipeline.JSONOutputter")
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", anno))

# Sentence splitting still throws an exception as long as the tokens
# lack offset information (see below)
pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$annotate(anno)
A new bignlp version I just pushed (dev branch, v0.1.0.9003) includes a new function as.Annotation() that returns a Java reference to an Annotation object derived from an input data.frame. The internals are based on the experimental code above.
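A minimal usage sketch (the column names as.Annotation() expects are my assumption, mirroring the experiment above):

df <- data.frame(
  word = c("Sehr", "geehrte", "Damen", "und", "Herren"),
  pos = c("ADV", "ADJA", "NN", "KON", "NN"),
  lemma = c("sehr", "geehrt", "Dame", "und", "Herr")
)
anno <- as.Annotation(df) # should yield a jobjRef to an edu.stanford.nlp.pipeline.Annotation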
This is now the proof of concept that tokenized, tabular data can be transferred to Java for further processing. Compared to the previous experiments, it was only necessary to add some further information (offset positions etc.) to the Java object to avoid an exception.
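To illustrate, a sketch of the kind of extra information that is required; this is my reconstruction, not necessarily exactly what as.Annotation() sets internally:

# Assign character offsets and token indices, assuming tokens are
# separated by a single space
offset <- 0L
for (i in seq_along(corelabel_list)) {
  cl <- corelabel_list[[i]]
  token_length <- nchar(cl$word())
  cl$setBeginPosition(offset)
  cl$setEndPosition(offset + token_length)
  cl$setIndex(i) # 1-based token index within the sentence
  offset <- offset + token_length + 1L
}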
One limitation is that instantiating every single CoreLabel object takes time. As this needs to be done for every single token, the process is really slow. So I do not think the new functionality is very useful yet. To speed things up, two approaches might be considered: (a) using the TSVSentenceIterator class to generate an Annotation faster, or (b) using the jdx R package to transfer data to the JVM; in that case it would be necessary to write a (small) Java class myself that turns tabular input data into CoreLabel objects within Java (see the sketch below).
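To illustrate option (b): jdx can move the whole table to the JVM in a single call, but the Java-side factory class is hypothetical and would still have to be written:

# Transfer the full data.frame to the JVM in one call
java_data <- jdx::convertToJava(df)
# Hypothetical Java helper turning the tabular data into a List<CoreLabel>;
# this class does not exist yet:
# corelabels <- J("org.example.CoreLabelFactory")$fromTable(java_data)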
Neither option is easy to implement. So before doing this, the big obvious alternative should be evaluated, which is to pass a whitespace-separated string to Java and to activate the property that a plain and simple whitespace tokenizer is used. This would also keep the sequence of tokens stable.
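A sketch of this alternative via the CoreNLP API (tokenize.whitespace and ssplit.eolonly are documented CoreNLP properties; restricting the annotators to tokenize and ssplit just keeps the example small):

# Pipeline that splits on whitespace only and treats each line as a sentence
props <- .jnew("java.util.Properties")
props$setProperty("annotators", "tokenize, ssplit")
props$setProperty("tokenize.whitespace", "true")
props$setProperty("ssplit.eolonly", "true")
pipeline <- .jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)
# The tokens are taken verbatim, so the sequence of tokens remains stable
anno <- pipeline$process("Sehr geehrte Damen und Herren")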
The vignette now includes a sufficiently detailed explanation of two workflows to add annotation layers to data that has already been tokenized. The only thing that is missing is a test with biggish data confirming that passing whitespace-delimited strings to Java, the recommended option for performance reasons, is indeed as robust as we hope it is.
Exploring the code of CoreNLP to gain a better understanding of how to process strings infused directly from R, I realized that the AnnotationPipeline class, though more low-level than StanfordCoreNLP, allows much more control over which annotators to use. In particular, you can process Annotation class objects (in parallel), and you can be very specific about the input data and the annotators. I am not yet able to spell out how it would work in detail, but it might be (should be) possible to infuse data such that you do not have to start the annotation from scratch.
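As a sketch of where this could go: an AnnotationPipeline assembled from hand-picked annotators could resume from a tokenized Annotation instead of starting from raw text (the German tagger model path is an assumption on my part):

# Resume annotation of an already-tokenized Annotation - sentence splitting
# and part-of-speech tagging, without re-tokenizing raw text
pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$addAnnotator(
  .jnew(
    "edu.stanford.nlp.pipeline.POSTaggerAnnotator",
    "edu/stanford/nlp/models/pos-tagger/german-ud.tagger", # assumed model path
    FALSE # verbose
  )
)
pl$annotate(anno)

AnnotationPipeline also has an annotate() variant that takes a collection of Annotation objects plus a thread count, which is presumably what makes parallel processing possible.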