PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Process data that has already been tokenized #17

Open ablaette opened 3 years ago

ablaette commented 3 years ago

Exploring the code of CoreNLP to gain a better understanding of how to process strings passed in directly from R, I realized that the AnnotationPipeline class, though more low-level than StanfordCoreNLP, allows much more control over which annotators to use. In particular, you can process Annotation class objects (in parallel), and you can be very specific about the input data and the annotators.

I am not yet able to spell out how it would work in detail, but it might be (should be) possible to inject data such that you do not have to start the annotation from scratch.

library(rJava)
library(bignlp)

# remains unused - just ensures that CoreNLP is on the classpath
foo <- StanfordCoreNLP$new(
  properties = corenlp_get_properties_file(lang = "de"),
  output_format = "json"
)

# Wrap raw strings into Annotation objects
anno1 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java/lang/String", "This is some text to annotate. And here comes another sentence!")
)

anno2 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java/lang/String", "We want to test whether this works. Would be great.")
)

# Collect the Annotation objects in a Java ArrayList
al <- .jnew("java.util.ArrayList")
al$add(anno1)
al$add(anno2)

# Assemble a minimal pipeline: tokenizer, sentence splitter, POS tagger
pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.TokenizerAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.POSTaggerAnnotator"))

# Run all annotators on the list of Annotation objects
pl$annotate(al)

# Serialize the annotation results to JSON
json_outputter <- .jnew("edu.stanford.nlp.pipeline.JSONOutputter")
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", al$get(0L)))
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", al$get(1L)))
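
AnnotationPipeline also declares an annotate(Iterable, int numThreads) overload, which is what makes processing a list of Annotation objects in parallel possible. A minimal sketch of an explicit two-thread call, assuming this overload is available in the CoreNLP version on the classpath:

# Cast to java.lang.Iterable so the reflective method lookup resolves the
# multi-threaded overload rather than annotate(Annotation)
pl$annotate(.jcast(al, "java.lang.Iterable"), 2L)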
ablaette commented 3 years ago

This is not yet the proof of concept that it will be possible to add annotation layers to text that has already been processed, but it conveys a sense that not much is missing:

library(rJava)
library(bignlp)

# remains unused - just ensures that CoreNLP is on the classpath
foo <- StanfordCoreNLP$new(
  properties = corenlp_get_properties_file(lang = "de"),
  output_format = "json"
)

# One CoreLabel per token, using the CoreLabel(String[] keys, String[] values)
# constructor; tabular input keeps the example readable
token_data <- data.frame(
  word = c("Sehr", "geehrte", "Damen", "und", "Herren"),
  pos = c("ADV", "ADJA", "NN", "KON", "NN"),
  lemma = c("sehr", "geehrt", "Dame", "und", "Herr"),
  stringsAsFactors = FALSE
)

corelabel_list <- lapply(
  seq_len(nrow(token_data)),
  function(i) .jnew(
    "edu/stanford/nlp/ling/CoreLabel",
    .jarray(c("word", "pos", "lemma")),
    .jarray(unlist(token_data[i, ], use.names = FALSE))
  )
)

# Transfer the CoreLabel objects to a Java ArrayList
al <- .jnew("java.util.ArrayList")
lapply(corelabel_list, function(cl) al$add(cl))

anno <- .jnew("edu/stanford/nlp/pipeline/Annotation")

# CoreAnnotations only has static members (its constructor is private), so
# the TokensAnnotation key class is looked up directly
tokens_key <- .jfindClass("edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation")
anno$set(tokens_key, .jcast(al, "java.lang.Object"))

json_outputter <- .jnew("edu.stanford.nlp.pipeline.JSONOutputter")
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", anno))

# With the tokens in place, sentence splitting can run on top of them
pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$annotate(anno)
ablaette commented 3 years ago

A new bignlp version I just pushed (dev branch, v0.1.0.9003) includes a new function as.Annotation() that returns a Java reference to an Annotation object derived from an input data.frame. The internals are based on the experimental code from the previous comments.

This is now the proof of concept that tokenized, tabular data can be transferred to Java for further processing. Beyond the previous experiments, it was only necessary to add some further information (offset positions etc.) to the Java object to avoid an exception.
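
For illustration, a hypothetical call - the column layout mirrors the experiments above, but the exact signature of as.Annotation() may differ from this guess:

library(bignlp)

token_data <- data.frame(
  word = c("Sehr", "geehrte", "Damen", "und", "Herren"),
  pos = c("ADV", "ADJA", "NN", "KON", "NN"),
  lemma = c("sehr", "geehrt", "Dame", "und", "Herr"),
  stringsAsFactors = FALSE
)

# Assumed usage - argument names are not taken from the documented API
anno <- as.Annotation(token_data)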

One limitation is that instantiating every single CoreLabel object takes time. As this needs to be done for every single token, the process is really slow, so I do not think the new functionality is very useful yet. To speed things up, two approaches might be considered: (a) using the TSVSentenceIterator class to generate an Annotation faster, or (b) using the jdx R package to transfer data to the JVM; the latter would require writing a (small) Java class (myself) that turns tabular input data into CoreLabel objects within Java (see the sketch below).
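
A sketch of option (b), under assumptions: jdx::convertToJava() moves the whole table into the JVM in one call (the exact Java collection type it produces depends on jdx's conversion rules), and CoreLabelFactory is a hypothetical helper class that would still have to be written in Java:

library(rJava)
library(jdx)

token_data <- data.frame(
  word = c("Sehr", "geehrte", "Damen", "und", "Herren"),
  pos = c("ADV", "ADJA", "NN", "KON", "NN"),
  lemma = c("sehr", "geehrt", "Dame", "und", "Herr"),
  stringsAsFactors = FALSE
)

# One JNI round trip for the whole table instead of one .jnew() per token
token_data_java <- convertToJava(token_data)

# Hypothetical Java-side helper (class name is illustrative): would turn
# the transferred table into a java.util.List of CoreLabel objects
# entirely within the JVM
corelabels <- J("org.polmine.bignlp.CoreLabelFactory")$fromTable(token_data_java)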

Neither option is easy to implement. So before going down this road, the big, obvious alternative should be evaluated: passing a whitespace-separated string to Java and activating the property that switches CoreNLP to a plain and simple whitespace tokenizer (tokenize.whitespace). This would also keep the sequence of tokens stable.
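
A minimal sketch of this alternative with plain rJava; tokenize.whitespace is a regular CoreNLP property, and the default models for the pos annotator are assumed to be on the classpath:

library(rJava)

props <- .jnew("java.util.Properties")
props$setProperty("annotators", "tokenize, ssplit, pos")
# Split on whitespace only, so the incoming token sequence is preserved
props$setProperty("tokenize.whitespace", "true")

pipeline <- .jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)

# Tokens joined by single spaces are reproduced one-to-one by the tokenizer
s <- paste(c("Sehr", "geehrte", "Damen", "und", "Herren"), collapse = " ")
anno <- pipeline$process(s)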

ablaette commented 3 years ago

The vignette now includes a sufficiently detailed explanation of two workflows for adding annotation layers to data that has already been tokenized. The only thing still missing is a test with biggish data confirming that passing whitespace-delimited strings to Java (the recommended option for performance reasons) is indeed as robust as we hope.
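
A sketch of what such a test could look like, reusing the whitespace tokenizer property from above; the token material and input size are, of course, arbitrary:

library(rJava)

props <- .jnew("java.util.Properties")
props$setProperty("annotators", "tokenize, ssplit")
props$setProperty("tokenize.whitespace", "true")
pipeline <- .jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)

# Biggish input: 100,000 pseudo-tokens joined by single spaces
tokens_in <- sample(c("Damen", "und", "Herren", "123", "..."), 1e5, replace = TRUE)
anno <- pipeline$process(paste(tokens_in, collapse = " "))

# Retrieve the token list and check that no token was lost or split
token_list <- .jcast(
  .jcall(
    anno, "Ljava/lang/Object;", "get",
    .jcast(.jfindClass("edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation"), "java.lang.Class")
  ),
  "java.util.List"
)
identical(token_list$size(), length(tokens_in))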