PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode
1 stars 1 forks source link

License: GPL
v3 R build
status codecov Project Status: Active – The project has reached a stable, usable
state and is being actively
developed. Lifecycle:
maturing

The ‘bignlp’-package: Objectives

R users who want to add a linguistic annotation to a corpus are already offered a few available packages for Natural Language Processing (NLP). These packages are not “pure R” NLP-tools. openNLP, cleanNLP, coreNLP, or spacyr offer interfaces to standard NLP tools implemented in other programming languages. The cleanNLP R package offers to combine these external tools in one coherent framework. So why yet another NLP R package?

As a matter of design, the existing packages are not good to deal with large volumes of text. The thrust of the bignlp-package is to use a standard tool (Stanford CoreNLP) in parallel mode. To be parsimonious with the memory available, it implements line-by-line processing, so that annotated data is not be kept in memory. The bignlp is the package you may want to use if your data is substantially big.

Workflow

There are three steps of the NLP pipeline using bignlp: Turn input data into tabular format, call the annotator that generates an NDJSON output, parse the NDJSON and get a tabular output format. To explain this in more detail:

  1. Input data (XML, for instance) needs to be dissected into a two-column data.table with one column with sections of text (column “text”) and a column “id”. The purpose of the id is to serve as a link to connect chunks with metadata that is stored somewhere else, typically in another table that also has an id column for matching purposes.

  2. The input data.table is processed in single- or multi-threaded mode, using Stanford CoreNLP. The output of the corenlp_annotate()-function is written to one or several NDJSON files (NDJSON stands for newline-delimited JSON). Each line of the NDJSON files is a valid JSON string with the annotation data, and includes the id.

  3. The NDJSON files are processed a line-by-line manner, resulting in a data.table with the chunk ids, and the tokenized and annotated text using the corenlp_parse_ndjson()-function.

Installation

The bignlp package itself is a pure R package. It does not include compiled code, and the installation of the package itself is unlikely to cause problem. However, bignlp depends on the Stanford CoreNLP toolset, a Java library, which requires a specific Java version (1.8). CoreNLP is accessed using the rJava package. Both can cause headaches.

Oracle Java required!

Stanford CoreNLP requires Oracle Java v1.8+, see the CoreNLP Website. Using CoreNLP in combination with OpenJDK, the default Java that is available on many systems, does not work. Errors that result from using the “wrong” Java may be difficult to interpret. So take care to ensure that you have the “correct”" Java (Java Development Kit / JDK) installed.

Oracle Java can be installed from this website. Note that Oracle licenses have become more restrictive, and registration may be necessary. This will not be a problem for academic/non-commercial contexts.

Once downloading and installing Oracle Java 1.8+ is completed, making this Java version the one that is actually used may be tricky. For macOS, we found the jEnv tool to be useful, see this link.

To get R use Java 1.8, you may finalle have to call R CMD javareconf from the command line.

R CMD javareconf

To check within R that everything works, run this snippet from within R.

library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.runtime.version")

The output you see (something like “1.8.0_211-b12”) on your system depend on the Java version you have installed.

Issues with rJava

An issue we have seen on macOS in combination with RStudio is that the rJava dynamic library called from the RStudio session does not link correctly agains the dynamic library of the Java virtual machine. We use the follwing snippet to induce the correct link. (Note that paths may have to be updated to reflect the correct R and Java version!)

install_name_tool -change "@rpath/libjvm.dylib" \
  "/Library/Java/JavaVirtualMachines/jdk1.8.0_211.jdk/Contents/Home/jre/lib/server/libjvm.dylib" \
  /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so

Issues with RStudio

We have seen problems with the parallel annotation with from an RStudio session. Running bignlp from a simple terminal window is a good solution. For a large annotation project, omitting an IDE may be a good idea anyway.