PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Memory Usage with large data #3

Closed ChristophLeonhardt closed 3 years ago

ChristophLeonhardt commented 5 years ago

Tagging a chunkdata file of 3.5 GB with one core and the following code:

CD$tokenstream <- chunk_table_split(chunkdata_file, output = NULL, n = no_cores, verbose = TRUE) %>%
  corenlp_annotate(threads = no_cores, byline = TRUE, progress = interactive()) %>%
  corenlp_parse_ndjson(cols_to_keep = c("id", p_attrs), output = tsv_file_tagged, threads = no_cores, progress = interactive()) %>%
  lapply(fread) %>%
  rbindlist()

results in R using about 22 GB of RAM, although the JVM is initialized with a 4 GB limit:

options(java.parameters = "-Xmx4g")

If the same operation is performed with more cores (i.e. the chunk file is split), each R process (one per core) uses around 22 GB of RAM.

Edit: I just checked the package version: this problem occurs with bignlp 0.5.0. There is a newer version (0.6.0), but I don't see how its changes would affect this behavior.

PolMine commented 5 years ago

Oops. Thanks for raising the issue. As I see it, the error occurs at the stage where the ndjson output of CoreNLP is turned into a tabular format. It results from chaining the methods in a pipe, which skips the line-by-line processing of large input documents at this stage. Admittedly, I provoked this issue by providing examples that use pipes. Pipes are nice and we may like their look and feel, but they are not a good approach for big, lengthy processes where you want to see precisely where a crash occurs.
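Just to illustrate the point, the pipe from the first post could be unrolled into explicit stages (a sketch only, reusing the objects defined there), so that each intermediate result is visible and a crash or a spike in memory usage can be attributed to a single step:

# Sketch: the pipe from above, unrolled into explicit stages.
chunks <- chunk_table_split(chunkdata_file, output = NULL, n = no_cores, verbose = TRUE)
ndjson <- corenlp_annotate(chunks, threads = no_cores, byline = TRUE, progress = interactive())
tagged <- corenlp_parse_ndjson(ndjson, cols_to_keep = c("id", p_attrs), output = tsv_file_tagged, threads = no_cores, progress = interactive())
CD$tokenstream <- rbindlist(lapply(tagged, fread)) # combine the tagged tsv files

The full walk-through below does exactly this, with explicit output files at every stage.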

Trying to prepare a fully reproducible example, I updated the package to fix some (minor) issues. The biggest intervention is an update of the CoreNLP version; we now rely on v3.9.2 (released on October 5, 2018). You can install the new version from GitHub.

devtools::install_github("PolMine/bignlp", ref = "dev")

The scenario we want to deal with is a huge corpus that we want to process in parallel because of its size. And we want to do this in a line-by-line manner, because at certain stages not everything fits into memory.

An important note in advance: This will not work in RStudio - please work in a stone-age terminal window.

The example I will use here is modified from the package vignette. In addition to bignlp, we will need the cwbtools and data.table packages.

library(bignlp)
library(cwbtools)
library(data.table)

Generally speaking, I would recommend keeping CoreNLP in a location outside the R package directory. The installation is explained in an annex of the package vignette. To tell bignlp where CoreNLP is located, set the respective option.

options(bignlp.corenlp_dir = "/opt/stanford-corenlp/stanford-corenlp-full-2018-10-05/")

For convenience, you can also use the installation function offered by the package. But note that whenever you update the bignlp package, the big jar files of CoreNLP are deleted, and downloading them again takes a while.

bignlp::corenlp_install(lang = "de")
options(bignlp.corenlp_dir = corenlp_get_jar_dir())
options(bignlp.properties_file = corenlp_get_properties_file(lang = "en", fast = TRUE))

In the example in the vignette, we increase the memory available to the Java virtual machine. We do it here too, just to be sure, but reserving memory for the virtual machines is actually taken care of in the forks.

options(java.parameters = "-Xmx4g") # needs to be set before a JVM is initialized.

We want to use as many cores as possible, leaving one for other business on our machine.

no_cores <- parallel::detectCores() - 1L

Now, annotating the corpus in parallel in a line-by-line manner requires us to take somewhat more care about the files that are used and generated at the different stages. Temporary files are created implicitly by the functions of the pipeline, but here it is better to do this explicitly.

outdir <- tempdir()
tsv_file <- file.path(outdir, "unga.tsv")
tsv_file_chunks <- file.path(outdir, sprintf("unga_%d.tsv", 1L:no_cores))
ndjson_files <- file.path(outdir, sprintf("unga_%d.ndjson", 1L:no_cores))
tsv_files_tagged <- file.path(outdir, sprintf("unga_tagged_%d.tsv", 1L:no_cores))

The line-by-line processing mode appends the output to existing files. Because bignlp is not yet totally robust, we delete files that may have remained from previous runs.

if (file.exists(tsv_file)) file.remove(tsv_file)
if (any(file.exists(tsv_file_chunks))) file.remove(tsv_file_chunks)
if (any(file.exists(ndjson_files))) file.remove(ndjson_files)
if (any(file.exists(tsv_files_tagged))) file.remove(tsv_files_tagged)

So let us begin. As a first step, we read in a few documents and generate a tsv file.

unga_xml_files <- list.files(system.file(package = "bignlp", "extdata", "xml"), full.names = TRUE)
CD <- CorpusData$new()
CD$import_xml(filenames = unga_xml_files)
fwrite(x = CD$chunktable, file = tsv_file, sep = "\t") # set sep to use tab as separator

We split up this big table into chunks for parallel processing.

tsv_files <- chunk_table_split(
  input = tsv_file,
  output = tsv_file_chunks,
  n = no_cores,
  verbose = interactive()
)

Now we let CoreNLP do the actual annotation (in parallel). There will be some (potentially confusing) messages about the annotators loaded by the different subprocesses.

ndjson_files <- corenlp_annotate(
  input = tsv_files,
  output = ndjson_files,
  threads = no_cores,
  byline = TRUE,
  method = "json",
  progress = interactive()
)

This is the step that did not work for you: The corenlp_parse_ndjson function is not yet nicely developed and documented. For reasons I cannot really explain, byline processing only works if the progress argument is FALSE. In this byline scenario, parallel processing is not yet implemented, and you will not see progress messages apart from a note on the file that is currently being processed. So be patient, and be confident that byline processing will handle big corpora without running out of memory.

tsv_files_tagged <- corenlp_parse_ndjson(
  input = ndjson_files,
  cols_to_keep = c("id", "sentence", "word", "lemma", "pos"),
  output = tsv_files_tagged, # multiple ndjson files
  threads = no_cores,
  progress = FALSE
)

The tabular format that is now on disk is sufficiently parsimonious that even a large corpus may fit into memory. So let us get the data again ...

CD$tokenstream <- rbindlist(lapply(tsv_files_tagged, fread))
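
A quick sanity check of the assembled token stream (plain data.table / base R, nothing bignlp-specific):

# Inspect the combined token stream.
nrow(CD$tokenstream)
head(CD$tokenstream)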

Hope it works for you. Please let me know whether it does.

ChristophLeonhardt commented 5 years ago

ndjson_files <- corenlp_annotate(
  input = tsv_files,
  output = ndjson_files,
  threads = no_cores,
  byline = TRUE,
  method = "json",
  progress = TRUE
)

With a sample of 1000 XML files and ten threads, processes start to die after a while:

total: 3% job 1: 3% job 2: 3% job 3: 3% job 4: 3% job 5: 3% job 6: 3% job 7: 2% job 8: 3% job 9: 3% job 10: 3%
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007afa00000, 8388608, 0) failed; error='Nicht genügend Hauptspeicher verfügbar' [not enough main memory available] (errno=12)

There is insufficient memory for the Java Runtime Environment to continue. Native memory allocation (mmap) failed to map 8388608 bytes for committing reserved memory. ... An error report file with more information is saved as: /.../xxx.log

This does not necessarily terminate the entire annotation process. Some other cores keep running until, some minutes later, a new error occurs:

Error in unserialize(f) : unknown input format
In addition: There were 50 or more warnings (use warnings() to see the first 50)

This ends the annotation altogether. I will check with fewer cores whether this behavior persists.

ablaette commented 4 years ago

I think I found the culprit: garbage collection is not triggered automatically; we have to do it manually. Please use the newest version of bignlp (branch logging) and see the following code.

options(java.parameters = "-Xmx4g") # needs to be set before a JVM is initialized.
noCores <- parallel::detectCores() - 2L

library(data.table)
library(cwbtools)
library(bignlp)

outdir <- tempdir()
tsv_file <- file.path(outdir, "unga.tsv")
ndjson_file <- file.path(outdir, "unga.ndjson")
tsv_file_tagged <- file.path(outdir, "unga_tagged.tsv")

options(bignlp.properties_file = bignlp::corenlp_get_properties_file(lang = "en", fast = TRUE))

unga_xml_files <- list.files(
  system.file(package = "bignlp", "extdata", "xml"),
  full.names = TRUE
)
CD <- CorpusData$new()
CD$import_xml(filenames = unga_xml_files)

corenlp_annotate(
  input = CD$chunktable,
  output = ndjson_file,
  progress = FALSE,
  logfile = NULL,
  report_interval = 10L,
  gc_interval = 100L,
  threads = 1L
)

If you run the last call a couple of times with different gc_interval settings, you will see the difference it makes.
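
For instance, a sketch of such a comparison, timing the same call with a few arbitrary gc_interval values (watch the memory footprint of the R process in top or htop while each run proceeds):

# Sketch: re-run the annotation with different gc_interval settings.
# The values below are arbitrary examples; adjust them to your data.
for (gci in c(10L, 100L, 1000L)) {
  # remove output from the previous run, as byline mode appends to existing files
  if (file.exists(ndjson_file)) file.remove(ndjson_file)
  elapsed <- system.time(
    corenlp_annotate(
      input = CD$chunktable,
      output = ndjson_file,
      progress = FALSE,
      logfile = NULL,
      report_interval = 10L,
      gc_interval = gci,
      threads = 1L
    )
  )["elapsed"]
  message("gc_interval = ", gci, ": ", round(elapsed, 1), " seconds")
}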

ablaette commented 3 years ago

I am quite sure that this issue is gone now that we use the internal Java parallelization.