PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

No Output for corenlp_parse_ndjson #4

Closed ChristophLeonhardt closed 3 years ago

ChristophLeonhardt commented 5 years ago

tsv_files_tagged <- corenlp_parse_ndjson(
  input = ndjson_files,
  cols_to_keep = c("id", p_attrs),
  output = tsv_files_tagged,
  threads = no_cores,
  byline = TRUE,
  progress = TRUE,
  verbose = TRUE # this doesn't change anything
)

When parsing 10 ndjson files of 10 gigabytes each, this code produces no output at all, neither in the console nor in RStudio. After seven minutes of running in RStudio, no output file had been produced either.

PolMine commented 5 years ago

Combining a progress bar with parallelization is very tricky. After running several experiments of my own and investing far too much time, I found the "jobstatus" package very promising (https://github.com/ropenscilabs/jobstatus). It relies on the "future" package. Working with futures may explain the strange behaviour that you report, but of course it is not the intended behaviour.

One of the major difficulties I had with the jobstatus package was keeping it as a suggested package, i.e. not importing it outright and loading it only when it is needed. An optional library call within a function causes a warning when running R CMD check. In the version you tried, I used attach to avoid library, but it does not seem to work, so I reverted to library. With a very dirty hack, I can avoid the warning (do.call("library", list("jobstatus"))). So please get the latest development version from the dev branch. I hope it works for now, but it is not a sound solution that CRAN would accept.
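
For comparison, a commonly used way to handle a suggested package is to test for it with requireNamespace() and to call its functions in the pkg::fun() form, so that no library() call is needed. The following is only a minimal sketch of that idiom, not what the dev branch does: has_suggested() and process_files() are hypothetical helpers, and no actual jobstatus function is called.

```r
# Hypothetical helper: TRUE if the package can be loaded, without attaching it.
has_suggested <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# Hypothetical wrapper: choose a progress-aware code path only if `pkg`
# is installed; otherwise fall back to running without a progress bar.
process_files <- function(files, pkg = "jobstatus") {
  if (has_suggested(pkg)) {
    message("Package ", pkg, " available, progress reporting possible")
    # Calls into the package would go here, written as pkg::fun().
  } else {
    message("Package ", pkg, " not installed, running without progress bar")
  }
  # Placeholder for the real work: return the base name of each file.
  vapply(files, basename, character(1), USE.NAMES = FALSE)
}

process_files(c("data/part1.ndjson", "data/part2.ndjson"))
```

Because the package is only loaded on demand, it can stay in the Suggests field and the function still works when the package is missing.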

The combination of the progress bar and parallelization is the problem. If my solution does not work, let us give up combining parallelization with a progress bar for the time being: please run corenlp_parse_ndjson() with progress = FALSE, and from a terminal, not RStudio.

ablaette commented 3 years ago

None of the experiments with jobstatus / futures produced stable results. As we now use Java multithreading, pursuing this approach is no longer necessary.