Closed ChristophLeonhardt closed 3 years ago
Combining a progress bar with parallelization is very tricky. After running several experiments of my own and investing far too much time, I found the "jobstatus" package very promising (https://github.com/ropenscilabs/jobstatus). It relies on the "future" package. Working with futures may explain the strange behaviour that you report, but of course it is not the intended behaviour.
One of the major difficulties I had with the jobstatus package was keeping it as a suggested package, so that it is not imported as a hard dependency but only loaded when it is needed. An optional library call within a function causes a warning when running R CMD check. In the version you tried, I used attach to avoid library, but that does not seem to work, so I reverted to library. With a very dirty hack, I can avoid the warning (do.call("library", list("jobstatus"))). So please get the latest development version from the dev branch. I hope it works for now, but it is not a sound solution that CRAN would accept.
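For comparison, the usual way to use a suggested package without calling library is to guard the call with requireNamespace() and look the function up in the package namespace. The following is only a minimal sketch: call_if_available is a hypothetical helper, not part of this package or of jobstatus, and the thread does not confirm whether jobstatus works without being attached.

```r
# Hypothetical helper (not from the thread): call `fn` from an optional
# package if that package is installed, otherwise run `fallback`.
call_if_available <- function(pkg, fn, ..., fallback = function(...) NULL) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # getExportedValue() is the programmatic equivalent of pkg::fn,
    # so the package is loaded but never attached to the search path.
    getExportedValue(pkg, fn)(...)
  } else {
    fallback(...)
  }
}

# "utils" ships with R, so the optional branch is taken here.
first_letters <- call_if_available("utils", "head", letters, 3)

# A package that is not installed triggers the fallback instead.
skipped <- call_if_available(
  "notARealPackage123", "anything",
  fallback = function(...) "skipped"
)
```

Because the package is only loaded and never attached, R CMD check has no library call to complain about. The cost is that every function has to be reached through the namespace.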
Having the progress bar in combination with parallelization is the problem. If my solution does not work, let us give up on combining parallelization and the progress bar for the time being. In that case, please run corenlp_parse_ndjson() with progress = FALSE, and from a terminal, not RStudio.
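To make sure a script is not started from RStudio by accident, a small check can be placed at its top. This is a sketch that assumes RStudio sets the environment variable RSTUDIO to "1" in its sessions; in_rstudio is a hypothetical helper name.

```r
# Hypothetical helper: TRUE when the R session was started by RStudio,
# assuming RStudio sets the environment variable RSTUDIO to "1".
in_rstudio <- function() {
  identical(Sys.getenv("RSTUDIO"), "1")
}

if (in_rstudio()) {
  message("Running inside RStudio; consider starting the script from a terminal.")
}
```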
None of the experiments with jobstatus / futures produced a stable result. As we now use Java multithreading, there is no need to pursue this approach any further.
tsv_files_tagged <- corenlp_parse_ndjson(
  input = ndjson_files,
  cols_to_keep = c("id", p_attrs),
  output = tsv_files_tagged,
  threads = no_cores,
  byline = TRUE,
  progress = TRUE,
  verbose = TRUE # this doesn't change anything
)
When parsing 10 ndjson files of 10 gigabytes each, this code does not produce any output, either in the console or in RStudio. After seven minutes of running in RStudio, no output file had been produced either.