PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

corenlp_annotate on logging branch without (or with irritating) progress indication #11

Closed: ChristophLeonhardt closed this issue 3 years ago

ChristophLeonhardt commented 4 years ago

If I understand correctly, you implemented a more robust parallelization workflow on the current "logging" branch by actively triggering the Java garbage collection. At first glance, this seems to limit memory usage per core effectively, which is good (no guarantees yet, as the data I used wasn't that large). However, I noticed that the actual annotation step, when used as follows in the console

ndjson_files <- corenlp_annotate(
  input = tsv_files,
  output = ndjson_files,
  threads = no_cores,
  byline = TRUE,
  method = "json",
  progress = FALSE
)

looks rather scary because the terminal "scrolls" through a seemingly endless number of invisible lines while calculating, so it looks as if it has crashed until the console prompt comes back after a while.

I didn't test progress = TRUE because this was a tricky option to use in earlier versions. But I think that with progress = FALSE the console output should be suppressed somehow, because the current behaviour is rather irritating.
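
To illustrate the kind of suppression meant here, the following is a minimal sketch in base R, not the way bignlp handles it. The helper run_quietly() and the stand-in function noisy() are hypothetical names made up for this example. It only silences output that passes through R's own connections (cat(), print(), message()); output that the JVM writes directly to the process's standard output would probably not be caught this way.

```r
# Hypothetical helper (not part of bignlp): evaluate an expression and,
# unless progress = TRUE, discard everything it prints or messages.
run_quietly <- function(expr, progress = FALSE) {
  # With progress = TRUE, evaluate normally and let all output through.
  if (progress) return(expr)
  result <- NULL
  # capture.output() diverts printed output; suppressMessages() drops
  # message() calls. The captured text is simply thrown away.
  invisible(utils::capture.output(result <- suppressMessages(expr)))
  result
}

# Stand-in for a chatty annotation call (assumption: real output may differ).
noisy <- function() {
  cat("line 1\n")
  message("working ...")
  42
}

value <- run_quietly(noisy())               # nothing appears in the console
loud  <- run_quietly(noisy(), progress = TRUE)  # output is shown as usual
```

A wrapper like this leaves the return value untouched, so a call such as corenlp_annotate(...) could in principle be passed as expr without changing what it returns.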

ablaette commented 3 years ago

Having moved to Java parallelization (javamultithreading branch), this is unlikely to persist as a problem.