future.lapply does not gc() automatically

matthiasgomolka commented 6 years ago

I need to convert ~ 3000 large csv files to fst. To do so, I use the following function:

csv2fst <- function(file_path) {
  fst_path <- str_replace_all(file_path, "csv", "fst")
  f_data   <- fread(file_path, integer64 = "double")

  write_fst(f_data, path = fst_path, compress = 100)

  rm(f_data, fst_path)
  gc()
}

To speed things up, I use future_lapply(): future_lapply(files, csv2fst). Here, files is a character vector of file paths.

In principle, this works fine, but if I do not include the rm() and gc() lines, the memory usage just grows and grows (to 128GB) until Windows shuts down RStudio. When I include rm() and gc(), the memory usage is stable at around 23GB.

Whenever I read about gc(), it was stated that I don't have to care about it, since R calls gc() automatically in the background. Is it a bug that I do need to call it manually here?

I run R 3.5.0 on Windows 7 x64 and future.apply 1.0.0.

HenrikBengtsson commented 6 years ago

Short answer: Correct, there is nothing in the future framework that explicitly triggers the garbage collector to run. I'm handing all GC-related tasks over to the R engine to decide - as officially suggested. There is no bug involved here.

I don't know much about how R's GC works internally, so I can't say much beyond this: ideally R should take care of it, but evidently that doesn't always work well. My best suggestion is to call gc() if you find that it helps.

Long answer: Each R process/worker runs its own garbage collector. Ideally, we just let R decide when to run the garbage collector, especially since running it takes time. You'll find that calling gc() too frequently slows down the overall performance. On the other hand, as you've discovered, it looks like R is not always that good at detecting when to run the GC, resulting in unnecessarily high memory usage. I don't know whether this is more of a problem on Windows or not. (When I was doing lots of long-running, large-scale analysis on Windows in the past, I often found myself calling gc() in the code to lower the memory footprint.) Adding to the above, the memory footprint scales with the number of parallel R processes you run. This means that R's "inefficiency" in running the GC is multiplied when running in parallel.
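To make the above concrete, here is a minimal base-R sketch of what an explicit gc() call reports inside a single R process. The helper name vcells_used and the vector size are illustrative assumptions; the point is only that gc() can reclaim memory once nothing references it anymore.

```r
# Illustration only: how rm() followed by gc() changes the memory R reports.
vcells_used <- function() {
  # gc() runs a collection and returns a matrix; the "Vcells" row
  # tracks the vector heap, counted in 8-byte cells.
  gc()["Vcells", "used"]
}

baseline <- vcells_used()
x <- numeric(1e7)          # roughly 80 MB of doubles
with_x <- vcells_used()    # x is still referenced, so it cannot be freed
rm(x)
after_rm <- vcells_used()  # the reference is gone, so the memory is reclaimed

c(baseline = baseline, with_x = with_x, after_rm = after_rm)
```

Each parallel worker is a separate R process, so a measurement like this would have to be taken inside the worker to say anything about that worker's memory.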

HenrikBengtsson commented 6 years ago

BTW, and unrelated to your issue, are you sure the CSV-to-FST conversion is limited by the CPU and not the disk I/O? Unless there's lots of expensive processing needed when converting from one format to another, my guess would be that the disk I/O is the bottleneck, and by parallelizing you'll just end up with multiple processes hitting the same disk, which might even slow down the overall processing time. Do you see a performance gain with 2, 3, ..., N parallel workers?

matthiasgomolka commented 6 years ago

Yes, I did a little benchmarking before and found the parallel version to be faster. Actually, I was surprised as well. I guess this mainly comes from the maximum compression when writing the fst files since fread already uses several cores.

HenrikBengtsson commented 6 years ago

Rereading your post, it occurs to me that you observed the memory issue when using:

csv2fst <- function(file_path) {
  fst_path <- str_replace_all(file_path, "csv", "fst")
  f_data   <- fread(file_path, integer64 = "double")

  write_fst(f_data, path = fst_path, compress = 100)
}

but you never really tried with:

csv2fst_rm <- function(file_path) {
  fst_path <- str_replace_all(file_path, "csv", "fst")
  f_data   <- fread(file_path, integer64 = "double")

  write_fst(f_data, path = fst_path, compress = 100)

  rm(f_data, fst_path)
}

When I first commented, I assumed you'd tried the latter as well, but I now believe that's not the case. The latter would also have worked, because it turns out that write_fst(x) returns x (*), which means that if you do:

y <- lapply(files, csv2fst)

you collect the data frames for all files, whereas with:

y <- lapply(files, csv2fst_rm)

you don't. The same goes for your workaround that calls rm() and gc(). But note, it's not gc() that helps you; it is the fact that you don't return what write_fst() returns. It also has nothing to do with future_lapply() per se.
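The effect can be reproduced without fst at all. In the sketch below, fake_write() is a hypothetical stand-in that returns its input invisibly, which is the behavior attributed to write_fst() above; make_data(), convert_keep() and convert_path() are likewise made-up names for illustration.

```r
# Hypothetical stand-in for fst::write_fst(): pretends to write and then
# returns its input invisibly.
fake_write <- function(x, path) {
  invisible(x)
}

make_data <- function(i) data.frame(id = i, value = rnorm(1e4))

# The last expression is the fake_write() call, so the whole data frame
# becomes the function's return value.
convert_keep <- function(i) {
  d <- make_data(i)
  fake_write(d, path = paste0("file", i, ".fst"))
}

# The last expression is the output path, so only a short string is returned.
convert_path <- function(i) {
  d <- make_data(i)
  out <- paste0("file", i, ".fst")
  fake_write(d, path = out)
  out
}

kept  <- lapply(1:20, convert_keep)  # a list of 20 data frames
paths <- lapply(1:20, convert_path)  # a list of 20 short strings

object.size(kept)   # grows with the amount of data processed
object.size(paths)  # stays small
```

With thousands of large files, the first variant accumulates every data frame in the result list, which would explain the steadily growing memory usage regardless of when the garbage collector runs.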

A natural alternative is to return the name of the FST file produced:

csv2fst <- function(file_path) {
  fst_path <- str_replace_all(file_path, "csv", "fst")
  f_data   <- fread(file_path, integer64 = "double")
  write_fst(f_data, path = fst_path, compress = 100)
  fst_path
}

So, the general recommendation out there of not putting gc() in code still stands.
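For an end-to-end picture of the "return the path" pattern, here is a sketch that runs with base R only. It is not the code from this thread: read.csv() stands in for fread(), saveRDS() for write_fst(), lapply() for future_lapply(), and the function name convert_one, the file names, and the .rds extension are all assumptions made for the example.

```r
# Sketch with base-R stand-ins; see the assumptions listed above.
convert_one <- function(csv_path) {
  out_path <- sub("\\.csv$", ".rds", csv_path)  # change only the extension
  dat <- read.csv(csv_path)
  saveRDS(dat, out_path)
  out_path  # return the path, not the data
}

# Create three small CSV files in a temporary directory.
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
csv_files <- file.path(demo_dir, sprintf("part%d.csv", 1:3))
for (f in csv_files) {
  write.csv(data.frame(x = 1:5, y = letters[1:5]), f, row.names = FALSE)
}

# The result is a small character vector of output paths.
out_files <- unlist(lapply(csv_files, convert_one))
all(file.exists(out_files))
```

Note that sub() with an anchored pattern changes only the file extension, whereas str_replace_all(file_path, "csv", "fst") replaces every occurrence of "csv" in the path, directory names included, which may or may not be what is wanted. To parallelize, the lapply() call could presumably be swapped for future.apply::future_lapply() after setting a plan such as future::plan(future::multisession).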

(*) This is actually documented in ?fst::write_fst, but a typo in fst 0.8.8 (now fixed) obscured this fact.

futureverse / future.apply

future.lapply does not gc() automatically #22