futureverse / future.apply

:rocket: R package: future.apply - Apply Function to Elements in Parallel using Futures
https://future.apply.futureverse.org

future.apply not working with fst package #23

Closed: matthiasgomolka closed this issue 6 years ago

matthiasgomolka commented 6 years ago

I have ~ 3000 fst files. These are organized as fst objects in the list fst_objs. I want to subset all of these objects using the following function:

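filter_select <- function(fst_obj, filter, selection) {
  # 'filter' is R code given as a string that refers to 'fst_obj'
  filter_eval <- eval(parse(text = filter))
  # Keep the matching rows and the selected columns
  fst_obj[filter_eval, selection]
}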

Using lapply(fst_objs, filter_select, filter, selection) where filter = 'fst_obj$INSTRUMENT == "DE0009652669"' and selection = 1:20 works fine and returns a list of small data.frames.

Replacing lapply() with future_lapply() fails with Error in .subset2(x, i, exact = exact) : subscript out of bounds, which is an error from fst().

I suspect this is related to the future package since a similar problem occurs with map() and future_map() from the furrr package. Parallel execution works with foreach.

HenrikBengtsson commented 6 years ago

My immediate guess is that the eval(parse(text = ...)) code makes it very hard for the automatic identification of globals to work, but there may well be a simple fix/workaround. Could you please provide a minimal toy example that reproduces the above?
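To illustrate the general concern (a minimal sketch, assuming the globals package is installed; the variable names a and b are made up for illustration): static code inspection can see symbols in an expression, but it cannot look inside a string that is only parsed at run time.

```r
library(globals)

# Globals in a plain expression are found by static code inspection
visible <- findGlobals(quote(a + b))

# The same code hidden in a string is opaque to the inspection:
# the symbols inside the string are never seen
hidden <- findGlobals(quote(eval(parse(text = "a + b"))))

print(visible)
print(hidden)
```

Here 'a' and 'b' show up in the first result but not in the second, which is why code built from strings may need its globals declared manually.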

HenrikBengtsson commented 6 years ago

In this case it has nothing to do with the usage of eval(parse(...)). A minimal example is:

## Create a list of two fst_table objects - adapted from example("fst")
library(fst)
path <- paste0(tempfile(), ".fst")
write_fst(iris, path)
ft <- fst(path)
fts <- list(ft, ft)

foo <- function(x) {
  keep <- eval(parse(text = "x$Sepal.Length < 5"))
  x[keep, ]
}

# Works
y0 <- lapply(fts, FUN = foo)

# Fails
y1 <- future.apply::future_lapply(fts, FUN = foo)
### Error in .subset2(x, i, exact = exact) : subscript out of bounds

with traceback:

> traceback()
13: (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, 
        i, exact = exact))(x, ..., exact = exact)
12: `[[.data.frame`(res$resTable, 1)
11: res$resTable[[1]]
10: data.table::setattr(res$resTable, "row.names", 1:length(res$resTable[[1]]))
9: read_fst(meta_info$path, j, old_format = .subset2(x, "old_format"))
8: `[.fst_table`(expr, keep)
7: expr[keep]
6: FUN(X[[i]], ...)
5: lapply(expr, FUN = findGlobals, envir = envir, ..., tweak = tweak, 
       dotdotdot = dotdotdot, substitute = FALSE, unlist = FALSE)
4: findGlobals(expr, envir = envir, ..., method = method, tweak = tweak, 
       substitute = FALSE, unlist = unlist)
3: globalsOf(expr, envir = envir, substitute = FALSE, tweak = tweak, 
       dotdotdot = "return", method = globals.method, unlist = TRUE, 
       mustExist = mustExist, recursive = TRUE)
2: getGlobalsAndPackages(X_ii, envir = envir, globals = TRUE)
1: future.apply::future_lapply(fts, FUN = foo)

This turns out to be a bug in the globals package (https://github.com/HenrikBengtsson/globals/issues/44). The future framework uses the globals package to identify global variables and packages that need to be exported and the latter currently chokes on fst::fst_table objects.
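As a rough illustration of that identification step (a sketch only, assuming the globals package is installed; the names a and f are hypothetical and not from the issue), globalsOf() collects the globals an expression depends on and packagesOf() reports the packages they come from:

```r
library(globals)

a <- 2                            # hypothetical global variable
f <- function(x) median(x) * a    # hypothetical function that uses it

# Find the globals needed to evaluate f(1:3); closures such as f are
# scanned recursively, so 'a' should be picked up as well
gs <- globalsOf(quote(f(1:3)), envir = environment())

print(names(gs))
print(packagesOf(gs))
```

The traceback above passes through the same globalsOf() and findGlobals() calls; the failure occurred while they were scanning the fst_table objects.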

There's really nothing to fix in the future.apply package, but I'll keep this issue open until fixed in globals and verified that the above code snippet works here.

HenrikBengtsson commented 6 years ago

This has now been fixed in the develop version of globals (0.12.1-9000). To install it, use:

remotes::install_github("HenrikBengtsson/globals@develop")

Now, we get:

## Create a list of two fst_table objects - adapted from example("fst")
library(fst)
path <- paste0(tempfile(), ".fst")
write_fst(iris, path)
ft <- fst(path)
fts <- list(ft, ft)

foo <- function(x) {
  keep <- eval(parse(text = "x$Sepal.Length < 5"))
  x[keep, ]
}

# Works
y0 <- lapply(fts, FUN = foo)

# Still fails, but with a different error
library(future.apply)
plan(multisession, workers = 2)
y1 <- future_lapply(fts, FUN = foo)
### Error in x[keep, ] : incorrect number of dimensions

This is a completely different error and indeed expected. What happens is that the future framework fails to identify that the 'fst' package needs to be loaded on the worker. This type of error is discussed in section 'Missing packages (false negatives)' of vignette 'A Future for R: Common Issues with Solutions'.
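The mechanism can be sketched with base R alone. The class name toy_table below is a hypothetical stand-in for fst_table, and the assumption is that a worker without fst loaded has no `[` method registered for the class, so subsetting falls back to the default behaviour for a plain list:

```r
# Hypothetical stand-in for an fst_table: a list with a custom class attribute
x <- structure(list(data = head(iris)), class = "toy_table")

# Without a `[` method for the class, R falls back to default list subsetting
msg <- tryCatch(x[1:3, ], error = function(e) conditionMessage(e))
print(msg)

# Once a method is available (for a package class, this happens when the
# package is loaded), the same call dispatches to it
`[.toy_table` <- function(x, i, j, ...) {
  df <- unclass(x)$data
  if (missing(j)) df[i, , drop = FALSE] else df[i, j, drop = FALSE]
}
res <- x[1:3, ]
print(nrow(res))
```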

To work around this, we need to use:

y1 <- future_lapply(fts, FUN = foo, future.packages = "fst")

and we indeed have that:

stopifnot(identical(y1, y0))

HenrikBengtsson commented 6 years ago

FYI, globals 0.12.2 that fixes this problem is rolling out on CRAN right now.
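
One way to check whether an installation already has that release (a small sketch; the variable name has_fix is made up for illustration):

```r
# TRUE if the installed globals is at least the release mentioned above
has_fix <- utils::packageVersion("globals") >= "0.12.2"
print(has_fix)
```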