DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/

Issue with indexing data.tables passed to future_map_* #182

Closed: thiesben closed this issue 3 years ago

thiesben commented 3 years ago

I've come across a bug when working with data.tables and furrr. Check out this reprex:

library(data.table)
library(furrr)

fun <- function(one, two){
  print(two)
  print(class(two)) # data.table data.frame
  # print(two[,a]) # <- Uncomment for "Error in `[.data.frame`(two, , a) : object 'a' not found"

  print(one)
  print(class(one))  # data.table data.frame
  # print(one[,y]) # same here

  # Now this:
  setDT(one)
  print(class(one)) # data.table data.frame
  print(one[,y]) # Prints correctly, no error!

  return(NULL)
}

df <- data.frame(x = c(1,2), y = c(1.2,3.4))
dt <- setDT(df)
dt[,y] # 1.2 3.4

input <- list(data.table(a = c(4211, 815)), data.table(a = c(007, 101)))

plan(multisession, workers = 2)
future_map(input, ~fun(dt, .x))

Uncommenting either of the column accesses in the function above fails with "Error in `[.data.frame`(two, , a) : object 'a' not found". However, after (redundantly!) calling setDT in the function, the same kind of access works without problems. I'm not sure which package this should be reported to; the behaviour is very weird.

This affects not only column indexing on data.tables, but also filtering and other data.table operations.

DavisVaughan commented 3 years ago

The issue here is the same as https://github.com/HenrikBengtsson/globals/issues/46 and won't be fixed by furrr.

The problem is that the underlying {globals} package, which looks for the globals and packages to "export" to your workers, can't find anything in your function that is specific to data.table...until you call setDT(). It isn't the act of "setting" the object as a data.table that fixes things. It's just the fact that the call is there, so now globals sees that data.table is a required package for that function to run. Without it, data.table isn't loaded on the worker and `[` falls back to the data.frame method, which is why the error mentions `[.data.frame`.
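If you want to check what a worker actually sees, something like the following rough sketch may help. The names probe, without_pkg and with_pkg are made up for illustration, and what without_pkg reports can depend on package versions and on whether a worker has already loaded data.table for an earlier future, so no output is shown for it.

```r
library(data.table)
library(furrr)

plan(multisession, workers = 2)

dt <- data.table(x = c(1, 2), y = c(1.2, 3.4))

# Hypothetical helper: it only inspects the worker session and calls
# no data.table functions, so globals has nothing data.table specific to find
probe <- function(x) {
  list(
    class = class(x),
    namespace_loaded = isNamespaceLoaded("data.table"),
    attached = "package:data.table" %in% search()
  )
}

# Worker state when nothing asks for data.table explicitly
without_pkg <- future_map(list(dt), probe)

# Worker state when data.table is requested through furrr_options()
with_pkg <- future_map(
  list(dt), probe,
  .options = furrr_options(packages = "data.table")
)

str(without_pkg)
str(with_pkg)

plan(sequential)
```

The class attribute travels with the object in both cases; the difference to look for is in the namespace_loaded and attached fields.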

The easiest way to fix this is to require data.table to be loaded on the workers with furrr_options(packages = "data.table"):

library(data.table)
library(furrr)

# nothing in here is "data.table specific"
fun1 <- function(x) {
  x[,y] 
}

fun2 <- function(x) {
  # do something stupid that clearly requires data table
  data.table(1)

  x[,y] 
}

df <- data.frame(x = c(1,2), y = c(1.2,3.4))
dt <- setDT(df)

lst <- list(dt)

plan(multisession, workers = 2)

future_map(lst, fun1)
#> Error in `[.data.frame`(x, , y): object 'y' not found

future_map(lst, fun2)
#> [[1]]
#> [1] 1.2 3.4

future_map(lst, fun1, .options = furrr_options(packages = "data.table"))
#> [[1]]
#> [1] 1.2 3.4
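
Applied to the original reprex, the same option would look roughly like the sketch below. Here fun is a trimmed-down stand-in for the function in the report (no setDT() call, and it returns values instead of printing them), and opts and res are names chosen for illustration. It covers both column access and filtering, since both were reported as affected.

```r
library(data.table)
library(furrr)

# Simplified stand-in for the function in the report, without setDT()
fun <- function(one, two) {
  list(
    y = one[, y],           # column access by bare name
    big = two[a > 100, a]   # filtering in i, then selecting a column
  )
}

dt <- data.table(x = c(1, 2), y = c(1.2, 3.4))
input <- list(data.table(a = c(4211, 815)), data.table(a = c(7, 101)))

plan(multisession, workers = 2)

# Ask furrr to attach data.table on every worker
opts <- furrr_options(packages = "data.table")
res <- future_map(input, ~ fun(dt, .x), .options = opts)

str(res)

plan(sequential)
```

Setting the option once in opts and reusing it for every future_map_* call keeps the workaround in one place.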