futureverse / future.apply

:rocket: R package: future.apply - Apply Function to Elements in Parallel using Futures
https://future.apply.futureverse.org
211 stars 16 forks source link

Possible bug in loading packages #42

Open arunsrinivasan opened 5 years ago

arunsrinivasan commented 5 years ago

This was a bit hard to track down... Here goes.

data.table and bit both have setattr functions. But the annoying thing about bit:setattr is that it returns NULL. data.table::setattr returns the input object with the attribute modified invisibly. So, if you were to write a code like:

y <- setattr(18004L, "class", "Date") # today's date
# [1] "2019-04-18"

(I'm not saying this is how one should go about it, but there are other cases where we need to set an attribute and assign the result to an object.)

In this case, depending on what setattr we've, it'll return the right expected result or NULL.

With this, consider this code:

require(parallel)
require(doSNOW)
require(foreach)
require(future)
require(future.apply)
require(data.table)
foo <- function(dt) {
  cat(sprintf("[%s] %s", Sys.getpid(), capture.output(environment(setattr))), sep="\n")
  setattr(dt, "key", "val")
}
nodes <- 5L
cl <- future::makeClusterPSOCK(nodes)
plan(cluster, workers=cl, persistent=TRUE)
dt <- data.table(x=1, y=2)

ans <- values(
  lapply(seq_len(nodes), function(node) {
    future({foo(dt)}, packages=c("bit64", "data.table"))
  })
)
# [15468] <environment: namespace:data.table>
# [9808] <environment: namespace:data.table>
# [15016] <environment: namespace:data.table>
# [23496] <environment: namespace:data.table>
# [22312] <environment: namespace:data.table>

I've created a data.table in the local environment (just 1 for simplicity) and am calling a function that sets the attribute in parallel. Of course this function is terribly simplified as well.

Now, the way this works (as expected), is to first load bit64 first and data.table next and then look for functions in foo and possibly load more packages and then run foo() (AFAICT).

And this works fine as you can see from the output. If you were to check ans, you'd get the data.table with their attributes set.


Restart session (IMPORTANT). Now, with everything else remaining intact, if instead of running values(...), I use future_lapply:

ans <- future_lapply(cl, function(node) foo(dt), future.packages=c("bit64", "data.table"))
> ans <- future_lapply(cl, function(node) foo(dt), future.packages=c("bit64", "data.table"))
# [11976] <environment: namespace:bit>
# [23212] <environment: namespace:bit>
# [11732] <environment: namespace:bit>
# [22964] <environment: namespace:bit>
# [5748] <environment: namespace:bit>
> ans
# [[1]]
# NULL
# 
# [[2]]
# NULL
# 
# [[3]]
# NULL
# 
# [[4]]
# NULL
# 
# [[5]]
# NULL

Note how setattr refers to bit package.. I think this is because the packages get loaded after assessing setattr is used in foo and data.table from local env has a function called setattr and therefore gets loaded first followed by all packages in future.packages?

HenrikBengtsson commented 5 years ago

Thxs and sorry for the slow reply.

Yes, I can see how this can happen for PSOCK clusters, e.g. plan(cluster, ...) and plan(multisession). The problem is, as you suggest, that the search() path is different in the worker compared to the main R session (e.g. with plan(sequential)). What complicates it, is that something the search() path gets updated automatically by some other parallel/future call, i.e. it's not always obvious how, why, and when.

To really fix this for this parallel backend, one would need to come up with a way for each future to re-arrange the search() path on the worker(s) to be the same as that of the main R process. That's an issue to be solved in the future package.

What I think you're also reporting here is an inconsistency between future.apply and rolling your own version via plain futures. If so, I'm not sure why there is a difference, but it could be that packages to be attached are also automatically identified and has higher precedence than those you specify manually. Maybe this inconsistency can/should be fixed.

HenrikBengtsson commented 5 years ago

PS. You wanna use library() - not require(). And I don't think your example need to attach none of parallel, doSNOW, and foreach.