HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org
946 stars 82 forks source link

`source()` not masking symbols #575

Closed renkun-ken closed 1 year ago

renkun-ken commented 2 years ago

source() does not seem to work consistently inside the future expression with workers=1 and workers>1 for non-function variables in the sourced file. Following is a minimal example:

test.R:

library(data.table)
library(future)

plan(multisession, workers = 1)
res <- future.apply::future_lapply(1:2, function(date) {
  source("./future-source1.R")
  .N
})
res

Following is future-source1.R where fread (previously a function data.table::fread) and .N (previously a symbol data.table::.N which is NULL) are redefined.

fread <- function(text = NULL) {
  "source works inside this future expression"
}

.N <- 100

Then running test.R produces the following result:

[[1]]
NULL

[[2]]
NULL

which indicates that source() does not work as expected since we still get data.table::.N instead of .N <- 100.

However, if we replace returning .N with a fread(...) call, then it works as expected:

[[1]]
[1] "source works inside this future expression"

[[2]]
[1] "source works inside this future expression"

If we replace the number of workers with more than 1, then source always works.

One walk around for now is to use source(..., local = TRUE) but it would definitely be nice to have this fixed for consistency. Also, using future.callr::callr as the backend works too.

renkun-ken commented 2 years ago

However, using future.callr::callr as the backend is not perfect. Although the following case shows that the .N is successfully masked in the future,

library(data.table)
library(future)
source("./future-source1.R")

plan(future.callr::callr, workers = 1, gc = TRUE, earlySignal = TRUE)
res <- future.apply::future_lapply(1:2, function(date) {
  .N
})

res
[[1]]
[1] 100

[[2]]
[1] 100

The following example still fails:

library(data.table)
library(future)
source("./future-source1.R")

plan(future.callr::callr, workers = 1, gc = TRUE, earlySignal = TRUE)
res <- future.apply::future_lapply(1:5, function(date) {
  address(date)
  .N
})

res
[[1]]
NULL

[[2]]
NULL

It seems as long as the future contains a function call from the package, in this case, data.table::address(), the .N will not be masked. This occurs no matter how many workers are specified.

HenrikBengtsson commented 2 years ago

The long story has to wait until after the holidays, but the short story:

source() is peculiar in the sense that it's default source(..., local = FALSE) because it evaluates in the global environments. Evaluating and assigning to the global environment is never a good idea, unless you call source() from the prompt, when you're basically already in the global environment. So, yes, I'd say, source(..., local = TRUE) is always better (regardless of parallel processing or not). I would not use local = FALSE. I'll add this to the to-do to document.

Also, plan(multisession, workers = 1) falls back to plan(sequential), which explains that difference. If you want a single background worker, plan(cluster, workers = 1) provides that.

HenrikBengtsson commented 2 years ago

For the other things, there might be some funny things going on with identification of globals; I'll look into that later.

HenrikBengtsson commented 1 year ago

I've finally got around to look at this more. The odd behavior is a combination of using the default source(..., local = FALSE) and the fact that the global environment affects parallel backends and sequential processing differently. The latter is on the roadmap to harmonize (it has to do with the long outstanding feature request for caching globals, sticky globals, and other things; each update moves one step closer to fix that).

So the following is as expected when using sequential processing:

library(data.table)
library(future)
plan(sequential)

cat(file = "./future-source1.R", ".N <- 100")

## Case 1: The future picks up data.table:::.N = NULL as a global
## find(".N")  => "package:data.table"
message("Case 1")
rm(list = c("fread", ".N"))
f <- future({
  str(.N)  # global .N
  .N
})
stopifnot(".N" %in% names(f$globals))
v <- value(f)
# NULL
stopifnot(is.null(v))

## Case 2: Similar to Case 1, but we now use source() with the
## default local = FALSE
message("Case 2")
rm(list = c("fread", ".N"))
f <- future({
  str(.N)  # global .N
  source("./future-source1.R")
  str(.N)  # still global .N, because source(... local = FALSE) 
  .N
})
stopifnot(".N" %in% names(f$globals))
v <- value(f)
# NULL
# NULL
stopifnot(is.null(v))

## Case 3: Similar to Case 2, but we now use source() with local = TRUE
message("Case 3")
rm(list = c("fread", ".N"))
f <- future({
  str(.N)  # global .N
  source("./future-source1.R", local = TRUE)
  str(.N)  # local .N, because source(... local = TRUE) 
  .N
})
stopifnot(".N" %in% names(f$globals))
v <- value(f)
# NULL
# num 100
stopifnot(v == 100)

However, what is odd is that if we take Case 2 and add a dummy fread (e.g. at the top of the future expression), it causes .N to be dropped from the future globals, e.g.

## Case 4: Similar to Case 1, but with a dummy global 'fread()'
message("Case 4")
rm(list = c("fread", ".N"))
f <- future({
  fread
  str(.N)  # Weird; .N is not part of the globals, but find(".N")
  .N
})
## stopifnot(".N" %in% names(f$globals))  ## ODD
v <- value(f)
# NULL
HenrikBengtsson commented 1 year ago

The recommendation of using source(..., local = TRUE) is now documented in https://future.futureverse.org/articles/future-4-issues.html#using-source-in-a-future

HenrikBengtsson commented 1 year ago

However, what is odd is that if we take Case 2 and add a dummy fread (e.g. at the top of the future expression), it causes .N to be dropped from the future globals, e.g.

## Case 4: Similar to Case 1, but with a dummy global 'fread()'
message("Case 4")
rm(list = c("fread", ".N"))
f <- future({
  fread
  str(.N)  # Weird; .N is not part of the globals, but find(".N")
  .N
})
## stopifnot(".N" %in% names(f$globals))  ## ODD
v <- value(f)
# NULL

Ah, slow morning and a slow brain here: Of course .N is not a global variable here, because it's an exported object of data.table and therefore not part of the globals;

library(data.table)
library(future)
f <- future::future({
  fread
  .N
})
message(sprintf("Globals: [n=%d] %s", length(f$globals), paste(sQuote(names(f$globals)), collapse = ", ")))
message(sprintf("Packages: [n=%d] %s", length(f$packages), paste(sQuote(f$packages), collapse = ", ")))
#> Globals: [n=0]
#> Packages: [n=1] 'data.table'

What's peculiar though is that we get:

library(data.table)
library(future)
f <- future::future({
  .N
})
message(sprintf("Globals: [n=%d] %s", length(f$globals), paste(sQuote(names(f$globals)), collapse = ", ")))
message(sprintf("Packages: [n=%d] %s", length(f$packages), paste(sQuote(f$packages), collapse = ", ")))
#> Globals: [n=1] '.N'
#> Packages: [n=0] 

This should be no different from the above case. There's something funny going on with future::getGlobalsAndPackages() in the second case:

> str(future::getGlobalsAndPackages(quote({ fread; .N }), globals = TRUE)[c("globals", "packages")])
List of 2
 $ globals : Named list()
  ..- attr(*, "where")= Named list()
  ..- attr(*, "class")= chr [1:3] "FutureGlobals" "Globals" "list"
  ..- attr(*, "resolved")= logi TRUE
  ..- attr(*, "total_size")= num 0
 $ packages: chr "data.table"

> str(future::getGlobalsAndPackages(quote({ .N }), globals = TRUE)[c("globals", "packages")])
List of 2
 $ globals :List of 1
  ..$ .N: NULL
  ..- attr(*, "where")=List of 1
  .. ..$ .N:<environment: R_EmptyEnv> 
  ..- attr(*, "class")= chr [1:3] "FutureGlobals" "Globals" "list"
  ..- attr(*, "resolved")= logi FALSE
  ..- attr(*, "total_size")= num 0
 $ packages: chr(0) 
HenrikBengtsson commented 1 year ago

The latter (https://github.com/HenrikBengtsson/future/issues/575#issuecomment-1229443818) was due to a bug in globals::packagesOf(), which has been fixed for the next release of globals.

Thanks for reporting on this.