HenrikBengtsson / doFuture

:rocket: R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org
84 stars 6 forks source link

.noexport argument is ignored #56

Closed koenniem closed 3 years ago

koenniem commented 3 years ago

As far as I can see, the .noexport argument of foreach::foreach is ignored by doFuture.

This issue was brought to my attention when it automatically tried to export a variable called data in my global environment to a function which uses a variable called data (but is not the same). While you could technically let it export it without serious consequences, this variable is 700Mb and is thus inefficient. Using the .noexport = "data" did not have any effect when using the doFuture adapter, but does work with doParallel and doSNOW.

I have no clue how to fix it, but in the code of doFuture and globals I can see .noexport is taken as an argument but never used elsewhere.

HenrikBengtsson commented 3 years ago

Thanks for reporting. Can you please provide a minimal reproducible example for this that I can work with?

koenniem commented 3 years ago

Sure thing.

Consider we have two functions:

foreach_fun <- function() {

    foreach::foreach(i = 1:10, .noexport = "data") %dopar% {

        data <- letters[1:10]

        collapse(data)
    }
}

collapse <- function(data) {
    data <- paste0(data, collapse = ",")
    data
}

registerDoFuture()
plan(future::multisession)
foreach_fun()
plan(sequential)

And this works fine. Now, suppose we have a variable in our global environment called data for an entirely unrelated reason.

# Make a very large data frame, 763 MB in size
# Not used for anything
data <- as.data.frame(matrix(nrow = 1e6, ncol = 200))

If you run foreach_fun again, you get the following error

Error in getGlobalsAndPackages(expr, envir = globals_envir, globals = TRUE) : The total size of the 3 globals that need to be exported for the future expression (‘{; doFuture::registerDoFuture(); lapply(seq_along(...future.x_ii), FUN = function(jj) {; ...future.x_jj <- ...future.x_ii[[jj]]; i <- NULL; ...future.env <- environment(); ...; }, error = identity); }); }’) is 762.95 MiB. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). There are three globals: ‘data’ (762.95 MiB of class ‘list’), ‘collapse’ (5.81 KiB of class ‘function’) and ‘...future.x_ii’ (0 bytes of class ‘NULL’).

For reasons that are beyond me, doFuture tries to export data as well because it sees that it is used by collapse, even though this is not the same variable of course. In the process, the .noexport argument is ignored even though it should prevent data from being exported. Moreover, export collapse explicitly using .export does not prevent this either.

doParallel and doSNOW

Two adapters, doParallel and doSNOW don't have this problem, probably because they look for variables to export automatically like doFuture does.

cl <- parallel::makeCluster(parallel::detectCores())
doParallel::registerDoParallel(cl)
# doSNOW::registerDoSNOW(cl) # also works

foreach_fun <- function() {

    foreach::foreach(i = 1:10, .export = "collapse") %dopar% {

        data <- letters[1:10]

        collapse(data)
    }
}

foreach_fun()

doParallel::stopImplicitCluster()
registerDoSEQ()
parallel::stopCluster(cl)

Why?

My intuition is that .noexport is ignored by doFuture. For example, in the function getGlobalsAndPackages_doFuture in both globals.R and doFuture.R, the .noexport function is taken as an argument but then nothing is done with it.

HenrikBengtsson commented 3 years ago

Thanks again for reporting on this. Indeed, the doFuture adapter completely ignored the .noexport argument. This has been fixed in the develop branch, cf. https://github.com/HenrikBengtsson/doFuture/commit/974e61fb5bcfefe933c9a7262c5160cb337b7f25. Until next release is on CRAN, you can grab the develop branch version using:

remotes::install_github("HenrikBengtsson/doFuture", ref="develop")

Now, it turns out there's another problem too unrelated to the doFuture package. The data object is incorrectly identified as a global variable, which it shouldn't. This needs to be fixed in the globals package - tracking it in https://github.com/HenrikBengtsson/globals/issues/69. When fixed there, you should no longer have to turn to .noexport as a workaround. The bug there is that despite argument data, the globals package still thinks data in the function body is a global variable;

collapse <- function(data) {
    data <- paste0(data, collapse = ",")
    data
}

If you switch to, say,

collapse <- function(data) {
    data2 <- paste0(data, collapse = ",")
    data2
}

then data is no longer picked up as a global variable and you don't need that .noexport = "data". There have been a few other globals reports related to this problem in the past.

HenrikBengtsson commented 3 years ago

FYI, globals 0.14.0 on CRAN. Fixes the problem where data is incorrectly picked up as a global and exported. So, with the new globals version there's no need for .noexport.

HenrikBengtsson commented 3 years ago

doFuture 0.11.0, which acknowledges argument .noexport, is now on CRAN.