futureverse / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Multisession Evaluation #206

Closed gzagatti closed 6 years ago

gzagatti commented 6 years ago

I am developing my own package which makes use of a future call. Inside the call, I use functions from the package I am developing, so something like this:

...
foo <- future::future({
    mypkg::myfun(x, y)
})

I call the function without appending mypkg::, which above is just to make it explicit. The program might call other functions that are located in other files of the package.

When running a multiprocess plan, the program runs without any problem and I get gains from parallelizing my process. However, when switching to multisession I get the message Error: there is no package called 'mypkg'.

What could I do to avoid this problem? I have tried passing packages = c("mypkg") to the function without success. When the function is executing it is of course aware of the environment and the functions in the package. But when another session is created with the multisession plan, it seems that the other sessions are not aware of the package. I had a similar problem with doParallel and could solve it there.

The package is not installed, since it is under development. It might be the case that it works fine once installed as a proper package.

My intention is to make my package available to Windows users, and thus I would like to use the multisession plan.

HenrikBengtsson commented 6 years ago

Quick comment asking for clarification:

The package is not installed, since it is under development. ...

If it is not installed, how can it then be a package? Also, the error message Error: there is no package called 'mypkg' suggests that the future framework did indeed identify a package named mypkg. Is mypkg listed in sessionInfo() on the master or not? It sounds like you're using some special in-house tricks to "use a package without installing it".

FYI/details: on Unix/macOS, multiprocess uses multicore, and on Windows it uses multisession. So, the reason it works with multiprocess is that you're on either Unix or macOS and therefore use multicore, which in turn works because it uses forked processes. When using forked processes, all workers inherit everything from the master process.
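That resolution can be made explicit. A minimal sketch, using future's exported supportsMulticore() to pick the backend by hand:

```r
library(future)

# On Unix/macOS, multiprocess resolves to multicore (forked workers);
# on Windows, where forking is unavailable, it resolves to multisession.
if (supportsMulticore()) {
  plan(multicore)    # forked workers inherit everything from the master
} else {
  plan(multisession) # fresh background R sessions; packages must be installed
}
```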

gzagatti commented 6 years ago

I am developing a package roughly following Hadley's book. I have a folder which contains the package I am working on which is structured in the standard way.

When exploring the code I have written, I will load the package with devtools::load_all() which loads the package to the interactive R session I am currently running. When running the function in mypkg which calls future() with the multisession planning turned on, I get the error message.

Similarly, when I test the code using testthat::test() in the same interactive R session, with scripts located in tests/testthat, I get the same error.

The execution stack where the error is raised, which can be retrieved with options(error = recover), lists the following:

...
 8: value.ClusterFuture(X[[i]], ...)
 9: NextMethod("value")
10: value.Future(X[[i]], ...)

It seems that there is an attempt to extract the value of the future. Upon inspecting the future object, I found that it listed 4 packages: ('future', 'mypkg', 'stats', 'utils'). I am not sure where that information comes from, but my intuition is that it comes from the environment where the function was called.

I think that devtools::load_all and testthat::test simulate the rough operation of loading an installed package without having it installed. Ideally, a call to future should copy the same search path as on the master to avoid such conflicts. Just passing the package name is not sufficient in this case.
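The search-path mismatch can be observed directly. A minimal sketch: a multisession worker starts as a fresh R session, so anything attached on the master only via devtools::load_all() is missing on the worker:

```r
library(future)
plan(multisession, workers = 1)

# Ask a worker for its search path and compare it with the master's;
# packages loaded on the master via devtools::load_all() will be in the
# difference, since the worker never attached them.
f <- future(search())
setdiff(search(), value(f))
```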

Finally, thanks for clarifying the difference between multiprocess and multisession. The nice thing about multiprocess (when it resolves to multicore) is exactly that the whole process is forked, such that the environment and search path are preserved.

HenrikBengtsson commented 6 years ago

Is there a reason why you don't want to install the package? Installing the package would most likely solve the problem.

I'm pretty certain that we do not want future to emulate devtools' and testthat's emulation of how base R builds, installs, and checks packages. devtools and testthat have their pros and cons, and you might be hitting one of the cons here.

OTOH, I can imagine a similar scenario using only base R. That would basically be when you build a package with R CMD build PkgA and then check it with R CMD check PkgA without installing it first (i.e. R CMD INSTALL PkgA). That would work for any package that does not call its own functions in an external R process. If the package relies on itself in another R process/session, then it must be installed, i.e. be in the library path. This is true for all parallel frameworks running R in a background process, e.g. PSOCK clusters of parallel (= multisession/cluster in future), callr, batchtools, ...

HenrikBengtsson commented 6 years ago

Upon inspecting the future object, I found that it listed 4 packages: ('future', 'mypkg', 'stats', 'utils'). I am not sure where that information is coming from. But my intuition is that it comes from the environment where the function was called.

That comes from static code inspection of the future expression before it is launched. You can see this if you create a lazy future (which is not launched), e.g.

> library(matrixStats) # rowSds()
> library(future)
> plan(sequential)
> f <- future({ X <- matrix(rnorm(100), nrow = 10); rowSds(X) }, lazy = TRUE)
> f$packages
[1] "matrixStats" "stats"

This tells us that the future expression depends on those two packages; rnorm() is from stats. (BTW, anyone reading this, please don't rely on f$packages - it's an internal field that may change at any time.)

The automatic identification of packages can be overridden using the packages argument - but note that this is independent of your problem.
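A sketch of that override, reusing the matrixStats example above and declaring the packages explicitly instead of relying solely on static code inspection:

```r
library(future)
plan(sequential)

# Explicitly declare which packages should be attached on the worker:
f <- future({
  X <- matrix(rnorm(100), nrow = 10)
  matrixStats::rowSds(X)
}, packages = c("matrixStats", "stats"), lazy = TRUE)
```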

gzagatti commented 6 years ago

Thanks for the very thorough comments. I am not very well acquainted with the ins and outs of R, so they are really helpful.

The reason I don't want to install the package is that I am currently developing a library attached to a project I am working on. In fact, the package is being developed inside the project's folder. Since this library has a lot of ad-hoc functionality, I would not like to install it system-wide. I find it very convenient to load it via devtools::load_all() in analytical scripts (e.g. Rmd) and to be able to test functionality added to the project as work goes on.

If I were to install the library, I don't know if it would become very inconvenient to work with, having to re-install it after every single change. So I would be looking for something equivalent to Python's pip install -e .

Since the project is shared with different users (Windows, Mac, Linux), it would be convenient if all could load the package the same way.

HenrikBengtsson commented 6 years ago

I'd still argue that you should install the package; what you're asking for does not really make sense. You are basically asking for the following package test to "just work":

system2("Rscript", args = c("-e", shQuote("print(myfun)")))

where myfun is your function in your mypkg. My point is that it is more or less impossible for that call to figure out what myfun is without either specifying mypkg::myfun or attaching the package, as in:

system2("Rscript", args = c("-e", shQuote("library(mypkg); print(myfun)")))

Either way, mypkg needs to be installed.

  1. You can install packages to your local package library under your home directory. That way it won't affect anyone else. That is basically the default behavior of R, unless you installed R yourself or run it as admin/root.

  2. If you want to share your code with other users and you develop it as a package rather than as standalone scripts, then you should all the more encourage that the package be installed. You can ask each user to install it to their own R package library, or you can install it to a site-wide package library (see ?.libPaths). To me it does not make sense to ask users to use devtools::load_all() to use your code/package.

  3. If you don't like the above, I think your use case is better addressed by an update to devtools and testthat rather than future, because there is nothing specific to the future package in your development workflow. It applies to several other cases where a package needs to run in a standalone background process (such as future's multisession workers, or the above example).
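For point 1, installing the in-development package into the per-user library can be a one-liner. A sketch, where "path/to/mypkg" stands in for the actual source directory:

```r
# Install from the package's source directory into the default library
# (the first element of .libPaths(), typically the per-user library):
install.packages("path/to/mypkg", repos = NULL, type = "source")

# Or with devtools, which also (re)builds documentation:
devtools::install("path/to/mypkg")
```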

PS. In R, the term 'library' has a different meaning than 'package'. A library contains a set of packages. Use 'package' when in doubt.

gzagatti commented 6 years ago

Again, thanks for clarifying the issue. It makes sense now. I definitely need to rethink the workflow. As I said, I was looking for something like Python's pip install -e ., which installs the package in editable mode, meaning the source tree can be edited while under development. Otherwise, I am totally fine with installing the package once in production. It is just a hassle to develop and test a package if I have to re-install it for every change in the codebase, though this has nothing to do with future. I am happy to close this issue after this fruitful discussion. I will follow up with the developers of devtools and testthat for their views.

HenrikBengtsson commented 6 years ago

You can also temporarily add a devel library path to .libPaths() and install your work-in-progress package there, such that when that path is removed your package is no longer available in other workflows. However, then you need to figure out how to get background sessions to have the same .libPaths() setup as the master. This is where I think it makes most sense to add such features to devtools.
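A sketch of that approach. The devel library path and package path are illustrative, and propagating .libPaths() via the R_LIBS environment variable is one possible workaround, not an official feature:

```r
# Prepend a temporary development library on the master:
devlib <- file.path(tempdir(), "devlib")
dir.create(devlib, showWarnings = FALSE)
.libPaths(c(devlib, .libPaths()))
install.packages("path/to/mypkg", repos = NULL, type = "source", lib = devlib)

# Background R sessions do not inherit the master's .libPaths() directly,
# but they do inherit environment variables, and R_LIBS is read at startup:
Sys.setenv(R_LIBS = paste(.libPaths(), collapse = .Platform$path.sep))

library(future)
plan(multisession)  # workers started after this should see the devel library
```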

Cheers. Over and out...