Bioconductor / basilisk

Clone of the Bioconductor repository for the basilisk package.
https://bioconductor.org/packages/devel/bioc/html/basilisk.html
GNU General Public License v3.0
27 stars 14 forks source link

load reticulate on fallback earlier and w/o env #30

Closed mbstadler closed 8 months ago

mbstadler commented 8 months ago

This is related to the bioc slack discussion starting with https://community-bioc.slack.com/archives/C056CEJTH5Z/p1698998329036819 from @csoneson

By loading reticulate on the fallback process earlier and using a function defined in .GlobalEnv, the fallback R process does not inherit the parent's reticulate namespace anymore, which otherwise leads to an error in Bioconductor 3.18 and onwards.

mtmorgan commented 8 months ago

Another, simpler, solution might be clusterEvalQ(), which evaluates an expression on the remote cluster. So

proc = makePSOCKcluster(1)
## local process id
Sys.getpid()
## remote process id
clusterEvalQ(proc, Sys.getpid())[[1]]

So just

clusterEvalQ(
    proc,
    library("reticulate", character.only=TRUE, lib.loc = file.path(R.home(), "library"))
)

There is a little subtlety involved here -- presumably R.home() is the home directory of the spawned process, and not of the parent process.

LTLA commented 8 months ago

I wonder what changed (in R?) given that it was working fine before.

At any rate: it seems that, to be completely safe, we would have to "sanitize" the environment for every function we send over to the fallback process. This isn't just limited to the library call, e.g., all assigner calls, the basilisk.utils:::.activate_condaenv call, the no-op function here, and the final calls to basiliskStart and .instantiate_store. So we would do something like:

        envstripper <- function(fun) {
             environment(fun) <- .GlobalEnv
             fun
        }

        clusterCall(proc, envstripper(function() {
            library("reticulate", character.only=TRUE, lib.loc = file.path(R.home(), "library"))
        }))

        assigner <- envstripper(function(name, value) assign(name, value, envir=.GlobalEnv))
        clusterCall(proc, assigner, name=".activate_condaenv", value=envstripper(basilisk.utils:::.activate_condaenv))
        clusterCall(proc, assigner, name="isWindows", value=isWindows)
        clusterCall(proc, envstripper(activateEnvironment), envpath=envpath, loc=getCondaDir())
        clusterCall(proc, assigner, name="activateEnvironment", value=list) # no-op as we already ran it.
        clusterCall(proc, envstripper(useBasiliskEnv), envpath=envpath)
        clusterCall(proc, envstripper(.instantiate_store))

Ideally, I'd have something that would basically go into useBasiliskEnv, resolve all calls to basilisk[.utils] utilities to base-only functions, and effectively "compile" it into a dependency-free function body that could be shipped to the child. Then I wouldn't need hard-coded knowledge of the internals needed to get useBasiliskEnv to work.

Until then, though, the environment stripping idea is probably good enough.

mbstadler commented 8 months ago

Yes, stripping the basilisk environment from all of these functions seems an even safer approach. I think that only the library call throws an error because the others do not require a package from a defined lib.loc and will be happy with the namespace that comes with the functions.

I have adapted the PR according to your code above.

LTLA commented 8 months ago

Alright, let's see how it goes on BioC-devel.