SachaEpskamp / bootnet

Bootstrap methods for various network estimation routines

Attaching `bootnet` generates a `.Random.seed` in the `.GlobalEnv` #82

mihaiconstantin opened this issue 2 years ago

mihaiconstantin commented 2 years ago

I am using bootnet in a package and when creating a PSOCK cluster I noticed that the child processes are populated with a .Random.seed.

After a lot of debugging, I tracked this down to bootnet being attached to the R session. Consider the following code in a file called script.R:

# Expect global environment to be clean.
cat("Before `bootnet` load:", paste0(ls(all.names = TRUE)), "\n")

# Load `bootnet.`
library(bootnet)

# Expect global environment to remain clean.
cat("After `bootnet` load:", paste0(ls(all.names = TRUE)), "\n")

Running: Rscript script.R yields:

Before `bootnet` load:  
Loading required package: ggplot2
This is bootnet 1.5.1
For questions and issues, please see github.com/SachaEpskamp/bootnet.
After `bootnet` load: .Random.seed 

It seems that after bootnet is loaded, the .GlobalEnv is polluted with a .Random.seed object. I am not sure this is intentional and, if it is, whether invoking the RNG at package load is sensible. In my particular case, this resulted in a hard-to-debug scenario when dealing with seeds on PSOCK clusters.

mihaiconstantin commented 2 years ago

After some more debugging, I think I found the commit that introduced the problem. Installing the package at this commit (i.e., b10799ffb8a801189e8cd3e7a090621d2016d817) and running the script.R file above results in the .Random.seed being created:

remotes::install_github("SachaEpskamp/bootnet", ref = "b10799ffb8a801189e8cd3e7a090621d2016d817")

At the previous commit (i.e., 93cdb315384652ed719adff8d1a27245b7e207ed) things seem to be fine. In fact, the seed is generated by attaching the snow package, not bootnet itself. So this issue is somewhat moot but, before I close it, I am curious whether you see any issues with the behavior I described.

SachaEpskamp commented 2 years ago

Hi Mihai,

Thanks for all this! So the issue lies with the snow package and not with bootnet itself? I switched to the snow package because, I think, the parallel-based computing previously led to crashes on Mac. But snow is an old package, so I should indeed update this. Do you think it would be better to revert to using the parallel package?
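For reference, the core parallel calls that would replace snow's cluster API are largely drop-in. A minimal sketch (the cluster size, the `square` helper, and the input vector are placeholders, not bootnet code):

```r
library(parallel)

# Create a PSOCK cluster; this type works on all platforms, including Windows.
cl <- makePSOCKcluster(2)

# Export what the workers need, then run the task across the cluster.
square <- function(x) x^2
clusterExport(cl, "square")
result <- parLapply(cl, 1:4, function(x) square(x))

# Always release the workers when done.
stopCluster(cl)

unlist(result)
```

The `makePSOCKcluster()` / `clusterExport()` / `parLapply()` / `stopCluster()` sequence mirrors the typical snow workflow, which is why migrating between the two tends to be mechanical.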

Best, Sacha

mihaiconstantin commented 2 years ago

Hi Sacha,

Indeed, the issue is with the snow package. More specifically, the RNG is invoked inside the function initDefaultClusterOptions() to set a port number, and this function is then called by snow's .onLoad(). I also asked what others think about this here.
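The mechanism is easy to demonstrate: in a fresh session, `.Random.seed` does not exist until the RNG is first used, and any RNG call materializes it in the global environment. A small sketch (the port-drawing line only approximates what snow does at load time):

```r
# In a fresh R session (e.g., started via `Rscript`), `.Random.seed` does
# not yet exist in the global environment.
before <- exists(".Random.seed", envir = globalenv())

# Any RNG use creates it as a side effect -- for instance, drawing a random
# port number, roughly what happens when default cluster options are set up.
port <- 11000 + sample.int(1000, 1)

after <- exists(".Random.seed", envir = globalenv())
cat("before:", before, "- after:", after, "\n")
```

Run via `Rscript` this prints `before: FALSE - after: TRUE`, which is exactly the pollution observed when attaching a package that touches the RNG in `.onLoad()`.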

The snow package is quite old and I am a bit reluctant to use it... In my package, I use parallel and I find it quite stable. To make my life easier, I built a lightweight wrapper around some of the APIs of parallel that I needed.

The rationale behind this wrapper is to:

You can see the self-contained wrapper here, and here is a toy example of how I use it:

# Some variables.
data <- matrix(rnorm(9), 3, 3)

# Create backend instance.
backend <- Backend$new()

# Start the cluster.
# If the type is not provided, it is inferred from the OS.
# The number of cores is selected s.t. at least one core is always left free.
# Upon creation the cluster is always cleared to ensure nothing unintentional is
# copied (e.g., when forking).
backend$start(cores = 2, type = "psock")

# Export variables to a cluster.
backend$export(variables = c("data"), environment = .GlobalEnv)

# Inspect what variables are on the cluster.
backend$inspect()

# Evaluate an arbitrary expression on the cluster.
backend$evaluate(expression = { data^2 })

# Clear the cluster.
backend$clear()

# To check that the cluster has been cleaned.
backend$inspect()

# Run tasks on the cluster in an `sapply` fashion.
backend$sapply(x = data[, 1], fun = function(x) { x^2 })

# Run tasks on the cluster in an `apply` fashion.
backend$apply(x = data, margin = 2, fun = function(x) { x^2 })

# Adopt a cluster that was created externally.
# It will fail if there is already an active cluster registered with the backend.
backend$adopt(cluster = parallel::makePSOCKcluster(2))

# Close it.
# If the cluster is not stopped explicitly, it is stopped automatically when
# the `backend` instance is garbage collected.
backend$stop()

# Try to adopt a cluster again now that the previous one is closed.
backend$adopt(cluster = parallel::makePSOCKcluster(2))

# Now the cluster type is switched from `psock` or `fork` to `adopted`.
backend$type

# Check that it also works with the adopted cluster.
backend$evaluate(expression = { rnorm(3) })

# The following fields can be accessed.

# Is there an active cluster registered with the backend?
backend$active

# How many nodes?
backend$cores

# What type?
backend$type

# The `parallel` cluster object that can be used with the `parallel` functions.
backend$cluster

# Stop the cluster.
backend$stop()

# The fields are reset upon cluster stop.
backend$active
backend$cluster
backend$type
backend$cores

In my functions, I actually use it as follows:

# Simulate sequentially.
simulate <- function(data) {
    sapply(data, function(x) { Sys.sleep(0.5); return(x^2) })
}

# Simulate in parallel.
simulate_parallel <- function(data, backend) {
    backend$sapply(data, function(x) { Sys.sleep(0.5); return(x^2) })
}

# Let's say this is the exported function in the `NAMESPACE`.
simulation <- function(data, cores = NULL, backend_type = NULL) {
    # Decide whether it is necessary to create a parallel backend.
    use_backend <- !is.null(cores) && cores > 1

    # Prepare backend if necessary.
    if (use_backend) {
        # Create backend instance.
        backend <- Backend$new()

        # Start it.
        backend$start(cores, type = backend_type)

        # Run the task.
        result <- simulate_parallel(data, backend)

        # Close the backend.
        backend$stop()

    # Otherwise just run the task sequentially.
    } else {
        result <- simulate(data)
    }

    return(result)
}

# Data.
set.seed(1)
data <- rnorm(10)

# Sequential.
simulation(data)

# Parallel.
simulation(data, 5)

You don't need the simulate() and simulate_parallel() functions to begin with. You could just replace, for instance, result <- simulate_parallel(data, backend) with result <- backend$sapply(data, function(x) { Sys.sleep(0.5); return(x^2) }).

But I like the simulate_parallel() approach because it allows me to have separate implementations for the functions that can benefit from running on a cluster. This way, I can gradually add these new implementations to my package; all I need is to pass a reference to the backend object, which I know how to consume inside these functions (e.g., $inspect(), $clear(), $sapply(), etc.).

I hope this makes sense!

SachaEpskamp commented 1 year ago

Hi Mihai,

Thank you for your insights here, and sorry for the late reply. I notice that now, every time I start RStudio, I get the .Random.seed object, which is weird because snow shouldn't even be loaded, and it doesn't happen when running R from the terminal. RStudio is doing something weird there too; I will look into this.

I am reluctant to change the dependency on snow for now. I depended on parallel before, but it caused a lot of issues for some users at the time (I think Mac users couldn't use the package anymore), so I changed it in this commit. I will upload a version of bootnet to CRAN soon and will keep snow in that version, but for the version after that we can see if we can switch to parabar, which is your new package for this, correct?

mihaiconstantin commented 1 year ago

Hi Sacha,

I recall encountering an issue with parallel on macOS as well. In my search, I found this and this, which led to my question here; in turn, Henrik realized (i.e., here) that it was a bug in R itself (i.e., filed and fixed here).

Long story short, the cluster was failing to create the worker processes with setup_strategy = "parallel" (i.e., the default). With setup_strategy = "sequential" it worked just fine, which is also what I see you are doing in commit https://github.com/SachaEpskamp/bootnet/commit/b10799ffb8a801189e8cd3e7a090621d2016d817:

https://github.com/SachaEpskamp/bootnet/blob/b10799ffb8a801189e8cd3e7a090621d2016d817/R/bootnet.R#L533-L536

My guess would be that parallel will work well nowadays.
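The workaround from that era can still be expressed explicitly; `setup_strategy` is an argument accepted by `parallel::makeCluster()` in recent R versions. A minimal sketch:

```r
library(parallel)

# Setting up workers one at a time ("sequential") sidestepped the macOS bug
# in the default "parallel" setup strategy; that bug has since been fixed
# in R itself, so the default is fine nowadays.
cl <- makeCluster(2, type = "PSOCK", setup_strategy = "sequential")

# Sanity check: each worker responds with its process id.
pids <- clusterEvalQ(cl, Sys.getpid())
stopCluster(cl)

length(pids)
```

With the upstream fix in place, dropping the `setup_strategy` argument and relying on the default should behave identically on current R releases.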

Indeed, we can try parabar and leverage the progress tracking as well. For parabar I in fact use parallel, because I aimed to stay as close as possible to what ships with R and to reduce the number of dependencies. However, I went overboard with the tests to ensure the parallelization is properly exercised (i.e., >98% code coverage). When we try out parabar, we can also add a few tests to ensure the output is what we expect. And since these tests are run automatically by R CMD check, this means we target different operating systems and R builds (i.e., as of #100), which should give us some peace of mind.

SachaEpskamp commented 1 year ago

Hi Mihai, I see, thanks for the info! I already expected there was some bug in parallel at that time, it was really weird that bootnet just suddenly didn't work anymore on Mac... I will submit this version to CRAN now, and then we can include parabar in bootnet. I'll try to look at it in the coming weeks or end of summer.

mihaiconstantin commented 1 year ago

Also, in parabar the cluster gets cleared by design (i.e., here). So, no matter what other packages or IDEs decide that should be in the .GlobalEnv of the worker nodes, that gets removed unless explicitly exported by the user.
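With plain parallel, the same hygiene can be approximated by wiping each worker's global environment right after the cluster starts. This is only a sketch of the idea, not parabar's actual implementation:

```r
library(parallel)

cl <- makePSOCKcluster(2)

# Remove every object, including hidden ones such as `.Random.seed`,
# from each worker's global environment.
clusterEvalQ(cl, rm(
    list = ls(envir = .GlobalEnv, all.names = TRUE),
    envir = .GlobalEnv
))

# Confirm the workers are clean: each node should report zero objects.
leftovers <- clusterEvalQ(cl, ls(envir = .GlobalEnv, all.names = TRUE))
stopCluster(cl)

lengths(leftovers)
```

After the wipe, only objects the user explicitly exports (e.g., via `clusterExport()`) exist on the workers, which is the guarantee described above.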

> Hi Mihai, I see, thanks for the info! I already expected there was some bug in parallel at that time, it was really weird that bootnet just suddenly didn't work anymore on Mac... I will submit this version to CRAN now, and then we can include parabar in bootnet. I'll try to look at it in the coming weeks or end of summer.

Sure thing! I can also, of course, help with that.

mihaiconstantin commented 1 week ago

Note for clarity: