mihaiconstantin opened this issue 3 years ago
After some more debugging, I think I found the commit that introduces the problem. Installing the package at this commit (i.e., b10799ffb8a801189e8cd3e7a090621d2016d817) and running the `script.R` file above results in the `.Random.seed` being created:

```r
remotes::install_github("SachaEpskamp/bootnet", ref = "b10799ffb8a801189e8cd3e7a090621d2016d817")
```

At the previous commit (i.e., 93cdb315384652ed719adff8d1a27245b7e207ed) things seem to be fine. In fact, this seed is generated by attaching the `snow` package, not `bootnet`. So, this issue is sort of pointless, but before I close it, I am curious whether you see any issues with the behavior I indicated.
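For reference, the behavior above can be checked from a fresh R session. This is a minimal sketch assuming `snow` is installed; in patched versions of `snow` the seed may no longer be created at load time:

```r
# In a fresh R session, `.Random.seed` does not exist until the RNG is used.
before <- exists(".Random.seed", envir = globalenv())

# Loading the namespace runs the package's `.onLoad()` hook, which in the
# affected `snow` versions invokes the RNG while picking a default port.
snow_available <- requireNamespace("snow", quietly = TRUE)

after <- exists(".Random.seed", envir = globalenv())

# Compare the two states; in the affected versions, `after` flips to TRUE.
c(before = before, after = after)
```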
Hi Mihai,
Thanks for all this! So the issue lies with the snow package and not the bootnet package? I changed to using the snow package because the parallel computing led to crashes on Mac before, I think. But snow is an old package, so I should indeed update this. Do you think it would be better to revert back to using the parallel package?
Best, Sacha
Hi Sacha,
Indeed, the issue is with the `snow` package. More specifically, the `RNG` is invoked inside the function `initDefaultClusterOptions()` for setting a port number, and this function is then called by `.onLoad()`. I also asked what others think about this here.
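To make the mechanism concrete: any RNG call executed at load time is enough to materialize `.Random.seed` in the global environment. The sketch below imitates the idea with a hypothetical port picker (the function name and port range are made up, not `snow`'s actual code):

```r
# Hypothetical illustration of a load-time hook that draws a "random"
# default port; any RNG call creates `.Random.seed` as a side effect.
pick_default_port <- function() {
  sample(11000:11999, 1)
}

port <- pick_default_port()

# The side effect: the seed object now exists in the global environment,
# even though the user never touched the RNG themselves.
exists(".Random.seed", envir = globalenv())  # TRUE
```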
The `snow` package is quite old and I am a bit reluctant to use it... In my package, I use `parallel` and I find it quite stable. To make my life easier, I built a lightweight wrapper around some of the APIs of `parallel` that I needed.

The rationale behind this wrapper is to:

- infer the cluster type based on the `OS`;
- always clear the cluster upon creation (e.g., to avoid copying the `.Random.seed` from the main process when forking).

You can see the self-contained wrapper here, and here is a toy example of how I use it:
```r
# Some variables.
data <- matrix(rnorm(9), 3, 3)

# Create backend instance.
backend <- Backend$new()

# Start the cluster.
# If the type is not provided, it is inferred based on the OS.
# The number of cores is selected s.t. at least one core is always left free.
# Upon creation the cluster is always cleared to ensure nothing unintentional
# is copied (e.g., when forking).
backend$start(cores = 2, type = "psock")

# Export variables to the cluster.
backend$export(variables = c("data"), environment = .GlobalEnv)

# Inspect what variables are on the cluster.
backend$inspect()

# Evaluate an arbitrary expression on the cluster.
backend$evaluate(expression = { data^2 })

# Clear the cluster.
backend$clear()

# Check that the cluster has been cleared.
backend$inspect()

# Run tasks on the cluster in an `sapply` fashion.
backend$sapply(x = data[, 1], fun = function(x) { x^2 })

# Run tasks on the cluster in an `apply` fashion.
backend$apply(x = data, margin = 2, fun = function(x) { x^2 })

# Adopt a cluster that was created externally.
# It will fail if there is already an active cluster registered with the backend.
backend$adopt(cluster = parallel::makePSOCKcluster(2))

# Close it.
# If the cluster is not stopped explicitly, it is stopped automatically when
# the `backend` instance is garbage collected.
backend$stop()

# Try to adopt again now that the previous cluster is closed.
backend$adopt(cluster = parallel::makePSOCKcluster(2))

# Now the cluster type is switched from `psock` or `fork` to `adopted`.
backend$type

# Check that it also works with the adopted cluster.
backend$evaluate(expression = { rnorm(3) })

# The following fields can be accessed.

# Is there an active cluster registered with the backend?
backend$active

# How many nodes?
backend$cores

# What type?
backend$type

# The `parallel` cluster object that can be used with `parallel` functions.
backend$cluster

# Stop the cluster.
backend$stop()

# The fields are reset upon cluster stop.
backend$active
backend$cluster
backend$type
backend$cores
```
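For comparison, the clearing step the wrapper performs upon creation can be approximated with plain `parallel` calls. This is a sketch of the idea, not the wrapper's actual implementation:

```r
library(parallel)

cl <- makePSOCKcluster(2)

# Remove everything, including hidden objects such as `.Random.seed`,
# from each worker's global environment.
invisible(clusterEvalQ(cl, {
  rm(list = ls(envir = globalenv(), all.names = TRUE), envir = globalenv())
}))

# Verify the workers are clean: each should report an empty environment.
leftovers <- clusterEvalQ(cl, ls(envir = globalenv(), all.names = TRUE))

stopCluster(cl)
```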
In my functions, I actually use it as follows:
```r
# Simulate sequentially.
simulate <- function(data) {
    sapply(data, function(x) { Sys.sleep(0.5); return(x^2) })
}

# Simulate in parallel.
simulate_parallel <- function(data, backend) {
    backend$sapply(data, function(x) { Sys.sleep(0.5); return(x^2) })
}

# Let's say this is the exported function in the `NAMESPACE`.
simulation <- function(data, cores = NULL, backend_type = NULL) {
    # Decide whether it is necessary to create a parallel backend.
    use_backend <- !is.null(cores) && cores > 1

    # Prepare the backend if necessary.
    if (use_backend) {
        # Create backend instance.
        backend <- Backend$new()

        # Start it.
        backend$start(cores, type = backend_type)

        # Run the task.
        result <- simulate_parallel(data, backend)

        # Close the backend.
        backend$stop()

    # Otherwise just run the task sequentially.
    } else {
        result <- simulate(data)
    }

    return(result)
}

# Data.
set.seed(1)
data <- rnorm(10)

# Sequential.
simulation(data)

# Parallel.
simulation(data, 5)
```
You don't need the `simulate()` and `simulate_parallel()` functions to begin with. You could just replace, for instance, `result <- simulate_parallel(data, backend)` with `result <- backend$sapply(data, function(x) { Sys.sleep(0.5); return(x^2) })`.

But I like the `simulate_parallel()` approach because it allows me to have separate implementations for the functions that can benefit from being run on a cluster. So I can gradually add these new implementations to my package, and all I need is to pass a reference to the `backend` object that I know how to consume inside these functions (e.g., via `$inspect()`, `$clear()`, `$sapply()`, etc.).
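For readers without the wrapper, the same sequential-or-parallel dispatch can be sketched directly against base `parallel`. The function name below is illustrative, not part of any package:

```r
library(parallel)

# Illustrative stand-in for `simulation()` that uses `parallel` directly.
simulation_base <- function(data, cores = NULL) {
  if (!is.null(cores) && cores > 1) {
    cl <- makePSOCKcluster(cores)
    # Ensure the cluster is stopped even if the task errors.
    on.exit(stopCluster(cl), add = TRUE)
    result <- parSapply(cl, data, function(x) x^2)
  } else {
    result <- sapply(data, function(x) x^2)
  }
  result
}

set.seed(1)
data <- rnorm(10)

# Sequential and parallel runs agree on the result.
agree <- isTRUE(all.equal(simulation_base(data), simulation_base(data, cores = 2)))
```

The `on.exit()` guard mirrors what a wrapper's garbage-collection cleanup buys you: the cluster is released no matter how the function exits.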
I hope this makes sense!
Hi Mihai,
Thank you for your insights here and sorry for the late reply. I notice indeed that now every time I start RStudio I get the object `.Random.seed`, which is weird because `snow` shouldn't even be loaded, and it doesn't happen when running R from the terminal... RStudio is doing something weird there too; I will look into this.

I am reluctant to change the dependency on `snow` for now. I depended on `parallel` before, but it gave a lot of issues for some users at the time (I think Mac users couldn't use the package anymore). So I changed it in this commit. I will upload a version of `bootnet` to CRAN soon, so I will keep `snow` in that version, but maybe for the version after we can see if we can change it to `parabar`, which is your new package, correct?
Hi Sacha,
I recall encountering an issue with `parallel` on `macOS` as well. In my search, I found this and this, which led to my question here; in turn, Henrik realized (i.e., here) that it was a bug in `R` itself (i.e., filed and fixed here).

Long story short, the cluster was failing to create the worker processes with `setup_strategy = "parallel"` (i.e., the default). With `setup_strategy = "sequential"` it worked just fine, which is also what I see you are doing in commit https://github.com/SachaEpskamp/bootnet/commit/b10799ffb8a801189e8cd3e7a090621d2016d817:

My guess would be that `parallel` will work well nowadays.
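The workaround can also be requested explicitly when creating a cluster; since R 4.0.0, the cluster-creation functions in `parallel` accept a `setup_strategy` argument:

```r
library(parallel)

# Force the sequential setup strategy, which sidestepped the buggy
# "parallel" strategy on macOS at the time.
cl <- makePSOCKcluster(2, setup_strategy = "sequential")

# The workers are up: each reports its own process id.
pids <- unlist(clusterEvalQ(cl, Sys.getpid()))

stopCluster(cl)
```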
Indeed, we can try `parabar` and leverage the progress tracking as well. For `parabar`, I in fact use `parallel`, because I aimed to stay as close as possible to what ships with `R` and reduce the number of dependencies. However, I went overboard with the tests to ensure the parallelization is properly tested (i.e., >98% code coverage). When we try out `parabar`, we can also add a few tests to ensure the output is what we expect. And since these tests are automatically run by `R CMD check`, this means we target different operating systems and `R` builds (i.e., as of #100), which should give us some peace of mind.
Also, in `parabar` the cluster gets cleared by design (i.e., here). So, no matter what other packages or IDEs decide should be in the `.GlobalEnv` of the worker nodes, that gets removed unless explicitly exported by the user.
Hi Mihai, I see, thanks for the info! I already expected there was some bug in parallel at that time, it was really weird that bootnet just suddenly didn't work anymore on Mac... I will submit this version to CRAN now, and then we can include parabar in bootnet. I'll try to look at it in the coming weeks or end of summer.
Sure thing! I can also, of course, help with that.
Note for clarity, quoting the original issue text:

> I am using `bootnet` in a package and when creating a `PSOCK` cluster I noticed that the child processes are populated with a `.Random.seed`. After a lot of debugging, I tracked this down to `bootnet` being attached to the `R` session. Consider the following code in a file called `script.R`:
>
> Running `Rscript script.R` yields:
>
> It seems that after `bootnet` is loaded the `.GlobalEnv` is polluted with a `.Random.seed` object. I am not sure this is intentional and, if it is, whether invoking the `RNG` at package load is sensible. In my particular case, this resulted in a hard-to-debug scenario when dealing with seeds on `PSOCK` clusters.
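As a side note on the hard-to-debug scenario described above: for reproducible draws on `PSOCK` clusters, the worker RNG streams can be set explicitly, which makes any stray `.Random.seed` the workers start with irrelevant:

```r
library(parallel)

cl <- makePSOCKcluster(2)

# Give each worker a reproducible L'Ecuyer-CMRG stream.
clusterSetRNGStream(cl, iseed = 1)
draws_first <- parSapply(cl, 1:2, function(i) rnorm(1))

# Resetting the streams with the same seed reproduces the same draws.
clusterSetRNGStream(cl, iseed = 1)
draws_second <- parSapply(cl, 1:2, function(i) rnorm(1))

identical(draws_first, draws_second)  # TRUE

stopCluster(cl)
```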