DarwinAwardWinner opened 1 year ago
Related question: is there an easy way to determine whether a particular BiocParallel param will use multiple hosts (or whether it at least has the potential to do so)?
Also, I just remembered that even base R's parallel package (and therefore SnowParam) has the capability to run on multiple hosts using ssh.
SharedObject is designed for the single-machine case. Supporting multiple machines is possible, but it cannot be done without changing BiocParallel's core functions, which is unlikely to happen soon (I have a plan, but it will take time to work on it).
As a short-term solution, we can let users decide whether or not to share an object. They are usually best placed to decide how to use SharedObject. We can certainly make our code smarter, but some advanced features cannot work well without knowing the context the parallel code is dealing with (e.g. disabling copy-on-write).
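As a sketch of that short-term approach, package code could expose a flag and leave the decision to the caller. The function name and the `useShared` argument here are hypothetical, not part of SharedObject's API:

```r
library(BiocParallel)
library(SharedObject)

## Hypothetical package function: the caller, who knows the backend,
## decides whether the input should be placed in shared memory first.
parallelColSums <- function(x, BPPARAM = SnowParam(2), useShared = FALSE) {
    if (useShared)
        x <- share(x)          # zero-copy transfer for local workers
    unlist(bplapply(seq_len(ncol(x)),
                    function(i, m) sum(m[, i]),
                    m = x,
                    BPPARAM = BPPARAM))
}

## The caller opts in only when the backend is known to be single-machine:
## parallelColSums(m, BPPARAM = SnowParam(4), useShared = TRUE)
```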
It is possible to determine whether a parallel backend has multiple hosts. In fact, we can do more than that: we can determine whether the workers can access the master's memory (being on the same machine does not necessarily imply that a worker can access the shared object). See the code below:
library(BiocParallel)
library(SharedObject)

nworkers <- 2
param <- SnowParam(nworkers)

## Allocate a 1-byte block of shared memory and ask each worker
## whether it can see it. `id` is passed explicitly so that it is
## exported to the workers, and hasSharedMemory() is namespaced so
## the workers do not need SharedObject attached.
id <- allocateSharedMemory(1)
res <- bplapply(seq_len(nworkers),
                function(x, id) SharedObject::hasSharedMemory(id),
                id = id,
                BPPARAM = param)
freeSharedMemory(id)
res
I see, so in principle one could test empirically whether a BiocParallelParam object can access the master process's shared memory. However, this isn't guaranteed to give you the right answer, since you can easily have a param that mixes local and non-local workers (I believe you even demonstrated this with RedisParam at Bioc 2023).
I'm mainly thinking about this in terms of package code that wants to use SharedObject with BiocParallel and allow the user to specify the BPPARAM object. In this case, it seems that there are 3 options, none of which seem appealing:
I guess the ideal case would be if shared objects could magically know whether the worker they're being sent to has access to the shared memory and un-share themselves before serializing if not. Perhaps BiocParallel itself has enough information at runtime to implement this logic?
As far as I know, neither SharedObject nor BiocParallel knows whether the workers are local or remote. It is the backend that knows where each worker runs, so we would need either a new API for querying worker locations from the backend, or a new backend specifically for SharedObject. Adding this feature means a lot of work to ensure backward compatibility. It is possible, but it would take at least a year of work, unfortunately...
Suppose I want to use SharedObject with BiocParallel, e.g.:
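For concreteness, here is a minimal sketch of the kind of usage I have in mind; the data and the worker function are just placeholders:

```r
library(BiocParallel)
library(SharedObject)

x <- matrix(runif(1000), nrow = 10)
sx <- share(x)                       # place the matrix in shared memory

param <- SnowParam(2)                # but in package code, BPPARAM
res <- bplapply(seq_len(ncol(sx)),   # would come from the user
                function(i, m) sum(m[, i]),
                m = sx,
                BPPARAM = param)
```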
Some BiocParallel params (e.g. batchtools, Redis, future) can potentially run on multiple machines. Will shared objects work in this case? My guess is no, because I can't imagine how this would work in the general case. Perhaps SharedObject could be smart enough under the hood to make the read-only case work by copying the value to the workers.
Assuming the above code can't be guaranteed to work for all parallel backends, how can we practically use SharedObject with BiocParallel and other parallel frameworks with the potential to parallelize across multiple hosts? Do we need to check which parallel backend is in use and fall back to a separate non-SharedObject implementation if the backend is using multiple hosts?