Closed markromanmiller closed 2 years ago
Thanks for pointing this out! Yes, they should have the same output. Seems like the multi-core instance is not executing all the universes (or is not executing in the correct environment). I'll take a look
This seems to be an issue with how we use environments with `parallel::mcmapply`, since the code works fine with both `mapply` and `future.apply::future_mapply`. I'm trying to figure out an alternative solution that supports multiple cores, perhaps using the `future` package, which has been a long-standing discussion (#54, #89), but this likely means that we can't use `pbmcapply` either. I'll try to look for an alternate implementation of progress bars.
Actually, this problem seems to exist for any multicore / multisession library. The problem probably lies somewhere in the use of environments in parallel, but I can't seem to figure out what it is...
I'm going to hazard a guess that mc*apply functions are designed to return a value, not necessarily carry over the side-effects of running code - as how could one tell what those side-effects are?
One approach could be requiring the user to be specific about what objects they want to return - if mc*apply functions return one object per function, perhaps that can be the environment? I don't know, I'm spitballing here. I do currently expect to use cluster computing with multiverse in the next month or two, so I have some time to put into this feature if my need arises.
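A minimal sketch of that idea, not how multiverse implements it: each worker returns the bindings of its environment as a plain list, and the parent process copies them back into the original environments. The helpers `eval_and_collect` and `restore_env` are hypothetical names, and `n_cores` falls back to 1 on Windows, where forking is unavailable.

```r
library(parallel)

# Hypothetical helper: evaluate `code` in `env` and return the bindings as a
# list, so the result survives the trip back from the worker process.
eval_and_collect <- function(code, env) {
  eval(code, envir = env)
  as.list(env, all.names = TRUE)
}

# Hypothetical helper: copy the returned bindings into the original environment.
restore_env <- function(values, env) {
  list2env(values, envir = env)
}

code_list <- list(quote({a = 111}), quote({b = 112}), quote({c = 113}), quote({d = 114}))
env_list <- list(new.env(), new.env(), new.env(), new.env())

n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# SIMPLIFY = FALSE keeps one list of bindings per universe.
values <- mcmapply(eval_and_collect, code_list, env_list,
                   SIMPLIFY = FALSE, mc.cores = n_cores)

# Write the results back into the environments that live in the parent process.
invisible(Map(restore_env, values, env_list))

env_names <- lapply(env_list, ls)
```

The trade-off is that every object a universe creates gets serialized and sent back, which may be costly for large intermediate results.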
tl;dr your approach of rewriting the environments makes sense. I describe below *what I think* is going wrong but I'll see if @mjskay has any alternative suggestions
Interesting, so it seems like mc*apply functions do something weird with environments:
library(parallel) # provides mcmapply
library(rlang)
library(purrr)
env_list = list(new.env(), new.env(), new.env(), new.env()) # creates four new environments, with the global env as the parent
code_list = list(expr({a = 111}), expr({b = 112}), expr({c = 113}), expr({d = 114})) # random code
res = mapply(eval, expr = code_list, envir = env_list) # executes the code in each environment
map(env_list, env_names) # returns the names of the variables defined in each environment
env_list_2 = list(new.env(), new.env(), new.env(), new.env())
res = mcmapply(eval, expr = code_list, envir = env_list_2)
map(env_list_2, env_names) # returns `character(0)`
On further inspection (based on the approach you described), it seems like mc*apply functions do not return the same environments that were initially used, but rather return entirely new environments:
eval_in_env = function(c, e) {
  eval(expr = c, envir = e)
  e
}
env_mapply = mapply(eval_in_env, code_list, env_list)
map2(env_mapply, env_list, identical) # returns TRUE for all
env_mcmapply = mcmapply(eval_in_env, code_list, env_list)
map2(env_mcmapply, env_list, identical) # returns FALSE for all
This second issue is why the output differs: the actual environments in which mc*apply executes the code are not stored anywhere. This makes me wonder if we should just use mc*apply instead of mapply (even for single-core operations) and change how we deal with environments instead of having two separate pathways...
Yeah, I don't know exactly how R environments work with threads or multiple processes, but I would guess that they can't be shared across them. So I would guess that the parallel versions of apply copy environment contents into a new environment on a separate thread or process and then copy results back upon completion. So they would not be able to directly modify environments in the original thread.
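This guess is consistent with how R serializes environments. The sketch below assumes that results from a worker process reach the parent in serialized form, and it mimics that round trip with `serialize()` and `unserialize()` alone, with no parallel code involved. The variable names are only illustrative.

```r
# An environment with one binding, standing in for a universe's environment.
e <- new.env()
assign("a", 111, envir = e)

# Round-trip through serialization, as happens when an object crosses a
# process boundary.
e_copy <- unserialize(serialize(e, connection = NULL))

# FALSE: the copy is a distinct environment object.
same_object <- identical(e, e_copy)

# TRUE: the bindings themselves were copied.
same_contents <- identical(as.list(e), as.list(e_copy))

# Changes made to the copy never reach the original.
assign("b", 112, envir = e_copy)
original_has_b <- exists("b", envir = e, inherits = FALSE)
```

This would explain why the returned environments fail the `identical` check against the originals even though the code did run.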
> This second issue is why the output differs: the actual environments in which mc*apply executes the code are not stored anywhere. This makes me wonder if we should just use mc*apply instead of mapply (even for single-core operations) and change how we deal with environments instead of having two separate pathways...
Having a single pathway makes sense. Though, did we end up implementing the crazy tree-of-environments approach or not? Would that need to change for a multithreaded approach?
If we are going to change this around, I would suggest moving to {future} at the same time as this should make it easier for users doing this on a cluster with custom setups.
We actually have the tree-of-environments implemented (and I do remember the parallel apply functions working at some point in time), but I don’t think it should be an issue. I think simply creating the environments on execution, instead of creating them a priori, makes sense here.
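A rough sketch of what creating environments on execution could look like, under the assumption that each universe's code is available as an unevaluated expression. `run_universe` is a hypothetical helper, not part of multiverse, and the sketch uses a flat set of environments rather than the tree-of-environments.

```r
library(parallel)

# Hypothetical helper: build the environment inside the worker, evaluate the
# universe's code in it, and hand the populated environment back as the result.
run_universe <- function(code, parent = globalenv()) {
  env <- new.env(parent = parent)
  eval(code, envir = env)
  env
}

code_list <- list(quote({a = 111}), quote({b = 112}), quote({c = 113}), quote({d = 114}))

# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# The returned environments are copies sent back from the workers. Because no
# environment was created beforehand, these copies are the only ones to keep.
results <- mclapply(code_list, run_universe, mc.cores = n_cores)

result_names <- lapply(results, ls)
```

Since the same function runs unchanged with `mc.cores = 1`, this shape would also fit the single-pathway idea discussed above.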
I'll write some tests for checking parallel execution.
It appears that when code is run across multiple cores, the `.results` object containing the universe environments isn't updated properly. I would expect both of these methods to have equivalent results:
Created on 2022-04-22 by the reprex package (v2.0.1)
I'm running R 4.1.2 with RStudio 2021.09.2 on Ubuntu 20.04 LTS with 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz × 8. I'm not sure if I'm missing any software libraries to enable this capability.