MUCollective / multiverse

R package for creating explorable multiverse analysis
https://mucollective.github.io/multiverse/
GNU General Public License v3.0

Multiple core execution doesn't update environments #107

Closed markromanmiller closed 2 years ago

markromanmiller commented 2 years ago

It appears that when code is run across multiple cores, the environments stored in the .results column for each universe aren't updated properly. I would expect both of these methods to have equivalent results:

library(tidyverse)
library(multiverse)
#> Loading required package: knitr
#> 
#> Attaching package: 'multiverse'
#> The following object is masked from 'package:tidyr':
#> 
#>     expand

M_cores_1 <- multiverse()

inside(M_cores_1, {
  variable_inside_env <- branch(
    var_num,
    "var1" ~ 1,
    "var2" ~ 2,
    "var3" ~ 3
  )
})

execute_multiverse(M_cores_1, cores = 1)

multiverse_results_1 <- expand(M_cores_1) %>%
  mutate(
    environment_variables = map_dbl(.results, ~length(ls(envir = .x))),
    environment_names = map_chr(.results, ~paste0(ls(envir = .x), collapse = ", "))
  ) %>%
  select(-.parameter_assignment, -.code)

print(multiverse_results_1)
#> # A tibble: 3 × 5
#>   .universe var_num .results environment_variables environment_names  
#>       <int> <chr>   <list>                   <dbl> <chr>              
#> 1         1 var1    <env>                        1 variable_inside_env
#> 2         2 var2    <env>                        1 variable_inside_env
#> 3         3 var3    <env>                        1 variable_inside_env

# Multiple cores

M_cores_2 <- multiverse()

inside(M_cores_2, {
  variable_inside_env <- branch(
    var_num,
    "var1" ~ 1,
    "var2" ~ 2,
    "var3" ~ 3
  )
})

execute_multiverse(M_cores_2, cores = 2)

multiverse_results_2 <- expand(M_cores_2) %>%
  mutate(
    environment_variables = map_dbl(.results, ~length(ls(envir = .x))),
    environment_names = map_chr(.results, ~paste0(ls(envir = .x), collapse = ", "))
  ) %>%
  select(-.parameter_assignment, -.code)

print(multiverse_results_2)
#> # A tibble: 3 × 5
#>   .universe var_num .results environment_variables environment_names    
#>       <int> <chr>   <list>                   <dbl> <chr>                
#> 1         1 var1    <env>                        1 "variable_inside_env"
#> 2         2 var2    <env>                        0 ""                   
#> 3         3 var3    <env>                        0 ""

Created on 2022-04-22 by the reprex package (v2.0.1)

I'm running R 4.1.2 with RStudio 2021.09.2 on Ubuntu 20.04 LTS with 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz × 8. I'm not sure if I'm missing any software libraries to enable this capability.

abhsarma commented 2 years ago

Thanks for pointing this out! Yes, they should have the same output. Seems like the multi-core instance is not executing all the universes (or is not executing in the correct environment). I'll take a look

abhsarma commented 2 years ago

This seems to be an issue with how we use environments and parallel::mcmapply, since the code works fine with both mapply and future.apply::future_mapply. I'm trying to figure out an alternative solution which supports multiple cores, perhaps using the future package, which has been a long-standing discussion (#54, #89), but this likely means that we can't use pbmcapply either. I'll try to look for an alternate implementation of progress bars.

abhsarma commented 2 years ago

Actually, this problem seems to exist for any multicore / multisession library. The problem probably lies somewhere in the use of environments in parallel, but I can't seem to figure out what it is...

markromanmiller commented 2 years ago

I'm going to hazard a guess that mc*apply functions are designed to return a value, not necessarily carry over the side-effects of running code - as how could one tell what those side-effects are?

One approach could be requiring the user to be specific about what objects they want to return - if mc*apply functions return one object per function, perhaps that can be the environment? I don't know, I'm spitballing here. I do currently expect to use cluster computing with multiverse in the next month or two, so I have some time to put into this feature if my need arises.
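A rough sketch of what "being specific about what to return" might look like. This is not the package's implementation: `run_universe` and its `keep` argument are hypothetical names, and it assumes a Unix-alike system where `parallel::mclapply` can fork.

```r
library(parallel)

# Hypothetical helper: evaluate one universe's code in a fresh environment
# and return only the objects the caller asked for.
run_universe <- function(code, keep) {
  env <- new.env()
  eval(code, envir = env)
  mget(keep, envir = env)  # a plain named list, returned to the parent process
}

code_list <- list(
  quote(variable_inside_env <- 1),
  quote(variable_inside_env <- 2),
  quote(variable_inside_env <- 3)
)

# Extra arguments (here `keep`) are forwarded to run_universe for every element.
results <- mclapply(code_list, run_universe,
                    keep = "variable_inside_env", mc.cores = 2)
```

Each element of `results` is then a named list for one universe, so nothing depends on side effects surviving the trip back from the worker.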

abhsarma commented 2 years ago

tl;dr your approach of rewriting the environments makes sense. I describe below *what I think* is going wrong but I'll see if @mjskay has any alternative suggestions


Interesting, so it seems like mc*apply functions do something weird with environments:

library(rlang)
library(purrr)
library(parallel) # provides mcmapply

env_list = list(new.env(), new.env(), new.env(), new.env()) # creates four new environments, with the global env as the parent
code_list = list(expr({a = 111}), expr({b = 112}), expr({c = 113}), expr({d = 114})) # random code

res = mapply(eval, expr = code_list, envir = env_list) # executes the code in each environment

map(env_list, env_names) # returns the names of the variables defined in each environment
env_list_2 = list(new.env(), new.env(), new.env(), new.env())

res = mcmapply(eval, expr = code_list, envir = env_list_2)

map(env_list_2, env_names) # returns `character(0)`

On further inspection (based on the approach you described), it seems like mc*apply functions do not return the same environments that were initially passed in, but rather return entirely new environments:

eval_in_env = function(c, e) {
  eval(expr = c, envir = e)
  e
}

env_mapply = mapply(eval_in_env, code_list, env_list)
map2(env_mapply, env_list, identical) # returns TRUE for all

env_mcmapply = mcmapply(eval_in_env, code_list, env_list)
map2(env_mcmapply, env_list, identical) # returns FALSE for all

This second issue is why the output differs: the actual environments in which mc*apply executes the code are not stored anywhere. This makes me wonder if we should just use mc*apply instead of mapply (even for single-core operations) and change how we deal with environments, instead of having two separate pathways...

mjskay commented 2 years ago

Yeah, I don't know exactly how R environments work with threads or multiple processes, but I would guess that they can't be shared across them. So I would guess that the parallel versions of apply copy environment contents into a new environment on a separate thread or process and then copy results back upon completion. So they would not be able to directly modify environments in the original thread.
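A small experiment consistent with that guess, assuming a Unix-alike system where `parallel::mclapply` forks child processes (the names `e` and `returned` are only for illustration):

```r
library(parallel)

e <- new.env()
assign("x", 0, envir = e)

# Each child process works on its own copy of `e`; the assignment made
# there does not reach the environment held by the parent process.
returned <- mclapply(1:2, function(i) {
  assign("x", i, envir = e)
  e  # returning the environment sends a serialised copy back
}, mc.cores = 2)

x_in_parent <- get("x", envir = e)              # unchanged by the workers
x_in_copy   <- get("x", envir = returned[[2]])  # value set inside the worker
same_object <- identical(returned[[1]], e)      # the copy is a different object
```

The parent's `e` keeps its original value, while the environments that come back carry the workers' assignments but are distinct objects.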

> This second issue is why the output differs, because the actual environments in which mc*apply is executing the code is not stored anywhere. This makes me wonder if we should just use mc*apply instead mapply (even for single core operations) and change how we deal with environments instead of having two separate pathways...

Having a single pathway makes sense. Though, did we end up implementing the crazy tree-of-environments approach or not? Would that need to change for a multithreaded approach?

If we are going to change this around, I would suggest moving to {future} at the same time, as this should make it easier for users doing this on a cluster with custom setups.

abhsarma commented 2 years ago

We actually have the tree-of-environments implemented (and I do remember the parallel apply functions working at some point in time), but I don't think it should be an issue. I think simply changing things to create the environments on execution, instead of creating them a priori, makes sense here.
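A minimal sketch of the "create environments on execution" idea, with a single code path for one or several cores. The helpers `eval_in_new_env` and `run_all` are hypothetical, the sketch ignores the tree-of-environments structure, and it assumes a Unix-alike system for `mc.cores > 1`.

```r
library(parallel)

# Hypothetical helper: create the environment at execution time,
# evaluate the code in it, and return the environment itself.
eval_in_new_env <- function(code) {
  env <- new.env(parent = globalenv())
  eval(code, envir = env)
  env
}

code_list <- list(quote(a <- 111), quote(b <- 112), quote(c <- 113))

# One pathway: the same call serves serial and parallel runs.
run_all <- function(code_list, cores = 1) {
  mclapply(code_list, eval_in_new_env, mc.cores = cores)
}

envs_serial   <- run_all(code_list, cores = 1)
envs_parallel <- run_all(code_list, cores = 2)

# Names defined in each returned environment
names_serial   <- vapply(envs_serial, function(e) ls(envir = e), character(1))
names_parallel <- vapply(envs_parallel, function(e) ls(envir = e), character(1))
```

Because the caller stores whatever environments come back, rather than environments created beforehand, the serial and parallel results should list the same variables.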

I'll write some tests for checking parallel execution.