DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/
Other
693 stars 39 forks source link

Identical RNG state for each task, despite setting different seeds in different tasks #265

Open wlandau opened 11 months ago

wlandau commented 11 months ago

From #251 and from vignette("parallel", package = "parallel"), I understand the desire to assign widely-spaced L'Ecuyer RNG streams to parallel workers. L'Ecuyer and parallel::nextRNGStream() minimize the risk of overlapping sequences. Unfortunately, for targets and tarchetypes, there is no such thing as a "next stream" because tasks run in a DAG instead of a linear sequence. In addition, targets has special responsibilities when it comes to reproducibility, so each target must have its own reproducible RNG state which does not depend on the parallel backend or the RNG state of another target. See https://github.com/ropensci/targets/issues/1139 and https://books.ropensci.org/targets/random.html.

All this is to say, my packages rely on the ability to call set.seed() from inside a task.

In https://github.com/ropensci/tarchetypes/discussions/156, @solmos noticed that in the context of tarchetypes, future_map() is forcing each task inside a worker to have the same RNG state. This is a different problem than #251 because set.seed() is called within the task. See below for a reprex. Oddly enough, the results are incorrect in my local R session but correct when I call it inside the actual reprex package.

library(future)
library(furrr)
plan(sequential)
future_map_dfr(
  .x = seq_len(2L),
  .f = function(seed) {
    set.seed(seed, kind = "Mersenne-Twister")
    tibble::tibble(
      user_seed = seed,
      rng_state = digest::digest(.Random.seed)
    )
  },
  .options = furrr::furrr_options(seed = NULL)
)
#> # A tibble: 2 × 2
#>   user_seed rng_state                       
#>       <int> <chr>                           
#> 1         1 f659f2cdd57f16488358b216b60e824a
#> 2         2 f659f2cdd57f16488358b216b60e824a

I would expect to see:

# A tibble: 2 × 2
  user_seed rng_state                       
     <int> <chr>                           
1         1 e0d3d05f1b721ce7d02aa5d9b936a60e
2         2 4487855c46190850f144982674465288

Session info:

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Indiana/Indianapolis
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_1.0.2   furrr_0.3.1   future_1.33.0

loaded via a namespace (and not attached):
 [1] digest_0.6.33     R6_2.5.1          utf8_1.2.3        codetools_0.2-19  tidyselect_1.2.0 
 [6] magrittr_2.0.3    glue_1.6.2        tibble_3.2.1      parallel_4.3.0    pkgconfig_2.0.3  
[11] generics_0.1.3    dplyr_1.1.3       lifecycle_1.0.3   cli_3.6.1         fansi_1.0.4      
[16] parallelly_1.36.0 vctrs_0.6.3       compiler_4.3.0    globals_0.16.2    rstudioapi_0.15.0
[21] tools_4.3.0       listenv_0.9.0     pillar_1.9.0      rlang_1.1.1  
wlandau commented 11 months ago

FYI @HenrikBengtsson

HenrikBengtsson commented 11 months ago

"The winter is coming" ... or parallel RNG is really hard when it comes to cover all scenario. I've got a bit of a backlog and this is really hard, so I don't have time to dive into it right now. It could be a simple reason and fix for your case, or it could be something bigger that requires design changes.

FWIW, there's also https://github.com/HenrikBengtsson/future.apply/issues/108, which may or may not be related.

HenrikBengtsson commented 11 months ago

... there is no such thing as a "next stream" because tasks run in a DAG instead of a linear sequence.

There's a root in the DAG, correct? If so, could you define a unique walk-through of the DAG from the root and outwards? That would allow you to order the nodes in a deterministic way (as long as the DAG does not change). With that, you could generate a unique set of RNG streams for your DAG so that they are assigned to the nodes in a deterministic way. In this sense, lapply/map/foreach uses a linear DAG where "next" is obvious (but one could come up with other walk-throughs that would also work, e.g. reverse)

wlandau commented 11 months ago

Yes, igraph::too_sort() can easily create a sequence from a DAG.

To rephrase: the recursiveness of nextRNGStream() is in direct conflict with the most fundamental goals and guarantees of targets. targets is all about cache invalidation and computational reproducibility. If streams are recursive and one stream changes, the streams of half the DAG could change, and then all those targets would need to rerun.

Take a simple targets pipeline that begins with a dataset and produces a downstream summary:

graph LR
  data --> summary

The topological sort is trivial:

library(igraph)
graph <- graph_from_literal(data-+summary)
names(topo_sort(graph))
#> [1] "data"    "summary"

To use recursive L'Ecuyer streams, the stream of summary would be the nextRNGStream() of data. So far so good.

But then what if the user adds a new model target to the pipeline?

graph LR
  data --> model
  data --> summary

The topological sort changes:

graph <- graph_from_literal(data-+model, data-+summary)
names(topo_sort(graph))
#> [1] "data"    "model"   "summary"

If streams are assigned recursively in topological order, then summary gets the nextRNGStream() of model, not data. The initial RNG state of summary changes, and the target needs to rerun. This is counterintuitive and inefficient because summary does not actually depend on the results of model.

In addition, there are two DAGs now: an explicit DAG for the intended dependency relationships and an implicit DAG for the extra dependency relationships induced by the RNG streams. targets would need to mash together both sets of edges like this:

graph LR
  data --> model
  data --> summary
  model --> summary

In the general case, even medium-sized DAGs would contort into bizarre, unpredictable, disruptive abominations.

To prevent the whole paradigm of targets from breaking down, the package uses the behaviors documented in https://books.ropensci.org/targets/random.html: the seed of each target is a deterministic function of its name and a fixed global seed. And at the end of that chapter, I explain a way (suggested by @shikokuchuo) to measure overlap empirically.

If at some point there is a way to generate safer deterministic seeds independently of one another, I will switch targets to that.

wlandau commented 11 months ago

But the original issue in this thread is more serious: different tasks in the same future_map() somehow end up with the same seeds.

wlandau commented 11 months ago

Just realized https://github.com/DavisVaughan/furrr/issues/265#issuecomment-1731829172 had typos in key places. Now fixed.

wlandau commented 11 months ago

A bit more context for others who might jump in: targets is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. targets does the same thing, but for data analysis. To decide whether to skip or rerun a task, targets looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these things change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to choose an RNG state in a permanent and independent way. The problem with recursive L'Ecuyer streams is that the RNG state would have depend on some kind of ordering. No matter what that ordering turns out to be initially, it will have to change when the pipeline changes. And when that happens, the target-specific RNG streams change, which invalidates tasks that would otherwise be up to date.

HenrikBengtsson commented 11 months ago

(sorry @DavisVaughan for adding noise here; @wlandau , feel free to move this over to another issue of yours, if you think there's a better place to discuss this)

I might miss something, but the idea that we use for map-reduce calls in Futureverse is to pre-generate the RNG streams for all "tasks". This is expensive for numerous tasks, but I don't think there's another way to achieve this.

Here's the gist:

## Imaginary tasks
X <- 1:20

## Tasks are processed in random order.
## Can also skip already done tasks.
## Result will be the same regardless.
idxs <- sample.int(length(X), size = length(X))

## Pre-generate deterministic RNG streams for _all_ tasks
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk-1]])

## Process tasks order give above with fully deterministic RNG seeds
y <- rep(NA_real_, times = length(X))
for (kk in idxs) {
  ## Use deterministic RNG stream for this task
  seed <- seeds[[kk]]
  assign(".Random.seed", value = seed, envir = globalenv(), inherits = FALSE)

  y[kk] <- rnorm(n = 1L)
}
wlandau commented 11 months ago

feel free to move this over to another issue of yours, if you think there's a better place to discuss this

Moved to https://github.com/ropensci/targets/issues/1139, starting with https://github.com/ropensci/targets/issues/1139#issuecomment-1732284698.

wlandau commented 11 months ago

Moved to https://github.com/ropensci/targets/issues/1139, starting with https://github.com/ropensci/targets/issues/1139#issuecomment-1732284698.

At least the part about why targets is different. Keeping open the original issue about seeds in the same future_map() call.