DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/

mgcv::gam(~s(pc = object not found)) when future::plan(multisession, workers > 1) #256

Closed twest820 closed 1 year ago

twest820 commented 1 year ago

I have a fit_gam() function I'd like to change from purrr::map() to furrr::future_map() in order to run rsample::vfold_cv() in parallel. Relying on map() is a major compute bottleneck; for example, just going from one worker to two would cut this week's execution times by multiple days. As can be seen from the code link, it's impractical to reduce this to a reprex, and I don't have permission to disclose the data.

However, it appears future_map(.env_globals) behaves differently between single and multiple workers. With parallel use of future_map(), resolution of the pc argument in mgcv smooths (s(..., pc = )) fails in the code flow below even though both the smooth formula and the value of pc are present in the global environment. The same code runs fine with plan(workers = 1). I can reliably toggle between working and broken just by changing the number of workers.

plan(multisession, workers = 2)

fit_gam = function(formula, data)
{
  fitFunction = function(dataFold)
  {
    return(gam(formula = formula, data = analysis(dataFold)))
  }

  return(vfold_cv(data) %>% mutate(fit = future_map(splits, fitFunction)))
}

gamConstraint = c(x = 0)
fit_gam(..., y ~ s(x, pc = gamConstraint), data = dataTibble)

Error in `mutate()`:
ℹ In argument: `fit = future_map(splits, fitFunction)`.
Caused by error:
ℹ In index: 1.
Caused by error in `s()`:
! object 'gamConstraint' not found
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `fit = future_map(splits, fitFunction)`.
Caused by error:
ℹ In index: 1.
Caused by error in `s()`:
! object 'gamConstraint' not found
---
Backtrace:
  1. parallel (local) workRSOCK()
 26. base::eval(...)
 27. base::eval(...)
 30. purrr (local) `<fn>`(.x = `<list>`, .f = `<fn>`)
 31. purrr:::map_("list", .x, .f, ..., .progress = .progress)
 35. .f(.x[[i]], ...)
 36. ...furrr_fn(...)
 37. mgcv::gam(...)
 38. mgcv::interpret.gam(formula)
 39. mgcv:::interpret.gam0(gf, extra.special = extra.special)
 40. base::eval(parse(text = terms[i]), enclos = p.env, envir = mgcvns)
 41. base::eval(parse(text = terms[i]), enclos = p.env, envir = mgcvns)
 42. mgcv::s(...)

RStudio 2023.03.0, R 4.2.2, furrr 0.3.1, future 1.31.0, mgcv 1.8-41.

I've checked the obvious possible workaround of future_map(.env_globals = parent.frame(n)) for n = { 2, 3, 4, 10, 100 }, and adjusting n has no effect. future_map(.env_globals = .GlobalEnv) also has no effect. Since this is on Windows 10, plan(multicore) seems to be an alias for plan(sequential), and changing to plan(cluster) again has no effect. Neither do several flavors of nlme-style hacks that use do.call() to force evaluation of gamConstraint when fit_gam() is called. I also haven't had luck trying to pass gamConstraint through future_map2().

Is there something else I can attempt to induce future_map() to make gamConstraint available to multiple workers' mgcvns environments when mgcv is calling s() under eval()? From the testing I've done so far, hard coding the value of gamConstraint does seem to provide a workaround.

fit_gam(..., y ~ s(x, pc = c(x = -2)), data = dataTibble)

However, the actual constraint object is somewhat complex and recalculated semi-dynamically. So this manual option is tedious, fragile, and unattractive for code maintainability.
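
One possible way to automate the hard-coding workaround, offered as a sketch under assumptions rather than a confirmed fix: splice the current value of gamConstraint into the formula with base R's bquote() before the formula is passed on, so the formula carries the value itself instead of a name that has to be resolved on the worker. The name gamFormula is hypothetical, and this has not been verified against mgcv under plan(multisession); it only mirrors what the hard-coded call does.

```r
# Constraint value as in the example above; in practice this would be
# the semi-dynamically recalculated object.
gamConstraint <- c(x = 0)

# bquote() substitutes .(gamConstraint) with the object's current value.
# eval() then evaluates the resulting `~` call, which yields a formula.
# s() itself is not evaluated at this point, so mgcv need not be loaded.
gamFormula <- eval(bquote(y ~ s(x, pc = .(gamConstraint))))

# The formula now refers only to the data columns, not to gamConstraint.
print(all.vars(gamFormula))
print("gamConstraint" %in% all.vars(gamFormula))  # FALSE
```

gamFormula would then take the place of the literal formula in the fit_gam() call. Whether a more complex constraint object survives being deparsed inside mgcv's formula handling is an open assumption here.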

DavisVaughan commented 1 year ago

Tracking in https://github.com/HenrikBengtsson/globals/issues/87 instead