HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Ignore future.globals.maxSize when plan = sequential #197

Open ck37 opened 6 years ago

ck37 commented 6 years ago

Hello,

Would it be possible to not restrict execution based on future.globals.maxSize when plan = sequential? I recently converted a method to support parallelization via futures, but if no parallelization is used (plan = sequential) users shouldn't have to increase future.globals.maxSize because no exporting of globals should happen. What do you think?

Thanks, Chris
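For context, the check in question is controlled by the real `future.globals.maxSize` option (specified in bytes). The snippet below is a minimal sketch of the workaround users currently need, assuming the future package is attached; the 2 GiB figure is an arbitrary illustration, not a recommendation:

```r
library(future)
plan(sequential)

## The limit is specified in bytes; exceeding it aborts the future with an
## error, even though a sequential future never ships its globals to
## another process. Raising it, e.g. to 2 GiB, is the current user-side
## workaround:
options(future.globals.maxSize = 2 * 1024^3)
```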

HenrikBengtsson commented 6 years ago

It's not a bad feature request. I've thought about removing it in the past, but there's a conflict between this and the objective of having all futures behave as similarly as possible.

The argument for removing the check when using plan(sequential), other than that it is "useless" there, is that when you write code using futures, it would be nice if 'sequential' behaved as if you hadn't used futures at all. That minimizes the friction the user experiences. I agree, it's no fun having to instruct users to increase the limit via options() as the very first thing. Also, the default future.globals.maxSize is somewhat arbitrary - the "correct" limit depends on many things.

The argument for keeping it is that the user gets the same experience regardless of which future backend is in use. In the same vein, this should also apply to false-negative globals, i.e. if plan(multisession) would fail because a global variable was not picked up automatically, then plan(sequential) or plan(multicore) should give the same run-time error. I've got some ideas for achieving this, but it's tricky. In the long run, I think this strategy would avoid what we see with some other parallel code/frameworks in R, namely that package developers on Linux/macOS test only on forked processes (e.g. registerDoParallel(ncores)), where globals are not an issue, and forget about other backends (and Windows users). There are quite a few packages using foreach() %dopar% ... out there that only work with forked processes; if their authors had been forced to test with, say, PSOCK processes, they would have gone the extra mile to get exported globals and packages right.
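A hedged sketch of the false-negative case described above (the variable name `a` is purely illustrative): globals are discovered via static code analysis, so a variable referenced only indirectly, e.g. through `get()`, is missed and not exported to a multisession worker, while sequential and forked backends still find it because they share the master's environment or memory:

```r
library(future)

a <- 1

## Static code analysis cannot see that 'a' is needed inside the future
## expression, so 'a' is not exported to a multisession worker:
plan(multisession)
f <- future(get("a"))
## value(f) would then error with "object 'a' not found" ...

## ... while the same code silently works with a sequential (or forked)
## backend, because the future evaluates in the master's process:
plan(sequential)
value(future(get("a")))
```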

I'm not saying no to your proposal/idea, but I have to think about this one. Maybe there could be a strict = TRUE/FALSE argument/option for this. This option could be forced to strict = TRUE when running R CMD check, which would lower the risk of mistakes slipping through the cracks while end users don't have to worry about it.

In the meantime, as a package developer (assuming that's what we're talking about here), you can temporarily disable the limit, or increase it by some factor, yourself in 'sequential' settings, e.g.

if (inherits(plan(), "sequential")) {
   ## Lift the limit, restoring the previous value when the function exits;
   ## add = TRUE avoids clobbering any earlier on.exit() handlers.
   oopts <- options(future.globals.maxSize = +Inf)
   on.exit(options(oopts), add = TRUE)
}

It's not beautiful, but it's a workaround for now.

ck37 commented 6 years ago

Sounds fair to me, and thanks for that options snippet. Yes, I'd personally lean towards the "strict" option as a nice compromise.

It might also be helpful to tweak the error message that the future.globals.maxSize check reports, to clarify that it may not be relevant for a given backend. E.g. when using multicore parallelization on Linux, I believed, based on prior experience, that exporting globals should not be needed, but the error message made me second-guess that. I switched to lapply because I didn't have time to investigate whether future was erroneously exporting globals (I was dealing with a 45 GB dataset on a tight deadline). Someone with less knowledge of the backends' internals would be even more confused.

HenrikBengtsson commented 6 years ago

Thanks for this feedback - it's helpful.

Additional alternatives (to strict) would be to allow such errors to be demoted to warnings (clarifying that they may be errors with other backends), and, as you may be hinting at, an option for backend developers to tweak the error message.

> I switched to lapply because I didn't have time to investigate if future was potentially erroneously exporting globals (dealing with a 45GB dataset when late on a tight deadline). Someone with less knowledge of the internals of backends would be even more confused.

I actually did something similar myself when working with a huge Hi-C data set and got these errors; I set the limit to +Inf so I could keep working. At the end of the day, it turned out to be caused by a bug that hugely overestimated the size of the globals in some cases. (Maybe that's what hit you too.) This was fixed in future 1.7.0.

UPDATE: linked to the wrong bug

HenrikBengtsson commented 6 years ago

I linked to the wrong bug/issue; should have been https://github.com/HenrikBengtsson/future/issues/176