HenrikBengtsson / future

R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

DISCUSSION: Eager / lazy futures and synchronous / asynchronous futures #109

Open HenrikBengtsson opened 7 years ago

HenrikBengtsson commented 7 years ago

Here's how I think the Future API could be modified to meet the needs for distinguishing eager and lazy evaluation of futures.

Consistent API with option for lazy evaluation

This means all futures will be resolved eagerly by default. If a future should be resolved lazily, then it should be up to the one who writes the code (e.g. package or script developer) to decide, not the end user.

A trade-off of evaluating everything eagerly is that we can no longer guarantee that all futures are non-blocking. More precisely, if a sequential / non-asynchronous backend is set, then the future has to block when it is created (rather than when the value is requested, as is the case with lazy evaluation). This also affects how plan(multisession) falls back to sequential processing when only a single core is available; previously it was falling back to plan(lazy).
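The blocking trade-off can be observed with a sequential backend. The following is a minimal sketch, assuming a version of the future package in which future() accepts the lazy argument and the sequential strategy is available under the name plan(sequential). The marker file is only an illustrative device for observing when the expression actually runs.

```r
library(future)
plan(sequential)

marker <- tempfile()

# Eager (default): the expression runs while the future is being created,
# so the call to future() blocks until it is done
f_eager <- future({ file.create(marker); "eager" })
eager_ran_at_creation <- file.exists(marker)

unlink(marker)

# Lazy: nothing runs until the value is requested
f_lazy <- future({ file.create(marker); "lazy" }, lazy = TRUE)
lazy_ran_at_creation <- file.exists(marker)
v <- value(f_lazy)
lazy_ran_after_value <- file.exists(marker)
```

With a sequential backend the work is the same in both cases; only the point at which the calling session blocks moves from future() to value().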

Also, the existing names eager and lazy (as in plan(eager) and plan(lazy)) are unfortunate names for future strategies and should be deprecated if the above API is implemented. Thoughts:

Consistent API with option for synchronous evaluation

Is synchronous versus asynchronous more important? The above discussion addresses needs by developers who need to control whether futures are resolved eagerly or lazily. However, there could also be a need for controlling whether futures are resolved synchronously or asynchronously. To control the latter, we would also need to add something like:

I cannot see that anyone needs the asynchronous = FALSE case, but I can see that someone needs asynchronous = TRUE. Today, we have asynchronous = TRUE in most cases, but it cannot be guaranteed.

As mentioned above, there will be future backends where choice of arguments asynchronous and lazy will be impossible. For instance, what should happen when user sets:

plan(uniprocess)

and a package uses

f <- future(42, asynchronous = TRUE, lazy = FALSE)

? Should this give an error, or should asynchronous have higher priority than lazy or vice versa? In other words, if we add support for controlling both asynchronous and lazy, then we introduce other "unknowns" in behavior. Maybe the rule would be that one can only specify one of asynchronous and lazy but not both.

Make uniprocess futures the exception?

The above open-ended questions are mostly due to the fact that we allow futures to be resolved sequentially in the calling R session. It would probably be easier to define the Future API more consistently if we considered that an outlier / exception to the normal use case. However, it is handy to use sequential processing by default, because it will always work. The second best is to use multisession futures by default, because that is very likely to work everywhere (modulo permission issues with opening local ports). Also, if one made multisession futures the new default, should it use a single background R session as a worker or should it use all available cores?

See also

This issue tries to gather previous feedback and discussions in one place. For background, see:

petermeissner commented 7 years ago

Over the last week I have read a little about future and thought a lot about its interface, especially good defaults. First of all, I think that's non-trivial and you have already done a lot of good structuring. Second, I will speak as a potential user, not as a CS expert or long-time user.

So far I think one can see futures as

For me the non-blocking aspect is the perspective that matters (i.e. doing more different tasks at the same time instead of doing the same stuff over and over again). Furthermore, I would argue that future should take that perspective as its guideline, since parallelization is served by many packages already while the non-blocking part has to be hacked together.

Given that I would argue furthermore for the following sequence of defaults:

These defaults should cover most use cases out of the box while providing sensible fallbacks for compatibility, and still allow the user to override them by setting options explicitly.

my 5 pence.

HenrikBengtsson commented 7 years ago

@petermeissner, thanks for these comments.

So, I realize that we cannot really guarantee non-blocking processing unless there is an unlimited number of workers (e.g. a compute cluster queue). For instance, if we use plan(multiprocess, workers = 4L), it will block when we try to create the fifth future while the previous four are still being resolved. The only way it would not block is if the futures are also created lazily in those cases, but most people want eager parallel evaluation. The only workaround for this is to have an internal queue of futures that will be resolved as resources become available. However, going down that path is major work and basically risks reinventing job schedulers and / or BatchJobs.

So, I think the guarantee needs to be relaxed to: non-blocking unless all available workers are busy.
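The "blocks when all workers are busy" behavior can be illustrated with a small timing experiment. This is a rough sketch that assumes multisession workers can be launched on the machine; it uses two workers instead of four to keep it cheap, and the sleep durations are arbitrary.

```r
library(future)
plan(multisession, workers = 2L)

# Occupy both workers with slow tasks
f1 <- future({ Sys.sleep(3); 1 })
f2 <- future({ Sys.sleep(3); 2 })

# Creating a third eager future has to wait until one worker is free
t0 <- Sys.time()
f3 <- future(3)
blocked_for <- as.numeric(difftime(Sys.time(), t0, units = "secs"))

vs <- c(value(f1), value(f2), value(f3))

plan(sequential)  # shut down the background workers
```

Here blocked_for is expected to be close to the remaining run time of the first two futures, whereas creating f3 with lazy = TRUE would return immediately.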

petermeissner commented 7 years ago

For me it seems ok to get blocking when no further resources can be used. You still get 'heterogenous' parallelization and the option of no blocking given that resources are available.

HenrikBengtsson commented 7 years ago

For the record / not to forget: there should be a simple option (i.e. asynchronous = TRUE above) to use asynchronous futures also on single-core machines, cf. Issue #115. It's possible today but it's a bit complicated.

HenrikBengtsson commented 7 years ago

cc/ @krlmlr, @michaelsbradleyjr, @thomasp85, @clarkfitzg, @petermeissner, since you've all shown interest in lazy = FALSE / TRUE in one way or the other. Please consider trying out the develop version:

source('http://callr.org/install#HenrikBengtsson/future@develop')

The develop branch now implements:

The lazy argument is now in full control of the developer (not the user; see below). I'm keeping defaults backward compatible (this avoids having to decide / discuss that at this point) but already now the developer can be explicit about lazy if that is critical.

It is possible to create any number of lazy futures without blocking, regardless of future plan / type of backend. For instance, one can do:

library("future")
plan(multisession, workers = 2L)
fs <- lapply(1:20, FUN = function(i) future(i, lazy = TRUE))
length(fs)
## [1] 20

With lazy=FALSE, the above would block at about every second future to wait for one of the two active futures to be resolved, and so on.
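Creating lazy futures is only half of the picture: they are resolved once their values are requested. A minimal sketch of collecting the results, using plan(sequential) only to keep the example lightweight (the same code should work with other backends):

```r
library(future)
plan(sequential)

# Create 20 lazy futures; none of them is evaluated yet
fs <- lapply(1:20, FUN = function(i) future(i * 2L, lazy = TRUE))

# Requesting each value triggers evaluation of the corresponding future
vs <- vapply(fs, FUN = value, FUN.VALUE = integer(1))
```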

There are package tests for both lazy = TRUE / FALSE on all supported future strategies on all OSes.

It is not possible for the end user to control / override lazy via plan(), e.g.

> plan(multisession, lazy = FALSE)
Error in tweak.future(function (expr, envir = parent.frame(), substitute = TRUE,  : 
  Future argument 'lazy' must not be tweaked / set via plan()

(*) Actually, the default is lazy = NA right now, which equals lazy = FALSE in all cases, except when plan(lazy) is used, for which it equals lazy = TRUE. This is only so that everything stays backward compatible at this point. The goal is to deprecate plan(lazy); in order to do that I need to come up with a better name for plan(eager), maybe plan(sequential).

thomasp85 commented 7 years ago

Just to clarify (I haven't got access to a PC right now): if you create the futures directly using e.g. multiprocess, these changes do not affect anything?

HenrikBengtsson commented 7 years ago

You can do multiprocess(..., lazy = FALSE / TRUE) too, e.g.

fs <- lapply(1:10, FUN = function(i) multiprocess(i, lazy = TRUE))

But, true, the introduction of argument lazy for all future constructor functions does not affect anything for multiprocess(). The only exception would be if the default for lazy were to change one day, in which case you'd need to be explicit about lazy = FALSE.

HenrikBengtsson commented 7 years ago

I should also say that all updates for the next release should be backward compatible so nothing breaks downstream, cf. https://github.com/HenrikBengtsson/future/blob/develop/revdep/README.md

clarkfitzg commented 7 years ago

I started playing with this today and like the new functionality where the person writing the %<-% can specify lazy = TRUE / FALSE. But what if it doesn't matter, or they're not sure? In this case it would be nice if the end user could pick the default value for lazy using something like plan(multisession, lazy = FALSE). Is this just something that would need to be implemented?

Other thoughts:

+1 for defaulting to lazy=FALSE with these changes. I would expect a script to generally execute faster if workers start earlier.

%lazy% TRUE feels a bit clunky. But if one can do all of:

HenrikBengtsson commented 7 years ago

@clarkfitzg, thanks for the feedback and thoughts.

The quick history behind future(..., lazy) is that developers identified cases where they need to control this as developers, and where the functionality would / could break if the user were allowed to control it. Introducing lazy = TRUE / FALSE and only allowing the developer to control it addresses that need.

This is also why plan() and tweak() actively check for and prevent any attempt to modify the lazy argument. Without those checks, it would actually be possible for the user to do plan(multisession, lazy = TRUE) and thereby break the intentions of the developer.

Yes, I think lazy = FALSE being the default will be a much more common use case and therefore allow many more to just write:

f <- future({ ... })
v %<-% { ... }

which is why I believe this will remain the default behavior. If there's a big need / request for analogue "lazy-evaluation" alternatives, one could imagine a special API(*) for that too, e.g.

f <- lazy_future({ ... })
v %<~% { ... }

But, I don't think we need to rush that.

(*) Obvious shortcut candidates would of course be eager() and lazy() for future(..., lazy = FALSE / TRUE), but that certainly has to wait because those functions mean something different right now. And I want to keep the API to a minimum.

Now to your thoughts:

Do you think there is a real need here? First of all, it would require the developer to actively support (and test) a case where eager or lazy evaluation doesn't matter and that the user would really care about / need to have such an option.

With the current proposal, it is actually possible for the developer to support this without further API changes in the future package. For example, s/he could write

f <- future({ ... }, lazy = getOption("mypkg.lazy", FALSE))
v %<-% { ... } %lazy% getOption("mypkg.lazy", FALSE)

and then the user can use

options(mypkg.lazy = TRUE)

if they want something else than the default.

The above is of course a bit tedious for the developer to code (though it would be more work to test it), but it can be simplified as:

f <- my_future({ ... })
v %<-my% { ... }

where these are defined locally in the mypkg package only to be used internally by the developer of that package.
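A minimal sketch of what such a package-internal wrapper might look like. The name my_future and the option mypkg.lazy come from the discussion above; the implementation itself is an assumption, not code from the future package, and the marker file is only there to show when the expression runs.

```r
library(future)

# Hypothetical package-internal wrapper: the developer lets the end user
# choose between eager and lazy evaluation via the 'mypkg.lazy' option
my_future <- function(expr, envir = parent.frame()) {
  expr <- substitute(expr)  # capture the unevaluated expression
  future(expr, envir = envir, substitute = FALSE,
         lazy = getOption("mypkg.lazy", FALSE))
}

plan(sequential)
marker <- tempfile()

options(mypkg.lazy = TRUE)   # the end user opts in to lazy evaluation
f <- my_future({ file.create(marker); 42 })
ran_before_value <- file.exists(marker)
v <- value(f)
ran_after_value <- file.exists(marker)
options(mypkg.lazy = NULL)   # restore the default (eager)
```

Because the option is read inside the wrapper, plan() never sees a lazy argument, so the restriction on tweaking lazy via plan() is not affected.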

clarkfitzg commented 7 years ago

Keeping the API simple sounds great.

I was thinking of only modifying the lazy argument if it isn't explicitly set. For example, suppose authenticate() takes a few seconds and is always needed, but fetch_data() may not be, so one writes:

a %<-% authenticate() %lazy% FALSE
b %<-% fetch_data("mydata", auth = a)

Then call plan(multisession, lazy = TRUE) and it would run as if one had written:

a %<-% authenticate() %lazy% FALSE
b %<-% fetch_data("mydata", auth = a) %lazy% TRUE

This is more complex, but if the code is in a package then an end user won't necessarily be aware that it's happening. And developers are free to set it when it matters.

Here's a hypothetical use case expanding on the code above. Developer uses future to cache some large data sets. On a server they want to populate the cache so they use plan(multisession, lazy = FALSE). This then takes a long time to populate the cache which takes up most of the server's memory. But on their local machine they don't have that much memory and they just want to look at one thing, so they use plan(multisession, lazy = TRUE). There are many alternative ways to do this task, but this seems elegant.

I certainly wouldn't call this behavior a "need". As you point out, it can be done now. Rather it's a way of being more flexible, not choosing until you have to. I haven't thought all this through a great deal- I'll try to use future over the next couple weeks and maybe then have more ideas.

HenrikBengtsson commented 7 years ago

A possible way to support the user's request for "use-lazy-if-possible" could be to support:

a %<-% authenticate() %lazy% FALSE
b %<-% fetch_data("mydata", auth = a) %lazy% NA

where NA falls back to a setting. Exactly how the user should specify this setting / option is not clear, (but it should not be plan(..., lazy=...)) - maybe another argument name. On the other hand, this use case might be so case specific that it's simply better to control it via a regular option, e.g. options(fetch.data.lazy = TRUE). But maybe a global NA covers lots of cases. I think this is a use case where "time will tell" and built-in support can be added later.
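Until something like this is built in, the fallback can be emulated in plain R. The helper below is hypothetical (lazy_or_setting is not part of the future package); it only assumes the option name fetch.data.lazy mentioned above.

```r
# Hypothetical helper: turn a tri-state 'lazy' value into TRUE / FALSE.
# NA means "the developer did not decide", so the user's option applies.
lazy_or_setting <- function(lazy = NA, option = "fetch.data.lazy",
                            default = FALSE) {
  stopifnot(is.logical(lazy), length(lazy) == 1L)
  if (is.na(lazy)) isTRUE(getOption(option, default)) else lazy
}

# Developer decided: the user's option is ignored
eager_always <- lazy_or_setting(FALSE)

# Developer did not decide: the user's option is used
options(fetch.data.lazy = TRUE)
from_user <- lazy_or_setting(NA)
options(fetch.data.lazy = NULL)
```

The result could then be passed on explicitly, e.g. future(..., lazy = lazy_or_setting(NA)), so that the future package itself only ever sees TRUE or FALSE.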

clarkfitzg commented 7 years ago

I think this is a use case where "time will tell" and built-in support can be added later.

Sounds good.

HenrikBengtsson commented 7 years ago

A heads up:

Already the next release will have plan(lazy) and f <- lazy(...) deprecated, with the suggestion to use f <- eager(..., lazy = TRUE) and v %<-% { ... } %lazy% TRUE.

I was thinking of renaming eager to sequential at the same time, but I think I'll wait with that until the following release cycle. Then I'll probably also introduce, analogously to sequential, the parallel future strategy (which will bring multicore, multisession, multiprocess, cluster and remote under the same umbrella such that one can use parallel in all cases). But I want to try out and think about sequential / parallel a bit more before changing that. By not making this move now, the next release will add more features while keeping the number of "surprises" / changes small.

PeterVermont commented 7 years ago

I was experimenting with future(lazy=TRUE... and I found that I wanted/expected the future to begin evaluation when I called resolved on it. Just because it is lazy does not necessarily mean that I only want to be able to use blocking access...

HenrikBengtsson commented 7 years ago

Not sure I understand. Can you clarify what you're missing / what's not working? A teeny reproducible example with a few lines of code often clarifies things.

PeterVermont commented 7 years ago

In my gist FutureTaskProcessor.R the expected workflow is that someone will call startAsyncTask and then later call processRunningTasks(wait=FALSE). However, if the user makes their future with lazy=TRUE the task will never start, since processRunningTasks will not call value() until resolved is true. Future does not begin the task when resolved is called, so it will never be started.

It seems to me that if the user calls resolved they are making it clear that they wish the item to be running...

HenrikBengtsson commented 7 years ago

Ok, I think I understand; you cannot count on using while (!resolved(f)) { ... } with lazy futures because it does not trigger them to start. That's the issue, correct?

PeterVermont commented 7 years ago

Yes!

HenrikBengtsson commented 7 years ago

I'm open to add missing features, while keeping the Future API at a minimum. In other words, I'm trying to hold back and to be very conservative before adding bells and whistles - it's easy to add but very hard to remove features.

I need to think more about this. What I'm not sure about is whether this is really needed. In order for a future to be lazy, you as a developer have to explicitly request this via lazy = TRUE in the code. A user cannot do this; a user only has control of the strategy used via plan(), which gives an informative error if you try to pass lazy. So, when you say "if the user makes their future with lazy=TRUE", how is that happening? Is that because users can inject random code? If not, it seems like you as the developer have full control and have little need for resolved() (or?). If you use f <- future(..., lazy = TRUE) in your code, you already know it's lazy, and in this case, at what point other than calling value(f) do you want to "start" a lazy future, since you don't want to start it from the beginning?

PeterVermont commented 7 years ago

I understand your desire to keep a clean API. In this case no additional functions would be required -- just a different behavior, which is that calling resolved on a future which is lazy should initiate the task. Another way to think about it is that if they are calling resolved on an unstarted lazy task they are almost certainly making a mistake, since it can never return true. There are three approaches for library developers when they detect a user error: 1) do nothing (current behavior), 2) throw an exception or at least a warning: warning("resolved called on lazy future object. This will always return false since the task has not started."), or 3) (my preference) support the user's inferred desire and start the task if needed.

The use case is less clear, but here is one hypothetical: I wish to queue up a bunch of future tasks that are waiting on some other triggering action. For example, Action A must be completed before Action B begins, but it is convenient to create them both at the same time. When action A completes (which is discovered by some outside polling task calling resolved()), I would then wish to start action B asynchronously.
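For reference, the A-then-B workflow can also be approximated without lazy futures, by deferring the creation of B instead of its evaluation. This is a sketch of that alternative, not of the behavior proposed above; the names action_a and start_action_b are made up for illustration.

```r
library(future)
plan(sequential)  # any backend would do; sequential keeps the sketch simple

action_a <- future({ Sys.sleep(0.1); "A done" })

# Deferred creation: B is described now, but the future is not created
# (and therefore not started) until this function is called
start_action_b <- function(a_result) {
  future({ paste(a_result, "-> B done") })
}

# Poll A without requesting its value
while (!resolved(action_a)) Sys.sleep(0.05)

# A has finished, so B can be created and started
action_b <- start_action_b(value(action_a))
result <- value(action_b)
```

With an asynchronous backend, the polling loop could do other work between checks, and B would start in the background as soon as it is created.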

HenrikBengtsson commented 7 years ago

I see. So, one issue that complicates what the default should be is what happens with non-asynchronous backends / futures, e.g. plan(sequential) and f <- future(slow_fcn(), lazy = TRUE). If resolved(f) forces the lazy future to be resolved, then it will be blocking until the future is resolved.

Some thoughts: One could imagine an argument resolve = TRUE / FALSE for controlling this, but an explicit call as:

resolve(f, wait = FALSE)

may be clearer. The wait argument is not implemented yet; this feature is mentioned in the top comment. Would that fit your use case / needs?

PeterVermont commented 7 years ago

While doing resolve(f, wait = FALSE) would work, it does not seem obvious to me that the wait argument will affect whether a lazy future is started or not. Basically, in my view resolve is implicitly always wait = FALSE, so I think adding a wait parameter that can be set to true for resolved just seems confusing and a departure from your desire to keep the API clean and simple.

Lazy futures with sequential plans seem strange but I guess my own scenario of wanting to queue up a bunch of actions could be a rationale for supporting it.

So... I would suggest not adding a parameter to resolve -- it muddies the waters. If called on an unstarted lazy it should start it. If that happens to be with a sequential plan so be it -- it will wait -- not because you are explicitly waiting but simply it is running in the current process.