eriknw / dask-patternsearch

Scalable pattern search optimization with dask
BSD 3-Clause "New" or "Revised" License

Batch function applications for fast functions #1

Closed. mrocklin closed this issue 7 years ago.

mrocklin commented 7 years ago

When the objective function is fast to run, it might make sense to run several evaluations within a single task.

eriknw commented 7 years ago

Excellent suggestion. I'll do this. Thanks!

mrocklin commented 7 years ago

Apologies if you've already thought of this by the way. I'm just emitting thoughts from running through a trivial example.

eriknw commented 7 years ago

Nope, I haven't thought of this yet. Thanks for taking dask-patternsearch for a test drive and providing feedback.

I expect it's not too much of a burden for the user to specify a batch size as a keyword argument. Were you thinking of being more clever and adapting the batch size based on some observed behavior?
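
For illustration, a rough sketch of what the manual knob could look like; batch_size, evaluate_chunk, and submit_batched are hypothetical names, not anything in dask-patternsearch:

from dask.distributed import Client

def evaluate_chunk(func, chunk):
    # One task evaluates a whole chunk of trial points.
    return [func(x) for x in chunk]

def submit_batched(client, func, points, batch_size=10):
    # Group trial points into chunks of batch_size and submit each chunk as a
    # single task, so cheap evaluations don't each pay per-task overhead.
    chunks = [points[i:i + batch_size] for i in range(0, len(points), batch_size)]
    return [client.submit(evaluate_chunk, func, chunk) for chunk in chunks]

# e.g. futures = submit_batched(Client(), lambda x: x.dot(x), points, batch_size=20)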

Also, this worked on a trivial example for you? Hooray! You're bold ;)

mrocklin commented 7 years ago

I just ran your x.dot(x) example.

I was thinking of automated batching. For example, rather than submitting the objective function directly, you might submit a wrapper that maps it over a few inputs and also times the computation:

from time import time

def f(func, states):
    # Evaluate the objective on a small batch and report the mean time per call.
    start = time()
    results = list(map(func, states))
    end = time()
    return (end - start) / len(results), results

eriknw commented 7 years ago

I see. This should be doable, although I'll hold off for a bit until we have better tests. How should we spell this? A min_task_time=0.05 keyword argument, or should we also try to estimate the scheduler and communication overhead and choose the target task time accordingly? Do you have any intuition or experience for what a good min_task_time would be?

It's also possible for function evaluations to take wildly different times. For example, a function may begin with if not is_feasible(x): return np.inf.
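
A rough sketch of how an adaptive choice could work; choose_batch_size and min_task_time are hypothetical here, and per_call_time would come from timing a previous batch as in the wrapper above:

import math

def choose_batch_size(per_call_time, min_task_time=0.05, max_batch_size=1000):
    # Aim for tasks that take at least min_task_time seconds: if a single
    # evaluation takes per_call_time seconds, batch enough of them together.
    if per_call_time <= 0:
        return max_batch_size
    return max(1, min(max_batch_size, math.ceil(min_task_time / per_call_time)))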

mrocklin commented 7 years ago

I agree that this is not pressing.

The ideal task duration depends on how many machines you're running on. If you assume a conservative 1ms of overhead per task and you have 1000 cores, then you want tasks that take about 1s to run. You probably want a lower bound of something like 10ms; round-trip system latency is going to be about that anyway.
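
As a rule of thumb (this just restates the arithmetic above; the 1ms overhead and 10ms floor are conservative estimates, not measured values):

def target_task_duration(n_cores, per_task_overhead=0.001, floor=0.010):
    # To keep n_cores busy, the scheduler must hand out roughly n_cores tasks
    # per task duration; at ~1ms of overhead per task, tasks should therefore
    # last about n_cores * per_task_overhead seconds, with a ~10ms lower bound.
    return max(floor, n_cores * per_task_overhead)

# target_task_duration(1000) -> 1.0 second, matching the estimate above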

eriknw commented 7 years ago

Cool, thanks.

One use case I've thought about supporting is sub-task reuse for separable functions. For example, for

def f(w, x, y, z):
    return g(w, x) * h(y, z)

results for g and h could be reused. Due to the grid-like structure of how points are sampled with the stencil, the performance benefit could be significant. This should be easy to do with delayed.
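
A minimal sketch of that delayed approach, leaning on dask's consistent hashing (pure=True) to collapse repeated sub-evaluations; g and h here are just placeholders:

import dask
from dask import delayed

def _g(w, x):
    return w * x  # placeholder sub-objective

def _h(y, z):
    return y + z  # placeholder sub-objective

# pure=True makes each delayed key a hash of the function and its arguments,
# so identical calls collapse to a single task in the graph.
g = delayed(_g, pure=True)
h = delayed(_h, pure=True)

def f(w, x, y, z):
    return g(w, x) * h(y, z)

# Two trial points that share (w, x): the repeated g(1, 2) is computed only once.
points = [(1, 2, 3, 4), (1, 2, 5, 6)]
results = dask.compute(*[f(*p) for p in points])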

I think batching tasks could complicate doing this.

mrocklin commented 7 years ago

@jcrist just went through optimizing this sort of problem.

Using Dask.delayed and depending on consistent hashing to remove duplicate work is definitely a solution, but you might do well to handle this by hand as well if the number of trials grows large enough for overhead to matter (although this is probably less of an issue in your case since everything is asynchronous).

eriknw commented 7 years ago

Thanks. I'll take a look at the work Jim did (and bother him if needed) when I get that far. You're right, though: I expect overhead to be less important since everything is fully asynchronous. Less important doesn't mean unimportant, though. One step at a time...

eriknw commented 7 years ago

I think I'm okay with requiring the user to manually specify the batch size. Doing so will exclude the possibility of sharing intermediate results as discussed above, but the expectation when batching is that the function evaluates quickly anyway.

I think we should also provide an option to group the trial points into a single array. I assume it's noticeably more efficient for the scheduler to handle a single array than N smaller arrays. The main use case to support, though, is a vectorized cost function, which can be orders of magnitude more efficient (especially if executed on GPUs). This fits nicely into our goal of performing "effective brute force optimization" (new tagline?).
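
A rough sketch of the single-array variant; evaluate_batch and vectorized_cost are illustrative names, not existing API:

import numpy as np

def evaluate_batch(vectorized_cost, points):
    # Stack N trial points (each a length-d array) into one (N, d) array and
    # evaluate them with a single vectorized call instead of N separate tasks.
    batch = np.stack(points)
    return vectorized_cost(batch)

# Example with a vectorized sphere function: one cost per row of the batch.
costs = evaluate_batch(lambda X: (X ** 2).sum(axis=1),
                       [np.array([1.0, 2.0]), np.array([3.0, 4.0])])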

eriknw commented 7 years ago

Done. The user must specify the batch size.