gjoseph92 opened this issue 3 years ago
There are a couple of reasons to leave it the way it is:
Those aren't necessarily dealbreakers, though; it's an option worth talking about. I will say that it doesn't seem as critical to me as some other changes do.
I agree that it's not the most important thing to focus on. I think it would mainly benefit:
Since we have other latencies that you certainly do notice, it seems low priority. I'm curious if it comes up for others though. Thanks for the explanation of the downsides; I think those are all surmountable, but may not be worth the effort currently.
For 2, it's also worth noting that one can already do all of this on the client side using the various non-collections interfaces, like futures. If folks want to get creative, all of the tools are there :)
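For example, here's a rough sketch of doing it by hand with the futures interface (the array, chunk sizes, and local cluster are made up purely for illustration): split the collection into one piece per chunk, submit them as a single graph, then stream the chunk results back as they finish and concatenate locally.

```python
import numpy as np
import dask.array as da
from distributed import Client, as_completed

client = Client()  # local cluster, just for illustration

x = (da.random.random((16_000, 16_000), chunks=(4_000, 4_000)) + 1) ** 2

# One Delayed per chunk, submitted together so intermediate tasks are
# shared; each chunk then comes back as its own future.
blocks = x.to_delayed()                        # object ndarray, shape == x.numblocks
futures = client.compute(blocks.ravel().tolist())

# Map each future back to its position in the block grid (C order,
# matching .ravel() above).
where = dict(zip((f.key for f in futures), np.ndindex(*blocks.shape)))

pieces = np.empty(blocks.shape, dtype=object)
for future, chunk in as_completed(futures, with_results=True):
    pieces[where[future.key]] = chunk          # chunks arrive as they complete

result = np.block(pieces.tolist())             # concatenate locally on the client
```

Whether the bookkeeping is worth it over a plain `.compute()` is case-by-case, but the building blocks are all there.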
Option 3 seems like the key point to me here.
Sometimes, the wait between “all tasks are done on the dashboard” and “.compute() returns” can be long (minutes) when computing large arrays (~60GiB), even with a local cluster. (Yes, I’m definitely oversubscribing my laptop in that case, but I’ve still generally found this type of latency noticeable in less-extreme situations.)
What is the historical reasoning for running `__dask_postcompute__` on the cluster and sending back one final concatenated result, instead of streaming each key/chunk back to the client as it completes and doing the concatenation locally? If we were sending chunks back in parallel, I imagine we could hide that latency a bit, plus release some memory from workers sooner. Just wondering what the reasons are for not doing so.
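For concreteness, here's a minimal sketch of the piece I mean, using a toy 1-D array: `__dask_postcompute__` hands back a finalize function (plus extra arguments) that turns the per-key chunk results into the final concrete object. Today that finalize runs as a task on the cluster and one concatenated result is shipped back; the question is whether it could instead run client-side on chunks fetched individually.

```python
import numpy as np
import dask.array as da

x = da.ones(8, chunks=4)                       # two chunks of length 4

# The collections protocol: a finalize function plus extra args.
finalize, extra_args = x.__dask_postcompute__()

# Pretend these two blocks were streamed back from workers key by key;
# applying finalize locally concatenates them on the client instead.
chunk_results = [np.ones(4), np.ones(4)]
final = finalize(chunk_results, *extra_args)   # ndarray of shape (8,)
```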