quickly return WorkerPool, add workers later

kolia commented 3 years ago

Getting many requested pods can trigger scale-up which takes time.

Currently this is handled with a timeout: any requested pods that have not started and connected by the timeout are dropped, and launch returns with however many pods have come up by then. This can be awkward.

An alternative way to deal with spin-up slowness is to return a WorkerPool quickly, maybe as soon as there is one worker connected, and continue adding workers to that pool after returning.

To be practical, this method should have a worker initialization hook, so that workers only join the WorkerPool after evaluating some quoted code in Main, typically using commands that load packages, plus other definitions.
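A minimal sketch of what such a hook could look like, assuming the worker is already connected. `join_pool!` is a hypothetical helper name (not part of Distributed.jl or K8sClusterManagers.jl), and the local `addprocs(1)` call only stands in for a pod coming up:

```julia
using Distributed

# Hypothetical helper: evaluate `init` in Main on worker `pid`, then add the
# worker to `pool`. If the evaluation throws, the worker never joins the pool.
function join_pool!(pool::WorkerPool, pid::Integer, init::Expr)
    remotecall_fetch(Core.eval, pid, Main, init)
    push!(pool, pid)
    return pool
end

pool = WorkerPool()                 # starts empty
init = quote
    using Statistics                # packages the work will need
    scale(x) = 2x                   # other definitions
end

for pid in addprocs(1)              # local worker, for illustration only
    join_pool!(pool, pid, init)
end

# The definition made by `init` is available on the pooled worker.
result = remotecall_fetch(() -> Main.scale(21), pool)
```

Because the worker is pushed only after the quoted code has run, anything that takes workers from the pool should never see a worker that is missing those definitions.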

omus commented 3 years ago

I like the concept of dynamically adding workers into a pool but unfortunately we're restricted by how Distributed.jl works at the moment. Attempting to do this with the existing Distributed.jl would definitely be painful and probably end up being very fragile.

The concept, however, could warrant a new iteration of Distributed.jl, which could start out as an external package. Some basic thoughts on what changes would be made to the existing Distributed.jl stdlib:

- Update the cluster manager interface such that workers are added via a Channel instead of a Vector
- addprocs returns a WorkerPool or something similar where workers can be added dynamically as workers report in
- @everywhere calls are applied to workers in the WorkerPool as they report in. This means that workers can execute this logic at very different times. There may be implications of this I'm not considering.
- Using pmap or @distributed can partition work based upon the expected size of the WorkerPool and start running immediately with the current set of workers available

ericphanson commented 3 years ago

Are you sure we can't dynamically add workers to a pool? It seems like pmap just take!s workers from the pool instead of predistributing the work upfront.
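A small experiment along these lines, as a sketch rather than a proof. It uses local workers for illustration; whether the late worker actually picks up any items depends on timing, so no particular split is guaranteed:

```julia
using Distributed

first_pid = first(addprocs(1))
pool = WorkerPool([first_pid])

# Start the work while the pool holds a single worker.
job = @async pmap(pool, 1:8) do i
    sleep(0.5)            # stand-in for real work
    (i, myid())           # record which worker handled item i
end

# A second worker joins while pmap may still be running.
late_pid = first(addprocs(1))
push!(pool, late_pid)

results = fetch(job)
used = unique(last.(results))   # worker IDs that processed at least one item
```

If `used` contains both IDs, the pool picked up the late worker mid-run. If it only contains `first_pid`, the work probably finished before the second worker connected, and a longer sleep or more items would be needed to see the effect.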

(Though a redesign does sound good too!)

omus commented 3 years ago

> Are you sure we can't dynamically add workers to a pool?

Here's the code that calls the launch method defined by the Distributed interface:

https://github.com/JuliaLang/julia/blob/c2b4b382c11b5668cb9091138b1fa9178c47bff5/stdlib/Distributed/src/cluster.jl#L480-L499

You're expected to add workers to the launched vector and there's no way to pass back a WorkerPool. There may be a way of doing a non-blocking launch and calling setup_launched_worker within it, but I'd expect you to run into strange corner cases.

If someone wants to look into this further that would be great. They may find something I've missed or at worst validate my assessment.

omus commented 3 years ago

I may have thought of a workaround to this problem. If we define an alternative to addprocs, maybe called spawn, it could internally call addprocs asynchronously, adding a single worker at a time. This should allow us to immediately return a mutable Vector of worker IDs or possibly even a WorkerPool. Depending on how the internals of WorkerPool and the functions that use it behave, we may be able to allocate work to workers as they come in.
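A rough sketch of that idea, under some assumptions: `spawn` and its `add_one` keyword are hypothetical names, the default `add_one` uses the local manager purely for illustration (a real version would pass the K8s manager to addprocs), and error handling is left out:

```julia
using Distributed

# Hypothetical `spawn`: issue `n` single-worker requests and return right away.
# `add_one` adds one worker and returns the new worker IDs.
function spawn(n::Integer; add_one=() -> addprocs(1))
    pool = WorkerPool()
    tasks = map(1:n) do _
        # Each request blocks only its own task, not the caller.
        @async foreach(pid -> push!(pool, pid), add_one())
    end
    return pool, tasks
end

pool, tasks = spawn(2)

# A pool with no workers cannot serve requests yet, so wait for the first
# request to complete before handing out work.
wait(tasks[1])

squares = pmap(x -> x^2, pool, 1:10)

foreach(wait, tasks)   # optionally block until every requested worker is in
```

One caveat: Distributed may serialize concurrent addprocs calls internally, so this does not necessarily make workers start any faster. It only keeps the caller from blocking on the slowest one.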

beacon-biosignals / K8sClusterManagers.jl

quickly return WorkerPool, add workers later #83