Open kolia opened 3 years ago
I like the concept of dynamically adding workers into a pool but unfortunately we're restricted by how Distributed.jl works at the moment. Attempting to do this with the existing Distributed.jl would definitely be painful and probably end up being very fragile.
The concept, however, could warrant an iteration on Distributed.jl, which could start out as an external package. Some basic thoughts on what changes would be made to the existing Distributed.jl stdlib:
- A `Channel` instead of a `Vector`
- `addprocs` returns a `WorkerPool` or something similar where workers can be added dynamically as workers report in
- `@everywhere` calls are applied to workers in the `WorkerPool` as they report in. This means that workers can execute this logic at very different times. There may be implications of this I'm not considering.
- `pmap` or `@distributed` can partition work based upon the expected size of the `WorkerPool` and start running immediately with the current set of workers available

Are you sure we can't dynamically add workers to a pool? It seems like `pmap` just `take!`s workers from the pool instead of predistributing the work upfront.
(Though a redesign does sound good too!)
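To make the question above concrete, here is a minimal sketch, assuming local workers started with `addprocs(1)` and an explicitly constructed `WorkerPool`. It only shows that a worker pushed into a pool after construction is eligible for `pmap` jobs; it says nothing about how a cluster manager's `launch` would need to change.

```julia
using Distributed

# Start with a single local worker in an explicit pool.
pool = WorkerPool(addprocs(1))

# Later, another worker comes up and is pushed into the same pool.
late = first(addprocs(1))
push!(pool, late)

# pmap takes a worker from the pool for each job, so the late worker
# can receive work as well.
results = pmap(x -> (myid(), x^2), pool, 1:8)
squares = last.(results)

# Clean up the local workers started for this sketch.
rmprocs(workers())
```

The first element of each result tuple is the ID of the worker that ran the job, which gives a quick way to check whether the late worker was actually used.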
> Are you sure we can't dynamically add workers to a pool?
Here's the code that calls the `launch` method defined by the Distributed interface:
You're expected to add workers to the `launched` vector and there's no way to pass back a `WorkerPool`. There may be a way of doing an unblocked `launch` and calling `setup_launched_worker` within it, but I'd expect you to run into strange corner cases.
If someone wants to look into this further that would be great. They may find something I've missed or at worst validate my assessment.
I may have thought of a workaround to this problem. If we define an alternative `addprocs` function, maybe `spawn`, what we could do internally is call `addprocs` asynchronously, adding a single worker at a time. This should allow us to immediately return a mutable `Vector` of worker IDs or possibly even a `WorkerPool`. Depending on how the internals of `WorkerPool` and the functions that use it work, we may be able to allocate work to workers as they come in.
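A rough sketch of that workaround, under stated assumptions: `spawn_workers` is a hypothetical name (standing in for the `spawn` floated above), local `addprocs(1)` stands in for a real cluster manager, and the first worker is added before returning because taking from an empty pool may error rather than block. The remaining workers are added one at a time in a background task.

```julia
using Distributed

# Hypothetical helper, not part of Distributed.jl: request `n` workers but
# return as soon as the first one has connected.
function spawn_workers(n::Integer; kwargs...)
    pool = WorkerPool()
    # Add the first worker synchronously so the pool is usable on return.
    foreach(pid -> push!(pool, pid), addprocs(1; kwargs...))
    # Add the remaining workers one at a time in a background task.
    task = @async for _ in 2:n
        foreach(pid -> push!(pool, pid), addprocs(1; kwargs...))
    end
    return pool, task
end

pool, task = spawn_workers(3)

# Work can start while the other workers are still coming up.
results = pmap(x -> x + 1, pool, 1:20)

wait(task)                        # every requested worker has joined by now
nworkers_in_pool = length(pool)

# Clean up the local workers started for this sketch.
rmprocs(workers())
```

Returning the task alongside the pool lets the caller wait for, or check errors from, the background launches. A real implementation would also need to decide what happens when a later `addprocs` call fails.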
Getting many requested pods can trigger scale-up which takes time.
Currently this is handled with a timeout: any requested pods that do not stand up and connect by the timeout are dropped, and `launch` returns with however many pods have come up by the timeout. This can be awkward.

An alternative way to deal with spin-up slowness is to return a `WorkerPool` quickly, maybe as soon as there is one worker connected, and continue adding workers to that pool after returning.
To be practical, this method should have a worker initialization hook, so that workers only join the worker pool after `eval`ing some quoted code in `Main`, typically `using Packages` commands and other definitions.
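One way such a hook could look, as a sketch only: `add_initialized!` is a hypothetical helper, local `addprocs(1)` stands in for pod launch, and `LinearAlgebra` and `double` are placeholder initialization code. Each worker evaluates the quoted expression in its own `Main` and is pushed into the pool only once that has finished.

```julia
using Distributed

# Hypothetical helper: start one worker, evaluate `init` in its `Main`,
# and only then make the worker visible to users of `pool`.
function add_initialized!(pool::WorkerPool, init::Expr; kwargs...)
    pids = addprocs(1; kwargs...)
    for pid in pids
        # Blocks until the worker has finished evaluating `init`.
        remotecall_wait(Core.eval, pid, Main, init)
        push!(pool, pid)
    end
    return pids
end

# Placeholder initialization code: a package load plus a definition.
init = quote
    using LinearAlgebra
    double(x) = 2x
end

pool = WorkerPool()
pid = first(add_initialized!(pool, init))

# The definition from `init` is available on the worker once it is in the pool.
answer = remotecall_fetch(Core.eval, pid, Main, :(double(21)))

# Clean up the local worker started for this sketch.
rmprocs(pid)
```

Because the `push!` happens after `remotecall_wait` returns, anything that takes workers from the pool should never see a worker that is missing the definitions from `init`.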