JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution

Allow the scheduler to dynamically add/remove workers #149

Open jpsamaroo opened 4 years ago

jpsamaroo commented 4 years ago

As discussed in #147, it may benefit certain use cases to know when a worker is entirely unused by Dagger (specifically, when no data is cached on the worker), so that the worker can be removed from the Distributed pool.

jpsamaroo commented 3 years ago

Expanding on this, it would be great if the scheduler could dynamically add new workers via Distributed whenever it believes that having extra workers would help decrease total runtime of the currently-loaded DAG. The scheduler would call a user-defined function to add workers, which could call into a custom ClusterManager. We would want to be able to specify what kinds of nodes are available (what kinds of processors and how many per node) so that, for example, GPU-only tasks would always have GPUs available. This would be interesting for interactive uses on HPC clusters or for accessing cloud platforms.
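
Roughly, I'm imagining user-facing hooks along these lines (a minimal sketch only; request_workers/release_workers are placeholder names, not an existing Dagger API):

```julia
using Distributed

# Placeholder hooks: the scheduler would call a user-defined function like this
# when it decides extra workers could shorten the runtime of the current DAG.
request_workers(n::Integer) = addprocs(n)                             # plain local workers
request_workers(mgr::Distributed.ClusterManager; kw...) = addprocs(mgr; kw...)  # custom manager (e.g. Slurm, cloud)

# And the reverse: drop workers that no longer hold any cached Dagger data.
release_workers(ids) = rmprocs(ids...)
```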

@DrChainsaw I see from Discourse that this is probably something you'd be interested in.

DrChainsaw commented 3 years ago

This is something that would certainly come in handy for me! Let me know if you want me to test something out.

I do have a fear that I might have added some kind of seed of chaos here with #147 though. The day after #147 was merged there was this discussion in ClusterManagers and it seems like Distributed.jl is not designed for this type of dynamic usage (I felt a little bit like the intern who just pushed Integration Test Email #1 into production).

Or perhaps your proposed method will be more Distributed-friendly?

jpsamaroo commented 3 years ago

I think @vchuravy was pointing out that because Distributed was originally designed for HPC clusters where startup is all at once, not all cluster managers will handle this well, and that's to be expected. But that doesn't preclude Distributed from handling this properly for cluster managers that do support dynamic worker changes (such as the LocalManager and probably SSHManager). I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.
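
For example, with managers that do support it, adding and removing workers after startup already just works (sketch; the SSH host is made up):

```julia
using Distributed

pids = addprocs(2)                       # LocalManager: start two local workers
more = addprocs([("user@node1", 1)])     # SSHManager: one worker on node1 (hypothetical host)
rmprocs(pids...)                         # later: remove the now-idle local workers
```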

DrChainsaw commented 3 years ago

I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.

Alright, just wanted to point it out.

Oh, and in case the above was a polite request for a contribution, I'd be happy to help, but I feel a bit insecure w.r.t. how to make "it believes that having extra workers would help decrease total runtime".

Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

jpsamaroo commented 3 years ago

Oh, and in case the above was a polite request for a contribution, I'd be happy to help

Not necessarily; I'm happy to do it as well (and the logic for starting/stopping workers is pretty trivial, since you already added the handling for that in the scheduler).

but I feel a bit insecure w.r.t. how to make "it believes that having extra workers would help decrease total runtime". Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

Yeah, that's the key thing to be determined. This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this (say, trigger when it's been X seconds without any scheduling progress, or if the estimated time to DAG completion is greater than X minutes).
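
Something like this could serve as a default policy, assuming the scheduler can supply the time since it last made scheduling progress and an estimate of the remaining DAG runtime (a sketch only; the names are illustrative, not existing Dagger functions):

```julia
# User-overridable default trigger: ask for more workers if the scheduler has
# stalled, or if the DAG is estimated to take a long time to finish. Both
# inputs are in seconds and assumed to be provided by the scheduler.
function should_add_workers(secs_since_progress, est_secs_remaining;
                            stall_timeout = 30.0, completion_threshold = 600.0)
    return secs_since_progress > stall_timeout ||
           est_secs_remaining > completion_threshold
end

should_add_workers(45.0, 120.0)   # true: no progress for 45s > 30s
should_add_workers(5.0, 900.0)    # true: ~15 minutes remaining > 10 minutes
should_add_workers(5.0, 120.0)    # false: making progress and nearly done
```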

DrChainsaw commented 3 years ago

This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this

Sounds like a reasonable approach to me. Don't hesitate to ping if there is anything added in #147 which is confusing or if there is something to try out.

kolia commented 3 years ago

What is the story around initial loading of code on newly spun-up workers? Do you pass in a quote with all your using Package commands to be eval'd in the worker's Main?

jpsamaroo commented 3 years ago

Generally I use @everywhere using Package1, Package2, ..., which works fine. Distributed's code-loading story isn't great right now, but it's what we've got.
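
For newly added workers specifically, something like this works today (a minimal sketch; the package names are just placeholders):

```julia
using Distributed

new_ids = addprocs(2)                  # spin up two new workers

# Load the needed packages on all workers (master included):
@everywhere using Dagger, Statistics

# Or restrict loading to just the newly added workers:
@everywhere new_ids using Dagger, Statistics
```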