beacon-biosignals / K8sClusterManagers.jl

A Julia cluster manager for Kubernetes

pattern to `rmprocs` when there's no more work to do #87

Open ericphanson opened 3 years ago

ericphanson commented 3 years ago

This isn't strictly a K8sClusterManagers.jl issue, but @omus pointed me here :).

I was running hyperparameter optimization on a model using `@phyperopt` from Hyperopt.jl with `pmap=Parallelism.robust_pmap` from Parallelism.jl. I would spin up the desired number of workers with `addprocs`, then essentially call `pmap` via these abstractions, and that's it. When the `pmap` is done, the manager process writes out a summary and exits, and all the workers are released.

I wanted to train 20 models this way quickly, so I did this with 20 workers and left them to train. However, some finished much faster than others, and those workers were left idling. Since this runs on k8s, killing the idle workers would have let the cluster scale in and saved a lot of resources.

It would be great to have something like `pmap` that could automatically remove workers when they are no longer needed.
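To make the request concrete, here is a rough sketch of what such a function could look like, built only on the `Distributed` standard library. The name `pmap_then_rmprocs` and its `pids` keyword are hypothetical (nothing like this exists in Distributed, Parallelism.jl, or K8sClusterManagers.jl as far as I know), and it skips the retry and error handling that `robust_pmap` provides.

```julia
using Distributed

# Hypothetical helper: behaves like a simple `pmap`, but each worker is
# removed with `rmprocs` as soon as the shared job queue is empty.
function pmap_then_rmprocs(f, inputs; pids = workers())
    inputs = collect(inputs)
    results = Vector{Any}(undef, length(inputs))

    # Buffered queue of job indices. Closing it signals "no more work";
    # items already in the buffer can still be taken.
    jobs = Channel{Int}(length(inputs))
    foreach(i -> put!(jobs, i), eachindex(inputs))
    close(jobs)

    @sync for pid in pids
        @async begin
            # Iterating a closed channel drains what is left, then stops.
            for i in jobs
                results[i] = remotecall_fetch(f, pid, inputs[i])
            end
            # Nothing left for this worker: release it right away
            # (never try to remove the current process).
            pid == myid() || rmprocs(pid)
        end
    end
    return results
end
```

With real workers this would be used as `addprocs(20); pmap_then_rmprocs(train, configs)`, where `train` and `configs` are placeholders and `train` is defined on all workers (e.g. via `@everywhere`). Each worker is removed when it finds the queue empty, rather than when the slowest job finishes.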

ericphanson commented 3 years ago

Thinking about this slightly more, I think a nice "inversion of control" here is that the ideal `pmap` could return workers to the pool (in fact, I think it already does), and the pool could decide to remove idle workers. (Perhaps the pool would wait a minute or two and then, if they are still idle, `rmprocs` them.)
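A minimal sketch of that pool-side idea, assuming a standalone pool type rather than the existing `WorkerPool`: `IdleReapingPool`, `checkin!`, `checkout!`, and `reap!` are all hypothetical names invented for illustration. The pool records when each worker was returned, and `reap!` removes any worker that has been idle for at least `timeout` seconds. The `now` and `remove` keywords exist only so the logic can be exercised without real workers.

```julia
using Distributed

# Hypothetical pool type for illustration; not part of Distributed,
# Parallelism.jl, or K8sClusterManagers.jl.
struct IdleReapingPool
    idle::Dict{Int,Float64}   # pid => time() at which the worker became idle
    lock::ReentrantLock
end
IdleReapingPool() = IdleReapingPool(Dict{Int,Float64}(), ReentrantLock())

# Return a worker to the pool and note when it became idle.
function checkin!(p::IdleReapingPool, pid::Integer; now = time())
    lock(p.lock) do
        p.idle[pid] = now
    end
    return p
end

# Hand out some idle worker, or `nothing` if none is available.
function checkout!(p::IdleReapingPool)
    lock(p.lock) do
        isempty(p.idle) && return nothing
        pid = first(keys(p.idle))
        delete!(p.idle, pid)
        return pid
    end
end

# Remove every worker that has been idle for at least `timeout` seconds.
# Stale workers leave the pool under the lock, so they cannot be checked
# out while `remove` (by default `rmprocs`) is shutting them down.
function reap!(p::IdleReapingPool; timeout = 120.0, now = time(), remove = rmprocs)
    stale = lock(p.lock) do
        pids = [pid for (pid, t) in p.idle if now - t >= timeout]
        foreach(pid -> delete!(p.idle, pid), pids)
        pids
    end
    isempty(stale) || remove(stale)
    return stale
end
```

The manager process could then run something like `Timer(_ -> reap!(pool), 60; interval = 60)` to check once a minute. A real version would also need a blocking `take!`-style checkout so it could be plugged into `pmap`, which this sketch leaves out.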