jrevels opened 3 years ago
I can see this being useful to better control pod scaling if other ways are not adequate, but as far as resilience / migrating workers from failed nodes goes, the Distributed API isn't really set up to make use of workers being dynamically respun in the middle of, say, a `pmap`, AFAICT. I'm guessing that to make use of worker resilience we'd need to write downstream code significantly differently, with resilience in mind all the way down. Although that could just look like a custom version of `pmap` and similar, which is not that bad I guess.
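(For what it's worth, some of that might already be covered by `pmap`'s stock retry hooks; a minimal sketch, nothing K8s-specific:)

```julia
using Distributed

addprocs(4)  # stand-in for workers launched by a cluster manager

# If an element's task errors (e.g. because its worker's pod went away), retry
# it up to 3 times; each retry is re-submitted to the pool, so it lands on
# whichever worker is available next.
squares = pmap(1:100; retry_delays = ExponentialBackOff(n = 3)) do i
    i^2  # stand-in for real work
end
```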
Right, doing this would not resolve any "application layer" bottlenecks to resilience, but it would improve things at the "orchestration layer", which is at least a prerequisite.
This is a bit of an aside perhaps, but

> the Distributed API isn't really set up to make use of workers being dynamically respun in the middle of, say, a `pmap`, AFAICT

I think this might actually be OK: looking at the code, `pmap` works on an `(Abstract)WorkerPool`, and it `take!`s a worker from the pool when it needs one. So I think you can dynamically add workers to the pool and it will grab them too.
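e.g. something along these lines (just a sketch to convince myself; nothing here is K8s-specific):

```julia
using Distributed

addprocs(2)                       # stand-in for the initial worker pods
pool = WorkerPool(workers())

# Kick off a pmap over an explicit pool; it take!s a worker per element.
squares = @async pmap(pool, 1:200) do i
    sleep(0.05)                   # stand-in for real work
    i^2
end

# Later, e.g. once the orchestrator has respun or added pods, push the new
# worker ids into the same pool; the in-flight pmap will start handing
# remaining elements to them as well.
for id in addprocs(2)
    push!(pool, id)
end

fetch(squares)
```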
IIUC, K8sClusterManagers currently `kubectl create`s individual pods whenever a worker is added. I think it would make more sense from a K8s perspective to associate an actual pod controller with the "Julia cluster instance", and then update the underlying set of pods that back the Julia cluster instance by `kubectl apply`ing updates to that controller. This is philosophically more aligned with the way sets of pods are intended to be orchestrated in K8s land AFAIU, and would hopefully enable some configuration/resilience possibilities at the level of the whole cluster instance and not just at the pod level (e.g. tuning pod scaling behaviors for the cluster instance, migrating workers from failed nodes, etc.).

IIRC the driver Julia process is already backed by a `Job`, so maybe that'd be sufficient? I think `StatefulSet` is worth considering too. We should tour the built-in controllers and see which ones might make the most sense, especially w.r.t. being able to tell the controller to make additional pods available w/o interfering w/ existing ones.

This would probably be massive overkill, but if none of the built-in K8s controllers are sufficient (which I guess is a possibility), you could even imagine a custom K8s operator implemented specifically for this purpose (`JuliaClusterInstance`).
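To make the controller idea a bit more concrete: growing or shrinking the worker set would then mostly reduce to patching the controller's desired replica count, roughly along these lines (a purely hypothetical sketch; `scale_cluster_instance!` and the resource names are made up, not existing K8sClusterManagers API):

```julia
# Assumes the worker pods are backed by a StatefulSet named after the cluster instance.
function scale_cluster_instance!(name::AbstractString, replicas::Integer;
                                 namespace::AbstractString = "default")
    # Patch the controller's desired replica count instead of creating pods one by one.
    run(`kubectl scale statefulset/$name --replicas=$replicas --namespace=$namespace`)
    # Block until the controller reports the rollout is complete.
    run(`kubectl rollout status statefulset/$name --namespace=$namespace`)
end

# e.g. scale_cluster_instance!("julia-cluster-workers", 8)
```

i.e. worker add/remove requests from the Julia side would translate into replica-count updates on the controller rather than per-pod `kubectl create`s.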