Open Fogapod opened 4 months ago
Permanent clusters are far less common than ephemeral clusters, so I'm not surprised this hasn't come up before.
I would be happy to propagate other changes if that's valuable to you. Do you have any interest in raising a PR to handle this?
Alternatively you could just delete and recreate the resource?
Dropping `daskcluster.kubernetes.dask.org/dask-primary` is what I do now. I am concerned about graceful shutdown because the scheduler and workers might have pending tasks. Is there a way to do this?
When you delete the cluster all the Pods will be sent a `SIGTERM`. At this point the Dask scheduler and workers should gracefully shut down. If they take too long to shut down then Kubernetes will send a `SIGKILL`, but this timeout is configurable via `terminationGracePeriodSeconds`.
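For reference, a minimal sketch of where that knob lives. `terminationGracePeriodSeconds` is a standard Kubernetes pod-spec field; the surrounding `DaskCluster` layout here is abbreviated and the names/values are illustrative:

```yaml
# Illustrative fragment, not a complete manifest.
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: dask-primary
spec:
  worker:
    replicas: 2
    spec:
      # Give workers up to 10 minutes to finish in-flight tasks
      # before Kubernetes escalates SIGTERM to SIGKILL.
      terminationGracePeriodSeconds: 600
      containers:
        - name: worker
          image: ghcr.io/dask/dask:2024.5.2-py3.11
```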
In a long-lived deployment I expect you have some application that runs work on the Dask cluster. If the Dask cluster restarts without completing a computation then it should be the job of the application to resubmit the work.
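That resubmission logic can be as small as a retry wrapper in the application. A minimal sketch, where the `submit` callable and the exception types are assumptions standing in for whatever actually runs the computation (e.g. building a graph and calling `.compute()` against the scheduler), not part of dask-kubernetes:

```python
import time

def run_with_resubmit(submit, max_attempts=3, backoff=1.0):
    """Run `submit()` and resubmit it if the cluster restarts mid-flight.

    `submit` is a hypothetical callable that runs the computation and
    returns its result; it must raise (here: OSError/TimeoutError) when
    the connection to the scheduler is lost.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except (OSError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts, surface the failure
            time.sleep(backoff * attempt)  # crude linear backoff

# Usage: simulate one failure (cluster restarting), then success.
calls = {"n": 0}
def flaky_submit():
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("scheduler connection lost")
    return 42

print(run_with_resubmit(flaky_submit, backoff=0.01))  # -> 42
```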
I have a permanent Dask cluster in Kubernetes. The current operator ignores all changes to the manifest. There was an issue about supporting spec updates, but it got closed as resolved after implementing `scale` field support: https://github.com/dask/dask-kubernetes/issues/636.
The only fields that cause changes to the deployment after applying an updated manifest are `spec.worker.replicas` and the `DaskAutoscaler` min/max. Is it possible to support other fields, specifically `image`, `args`, `env`, and `volumes`/`volumeMounts`? If not, what would be the optimal way to gracefully shut down and update the cluster?

Cluster manifest (mostly copypasted from example):
Operator version:
`helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name --version 2024.5.0 dask-kubernetes-operator`
Dask version: custom-built image that uses the following deps:
Although it's the same with the `2024.5.2-py3.11` image.