dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

Cleanup k8s DaskCluster resources by introducing a `ttlSecondsAfterFinished` field respected by the controller? #760

Open consideRatio opened 9 months ago

consideRatio commented 9 months ago

When a k8s DaskCluster resource enters a "Stopped" phase, for example after being idle-culled by dask-gateway's k8s controller, the resource itself is still retained:

```yaml
apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade
```
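
Until something like the proposal below exists, cleanup presumably has to be done out of band. Here is a minimal sketch using the official `kubernetes` Python client to delete all Stopped DaskCluster resources; the namespace is an assumption:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
namespace = "dask-gateway"  # assumed namespace for the DaskCluster resources

# List all DaskCluster custom resources in the namespace
clusters = api.list_namespaced_custom_object(
    group="gateway.dask.org",
    version="v1alpha1",
    namespace=namespace,
    plural="daskclusters",
)

# Delete those whose status.phase is "Stopped"
for c in clusters["items"]:
    if c.get("status", {}).get("phase") == "Stopped":
        api.delete_namespaced_custom_object(
            group="gateway.dask.org",
            version="v1alpha1",
            namespace=namespace,
            plural="daskclusters",
            name=c["metadata"]["name"],
        )
```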

Should a stopped DaskCluster resource get cleaned up directly, or after some time?

This is similar to a k8s Job resource creating a Pod to do some work: the Pod and the Job are then left in a "Completed" state for a while. The Kubernetes docs have a section about that:

> When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

A CronJob, which is a k8s resource that creates Job resources, can clean up the Job resources it creates:

> Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
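
For reference, those history limits are plain fields on the CronJob spec; the values in this minimal manifest are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example
spec:
  schedule: "*/10 * * * *"
  successfulJobsHistoryLimit: 3  # keep at most 3 completed Jobs around
  failedJobsHistoryLimit: 1      # keep at most 1 failed Job around
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: work
              image: busybox
              command: ["sh", "-c", "echo done"]
```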

It appears that in k8s 1.23+ (by now probably what most k8s clusters run), there is a built-in controller that reads a k8s Job resource's `ttlSecondsAfterFinished` field. I think it could make sense for dask-gateway's k8s controller to respect such configuration for DaskCluster resources as well.
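
For a built-in Job, that looks like this (`ttlSecondsAfterFinished` is a standard `batch/v1` field, stable since k8s 1.23; the values here are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example
spec:
  ttlSecondsAfterFinished: 300  # delete the Job ~5 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: work
          image: busybox
          command: ["sh", "-c", "echo done"]
```

A DaskCluster equivalent could look like the sketch below. Note that this spec field is hypothetical, it does not exist in the `gateway.dask.org/v1alpha1` schema today:

```yaml
apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
metadata:
  name: example
spec:
  ttlSecondsAfterFinished: 3600  # hypothetical field proposed by this issue
  # ...
```

And a minimal sketch of the expiry check such a controller could perform, reusing the existing `status.phase` and `status.completionTime` fields shown above; the function name and the spec field location are assumptions:

```python
from datetime import datetime, timedelta, timezone

def is_expired(cluster: dict) -> bool:
    """Return True if a Stopped DaskCluster has outlived its TTL.

    Assumes a hypothetical spec.ttlSecondsAfterFinished field alongside
    the existing status.phase and status.completionTime fields.
    """
    status = cluster.get("status", {})
    ttl = cluster.get("spec", {}).get("ttlSecondsAfterFinished")
    if ttl is None or status.get("phase") != "Stopped":
        return False
    # completionTime uses the RFC 3339 format seen in the status above
    completed = datetime.strptime(
        status["completionTime"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) >= completed + timedelta(seconds=ttl)
```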