dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

Detail the log message when shutting down a cluster due to `idle_timeout` #759

Closed: consideRatio closed this issue 11 months ago

consideRatio commented 11 months ago

I think it's the controller's job to shut down a DaskCluster that has reached its `idle_timeout`. At least, that is when I see the following in the logs:

[I 2023-10-25 11:43:39.621 KubeController] Shutting down prod.b3a990d302d84720aae27404f6153ade

I think this should be extended to mention that the cluster is shut down due to `idle_timeout`, similar to how, for example, jupyter_server logs when it stops idle kernels:

```python
self.log.warning(
    "Culling '%s' kernel '%s' (%s) with %d connections due to %s seconds of inactivity.",
    kernel.execution_state,
    kernel.kernel_name,
    kernel_id,
    connections,
    idle_duration,
)
```

In practice this should maybe be a separate log message, in addition to the existing "Shutting down" message, but one way or another it should be clarified why the cluster was stopped, if possible.
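
For illustration, a message along these lines in the controller would already help. This is just a sketch with made-up names (`log_cluster_shutdown`, the `reason` argument), not the actual dask-gateway code:

```python
import logging

logger = logging.getLogger("KubeController")

# Hypothetical sketch: names are made up for illustration, this is not the
# actual dask-gateway controller code.
def log_cluster_shutdown(cluster_name: str, reason: str = "unknown reason") -> None:
    # Today the controller only logs "Shutting down <name>"; including a reason
    # would make it obvious when idle_timeout was the trigger.
    logger.info(
        "Shutting down %s due to %s",
        cluster_name,
        reason,  # e.g. "idle_timeout exceeded (no activity for 3600s)"
    )
```

Called with something like `reason="idle_timeout"`, the existing log line would become self-explanatory.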

udeet27 commented 11 months ago

Hi! I see that a message is already displayed here when `idle_timeout` is exceeded. Do I need to implement something similar in some other file? Any guidance would be much appreciated, as I'm not deeply familiar with the codebase.

consideRatio commented 11 months ago

@udeet27 I don't have a great overview of the code base either, so I had to dig in myself to help. Doing so, I was left uncertain what to do, because this can't be fixed easily. In brief, there are three pieces involved: the controller, the dask-gateway-server, and the dask scheduler. The `idle_timeout` is logged by the scheduler, which then asks the dask-gateway-server to shut the cluster down; the dask-gateway-server in turn makes the controller do the job. But no information is passed from the scheduler about why the cluster is to be terminated, so there is no way for the dask-gateway-server to convey that to the controller either.


Looking through this search, I found this:

https://github.com/dask/dask-gateway/blob/5e5005afc9968107bc9b949867dc785eee4cccda/dask-gateway-server/dask_gateway_server/backends/kubernetes/controller.py#L1024-L1044

Okay, hmm, it seems this is how things work:

When a dask cluster is created, it's the cluster's own scheduler that is responsible for shutting the cluster down. So it's the scheduler that logs that it is terminating the cluster it is part of. More concretely, the flow is:


  1. A dask-gateway client somewhere asks the dask-gateway server to start a DaskCluster.
  2. A dask cluster is created using a `KubeBackend`, which creates a k8s `DaskCluster` resource that is managed by a "controller" watching `DaskCluster` resources.
  3. The controller sees the `DaskCluster` resource and creates a scheduler for the dask cluster.
  4. The scheduler monitors its own activity and, when it has idled for too long, asks the dask-gateway server to terminate the cluster it manages. When it does, it doesn't pass a reason or anything similar for terminating (see the sketch below).
  5. The dask-gateway server receives the request to terminate the cluster, but doesn't know it is due to inactivity. It makes the `KubeBackend` terminate the cluster, which it does by updating the `DaskCluster` k8s resource to "Stopped", I think.
  6. The controller sees the status update and shuts down the scheduler and workers for the dask cluster.
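
To make steps 4-5 concrete, here is a rough, purely illustrative sketch of the scheduler-side idle check. The function and the `gateway_client.stop_cluster` call are made-up names, not the actual dask-gateway code:

```python
import time

# Purely illustrative: not the actual dask-gateway code, just a sketch of the
# information flow in steps 4-5 above.
def maybe_shutdown_idle_cluster(gateway_client, cluster_name, last_activity, idle_timeout):
    """Runs next to the scheduler and stops the cluster once it has idled too long."""
    idle_for = time.time() - last_activity
    if idle_for > idle_timeout:
        # The scheduler knows why it wants the cluster stopped, but the stop
        # request it sends carries no such field, so the gateway server, and in
        # turn the controller, only see a generic "stop this cluster".
        gateway_client.stop_cluster(cluster_name)
```

So the reason is known at the top of the chain, but it is dropped before the controller gets to log anything.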

consideRatio commented 11 months ago

I see no action point in this issue that seems reasonable to go for any more. The fix would be to provide a "reason" and propagate it from the scheduler, but that may be a bit too complicated and require touching a lot of things, so I don't think it's worth doing.
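
For the record, propagating a reason would conceptually mean adding a field to the stop request and logging it at each hop, roughly like below. Again, all names here are hypothetical, not the actual dask-gateway API:

```python
import time

# Hypothetical sketch only, not the actual dask-gateway API. The reason would
# need to be threaded through scheduler -> gateway server -> controller.
def maybe_shutdown_idle_cluster(gateway_client, cluster_name, last_activity, idle_timeout):
    idle_for = time.time() - last_activity
    if idle_for > idle_timeout:
        gateway_client.stop_cluster(
            cluster_name,
            reason=f"idle_timeout exceeded ({idle_for:.0f}s of inactivity)",
        )
```

Every component in between would have to accept and forward that field, which is why it ends up touching a lot of code.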

I'll go for a close on this issue @udeet27, THANK YOU for initiating an investigation!! I'm sorry it turned out not to be resolvable =/

udeet27 commented 11 months ago

Ohh wow, it's a lot more complicated than I initially anticipated. Thanks for the detailed explanation! I'll look into the other issues and see if I can contribute to them.