Closed consideRatio closed 11 months ago
Hi, I see here that a message is already displayed when idle_timeout is exceeded. Do I need to implement something similar in some other file? Any guidance would be much appreciated as I'm not deeply familiar with the codebase.
@udeet27 I don't have a good overview of the codebase either, so I had to dig in myself to help. Doing so, I was left uncertain what to do, because this can't be fixed easily. In brief, there are three parts: the controller, the dask-gateway-server, and the dask-scheduler. The idle_timeout was logged by the scheduler, which then communicated a shutdown to the dask-gateway-server, which in turn made the controller do the job. But no information was passed from the scheduler about why the cluster was to be terminated, so there is no way for the dask-gateway-server to convey that to the controller either.
Looking in this search I found this:
Okay hmm, it seems that this is how things work:
If a dask-cluster is created, it's the dask-cluster's scheduler that is responsible for shutting down the cluster. So, the scheduler is logging that it is terminating the cluster it's part of, and as part of that.
I see no action point that seems reasonable to go for any more in this issue. The option would be to provide a "reason" and propagate it from the scheduler, but that may be a bit too complicated and require touching a lot of things, so I don't think it's worth doing.
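To make concrete what propagating a "reason" would involve, here is a rough sketch only. It is not dask-gateway code: `ShutdownRequest`, `scheduler_request_shutdown`, `gateway_forward`, and `controller_stop_message` are all hypothetical names, and the message format is made up for illustration. It only models the three hops described above (scheduler, dask-gateway-server, controller) and shows that each hop would need to carry the extra field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShutdownRequest:
    """Hypothetical message a scheduler could send when it wants its cluster stopped."""
    cluster_name: str
    reason: Optional[str] = None  # None models a request that carries no reason


def scheduler_request_shutdown(
    cluster_name: str, idle_seconds: float, idle_timeout: float
) -> Optional[ShutdownRequest]:
    """Hop 1 (scheduler): decide to shut down and attach the reason."""
    if idle_timeout > 0 and idle_seconds >= idle_timeout:
        return ShutdownRequest(cluster_name, reason="idle_timeout")
    return None


def gateway_forward(request: ShutdownRequest) -> dict:
    """Hop 2 (stand-in for the gateway server): relay the request, keeping the reason."""
    return {"cluster": request.cluster_name, "reason": request.reason}


def controller_stop_message(payload: dict) -> str:
    """Hop 3 (stand-in for the controller): build a log line that includes the reason, if any."""
    reason = payload.get("reason")
    if reason:
        return f"Stopping cluster {payload['cluster']} (reason: {reason})"
    return f"Stopping cluster {payload['cluster']}"


request = scheduler_request_shutdown("example-cluster", idle_seconds=120, idle_timeout=60)
if request is not None:
    print(controller_stop_message(gateway_forward(request)))
```

Even in this toy form, every hop has to change its interface, which is the "touching a lot of things" concern.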
I'll go for a close on this issue @udeet27, THANK YOU for initiating an investigation!! I'm sorry it was an issue that didn't turn out resolvable =/
Ohh wow. It's a lot more complicated than I initially anticipated. Thanks for the detailed explanation. I'll look into the other issues and see if I can contribute to them.
I think it's the controller's job to shut down a DaskCluster that has reached its idle_timeout. At least I see the following in the logs then:
I think this should be detailed to mention that the cluster is shut down due to idle_timeout, similar to how, for example, jupyter_server stops idle kernels:
I think practically this maybe should be a separate log message, beyond the "Stopping" message, but I think it should be clarified somehow why the cluster was stopped if possible.