dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

Should the dask-gateway helm chart disable the worker pod's nanny? #734

Closed by consideRatio 9 months ago

consideRatio commented 1 year ago

https://distributed.dask.org/en/stable/killed.html#killed-by-nanny documents that it can make sense to disable the nanny. I wonder whether the dask-gateway helm chart should do that by default, but I don't know the details well enough to decide. The relevant passage from the docs:

If you have an external system for watching memory usage provided by your cluster infrastructure (HPC, kubernetes, etc.), then it may be reasonable to turn off this memory limit. Indeed, in these cases, restarts might be handled for you too, so you could do without the nanny at all (--no-nanny CLI option or configuration equivalent).
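For reference, here is a minimal sketch of what the `--no-nanny` option from the quote looks like on a worker command line. It only assembles an argument list for the `dask worker` CLI and does not start anything. The helper name `worker_command` and the scheduler address are made up for illustration, and this is not how the dask-gateway helm chart builds its worker command.

```python
import shlex


def worker_command(scheduler_address, use_nanny=True, memory_limit=None):
    """Build an argv list for the `dask worker` CLI (hypothetical helper).

    Only the documented --nanny/--no-nanny and --memory-limit options are used.
    """
    cmd = ["dask", "worker", scheduler_address]
    # --no-nanny runs the worker without a supervising nanny process.
    cmd.append("--nanny" if use_nanny else "--no-nanny")
    if memory_limit is not None:
        cmd += ["--memory-limit", str(memory_limit)]
    return cmd


# Hypothetical scheduler address, for illustration only.
argv = worker_command("tcp://scheduler:8786", use_nanny=False, memory_limit="4GiB")
print(shlex.join(argv))
```

Without the nanny, nothing inside the container restarts a worker that dies, so restarts would have to come from the infrastructure.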

I know that a k8s pod has a restartPolicy that defaults to Always, meaning that a container that crashes is restarted by default. Does that make the nanny unnecessary?
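To make the question concrete, here is a rough sketch of a generic worker pod where Kubernetes, rather than the nanny, enforces the memory limit and handles restarts. The function name, pod name, image, and scheduler address are assumptions for illustration. This is a plain pod manifest built as a Python dict, not the template the dask-gateway helm chart actually renders, and the chart may set a different restartPolicy.

```python
import json


def worker_pod_manifest(name, image, memory_limit):
    """Return a generic Kubernetes Pod manifest for a nanny-less worker.

    Hypothetical helper: field names follow the core v1 Pod API, while the
    values are placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Always is the default for a plain pod; written out for clarity.
            "restartPolicy": "Always",
            "containers": [
                {
                    "name": "dask-worker",
                    "image": image,
                    # Placeholder scheduler address.
                    "args": ["dask", "worker", "tcp://scheduler:8786", "--no-nanny"],
                    # A container that exceeds this limit is OOM-killed.
                    "resources": {"limits": {"memory": memory_limit}},
                }
            ],
        },
    }


manifest = worker_pod_manifest("dask-worker-example", "ghcr.io/dask/dask:latest", "4Gi")
print(json.dumps(manifest, indent=2))
```

In this setup the kubelet restarts the whole container after a crash or OOM kill, whereas the nanny restarts only the worker process inside a container that keeps running.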

jacobtomlinson commented 1 year ago

I would only do this if there is a clear benefit to disabling the nanny or some problem that it is causing. For example, on some HPC systems the nanny causes problems because each job can only start one process.