dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Set worker memory limits at OS level? #6177

Open gjoseph92 opened 2 years ago

gjoseph92 commented 2 years ago

In https://github.com/dask/distributed/issues/6110#issuecomment-1105837219, we found that workers were running themselves out of memory to the point where the machines became unresponsive. Because the memory limit in the Nanny is implemented at the application level, and in a periodic callback no less, there's nothing stopping workers from successfully allocating more memory than they're allowed to, as long as the Nanny doesn't catch them.

And as it turns out, if you allocate enough memory that you start heavily swapping (my guess, unconfirmed), but not so much that you get OOMkilled by the OS, it seems that you can effectively lock up the Nanny (and worker) Python processes, so the bad worker never gets caught, and everything just hangs. Memory limits are an important failsafe for stability, to un-stick this sort of situation.

A less brittle solution than this periodic callback might be to use the OS to enforce hard limits.

The logical approach would just be resource.setrlimit(resource.RLIMIT_RSS, memory_limit_in_bytes). However, it turns out that RLIMIT_RSS is not supported on newer Linux kernels. The solution nowadays appears to be cgroups.
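For illustration, here's roughly what that rlimit approach would look like (the 4 GiB value is made up; on modern Linux the kernel accepts but ignores RLIMIT_RSS, which is exactly the problem):

```python
# Sketch only: the "logical approach" via resource.setrlimit.
# On modern Linux kernels RLIMIT_RSS is accepted but has no effect.
import resource

memory_limit_in_bytes = 4 * 2**30  # hypothetical 4 GiB limit

_, hard = resource.getrlimit(resource.RLIMIT_RSS)
resource.setrlimit(resource.RLIMIT_RSS, (memory_limit_in_bytes, hard))
```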

Also relevant: https://jvns.ca/blog/2017/02/17/mystery-swap, https://unix.stackexchange.com/a/621576.

We could use memory.memsw.limit_in_bytes to limit total RAM+swap usage, or memory.limit_in_bytes to limit just RAM usage, or some smart combo of both. (Allowing a little swap might still be good for unmanaged memory.)
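To make this concrete, a rough sketch of what the cgroup v1 version could look like (the cgroup name is hypothetical, and creating it needs write access to the cgroup filesystem, typically root):

```python
# Hedged sketch of a cgroup v1 hard cap; not an existing distributed API.
import os
from pathlib import Path

memory_limit_in_bytes = 4 * 2**30  # hypothetical 4 GiB limit

cg = Path("/sys/fs/cgroup/memory/dask-worker")  # hypothetical cgroup name
cg.mkdir(parents=True, exist_ok=True)

# Cap RAM only; writing memory.memsw.limit_in_bytes instead would cap RAM + swap.
(cg / "memory.limit_in_bytes").write_text(str(memory_limit_in_bytes))

# Move the worker process into the new cgroup so the limit applies to it.
(cg / "cgroup.procs").write_text(str(os.getpid()))
```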

Obviously, this whole discussion is Linux-specific. I haven't found (or tried that hard to find) macOS and Windows approaches—I think there might be something for Windows, sounds like probably not for macOS. We can always keep the current periodic callback behavior around for them, though.

cc @fjetter

fjetter commented 2 years ago

I'm inclined to say this should be the responsibility of the deployment. Given the generic level this library usually operates at, I consider this low level configuration rather hard to maintain.

gjoseph92 commented 2 years ago

I consider this low level configuration rather hard to maintain

cgroups should be a pretty stable API at this point. If we were just talking about resource.setrlimit, would you feel the same way? Is it just that cgroups sound too low-level/complex? Because it sounds like cgroups is just the modern equivalent of ulimit -m / resource.setrlimit.

I'm inclined to say this should be the responsibility of the deployment

Fair, but if it's the deployment's responsibility, then I think we shouldn't have the memory limit feature at all in the nanny. The way it's implemented isn't reliable enough.

To me, it's both simple to implement and quite useful, so I think it's reasonable for it to be the nanny's responsibility. But I'd be fine with removing the limit too.

crusaderky commented 2 years ago

+1. I like the idea of using the OS where possible and only falling back on the nanny's polling system where that isn't available.

fjetter commented 2 years ago

I am open to this if it actually solves a problem. I am used to having a resource/cluster manager around that kills misbehaving pods, so I am a bit biased. If this would be helpful for most users, I am open to it, but I would like to get some feedback from people who are actually working with deployments.

@dchudz @jacobtomlinson any opinions? Would this be helpful? Would you prefer implementing this as part of the deployment or should dask do this?

Just a bunch of questions in the meantime

gjoseph92 commented 2 years ago

Update here: unsurprisingly, you can't normally use cgroups if you're already inside a Docker container.

I think we should still try to do it in dask (for non-containerized workloads, it would be helpful) and fall back on polling if /sys/fs/cgroup/<cgroup-name> isn't writeable. But as @fjetter said, if the deployment system is using containers, it will need to either set the limit itself, or give dask permission to access cgroups.

https://stackoverflow.com/questions/32534203/mounting-cgroups-inside-a-docker-container https://groups.google.com/g/kubernetes-dev/c/TBNzAxNXPOA (I also tested this, SSHing into a coiled worker.)
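The fallback logic I have in mind is roughly this (the helper name is made up, and a real check would need to distinguish cgroup v1 from v2):

```python
import os

def can_manage_cgroups(path="/sys/fs/cgroup/memory"):
    # In an unprivileged container the cgroup filesystem is usually mounted
    # read-only, so this returns False and we keep the polling behaviour.
    return os.path.isdir(path) and os.access(path, os.W_OK)

if can_manage_cgroups():
    ...  # create a cgroup and write the limit, as sketched above
else:
    ...  # fall back to the Nanny's existing periodic memory check
```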

jacobtomlinson commented 2 years ago

This feels like a duplicate of #4558, and the case you describe could be useful there. I generally agree with @fjetter that this should be the responsibility of the deployment tooling or the OOM killer. I don't think Dask itself should be tinkering with cgroups, especially if it requires elevated privileges in a container environment.

I wonder if there is an alternative where we could just get things to trigger the OOM as expected.

gjoseph92 commented 2 years ago

Forgot to write this, but for posterity: I get the sentiment that deployment tooling should be responsible for setting memory limits if desired, but that's not quite the model that dask offers.

The Nanny is, in effect, a deployment tool offered by dask. Its job is to manage a Worker subprocess, kill it if it uses too much memory, and restart it if it dies. So I'd argue it's entirely within scope for the Nanny to enforce memory limits at a system level, since it's a deployment tool.

  1. If you're using a Nanny, it must be the one to set the system memory limit. If you set the limit on the outer Nanny process, instead of the inner worker process, then when the worker uses too much memory, the whole Nanny will get killed. Your worker won't be restarted any more.
    1. Obviously you might be using a deployment tool like k8s pods, which restarts the process automatically. But in that case, you didn't need a Nanny at all, and shouldn't be using one. So that's not relevant here. We're only talking about Nanny-based deployments where distributed.worker.memory.terminate is set.
  2. Dask offers the option to configure a memory kill threshold right now. If we offer the option, I just think it should be implemented in a way that actually works. If we don't want to implement it in a way that actually works (cgroups), we should probably not offer it at all, and instead say in the docs that we recommend using a deployment tool like XXX to enforce memory limits and restart your worker processes instead of a Nanny.

I wonder if there is an alternative where we could just get things to trigger the OOM as expected

That's worth some research. Based on my understanding of the problem, though (https://github.com/dask/distributed/issues/6110#issuecomment-1109052228), it would basically involve disabling the disk cache, which is of course not acceptable. My guess is that anything we could do here would be more intrusive and fiddly than using cgroups.

crusaderky commented 2 years ago

If we don't want to implement it in a way that actually works (cgroups)

The implication that polling "does not actually work" feels very drastic to me. It works fine if there is a (small) swap file mounted. It breaks specifically when the OS starts deallocating the executables' memory, which only happens after the swap file is full.

crusaderky commented 2 years ago

I can think of ways to reduce the problem. For example:

- We could have a dynamic polling interval which automatically drops to as little as 1ms when you approach the 95% threshold.
- We could be a lot more conservative in setting the automatic memory limit. E.g. we can easily detect with psutil if there's a swap file and take an educated guess that
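For what it's worth, the psutil check is cheap; a minimal sketch (what limit to derive from it is left open here):

```python
import psutil

swap = psutil.swap_memory()
if swap.total == 0:
    # No swap configured: under memory pressure the OS may start evicting
    # executable pages instead, so a larger safety margin below the
    # physical memory limit would be prudent.
    ...
```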

gjoseph92 commented 2 years ago

A couple of interesting things I've found on the topic. Not solutions dask would implement, but useful for learning more.

It could be worth playing with various /proc/sys/vm settings, like disabling overcommitting. Dask would not set these directly, but they could be things we'd recommend in the docs, and that deployment systems like dask-cloudprovider, dask-k8s, coiled, etc. might be able to set.
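For reference, the overcommit knob lives at /proc/sys/vm/overcommit_memory; a small illustrative snippet for inspecting it (changing it requires root and is a deployment-level decision, not something dask would do):

```python
from pathlib import Path

# 0 = heuristic overcommit (kernel default), 1 = always overcommit,
# 2 = don't overcommit beyond swap plus a fraction of RAM
overcommit = Path("/proc/sys/vm/overcommit_memory")
print("vm.overcommit_memory =", overcommit.read_text().strip())
```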

jacobtomlinson commented 2 years ago

My reading of all the above comments is that this only applies to linux workers that do not have swap configured and are not running in containers.

I would be curious to know what percentage of workers that is as I think most are already enforcing cgroups at the resource manager level. Basically, every way I deploy Dask these days is either inside a container or on an HPC.

- LocalCluster is likely on a machine with swap
- SSHCluster is likely also on machines with swap
- KubeCluster uses Kubernetes
- EC2Cluster/AzureVMCluster/GCPCluster all use containers inside the VMs
- ECSCluster uses containers
- SLURMCluster/PBSCluster/etc. use HPC resource managers that generally enforce cgroups

Who are the affected users of this problem?

crusaderky commented 2 years ago

My reading of all the above comments is that this only applies to linux workers that do not have swap configured and are not running in containers.

Correct regarding the absence of a swap file. I think, however, that the spilling of executable memory is something that @gjoseph92 observed on Coiled, e.g. in Docker images. Do I understand correctly?

shughes-uk commented 2 years ago

I'm not sure if this comment belongs here, or a new issue.

cgroups have two limits, a 'hard' limit and a 'soft' limit. For v2 (and I think v1), the cgroup docs state:

If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure

The v1 docs are a bit less clear, but I suspect the same mechanism kicks in. It might explain the heavy swapping without an OOM kill that @gjoseph92 is talking about.

I think I'd strongly prefer that dask attempts to stay under the soft limit, and can automatically detect/use the limit. Without doing so it's just going to either end up in swap hell or get OOMKilled with no warning.

However dask achieves this, it should be comfortably compatible with an unprivileged container and a non-root user.

shughes-uk commented 2 years ago

I've created a PR to move this forward a bit (https://github.com/dask/distributed/pull/7051), with respect to detecting and obeying existing cgroup limits.
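For anyone following along: detecting an existing cgroup v2 limit doesn't need any privileges, roughly along these lines (a sketch in the spirit of that PR, not its actual implementation):

```python
from pathlib import Path

def cgroup_v2_memory_limits():
    # On cgroup v2, /proc/self/cgroup contains a single line like "0::/<path>".
    cg_path = Path("/proc/self/cgroup").read_text().strip().split("::")[-1]
    base = Path("/sys/fs/cgroup") / cg_path.lstrip("/")
    limits = {}
    for name in ("memory.high", "memory.max"):  # soft and hard limits
        f = base / name
        if f.exists():
            value = f.read_text().strip()
            limits[name] = None if value == "max" else int(value)
    return limits
```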

crusaderky commented 2 years ago

I think I'd strongly prefer that dask attempts to stay under the soft limit, and can automatically detect/use the limit. Without doing so it's just going to either end up in swap hell or get OOMKilled with no warning.

The thing is, the whole system is designed to be resilient to an abrupt OOM kill. What it's not resilient to is a worker becoming sluggish (but not completely unresponsive) due to swap thrashing. So an OOM kill is always preferable.

shughes-uk commented 2 years ago

Both are failure states though; an OOM kill is just more recoverable. Neither is desirable.