gjoseph92 opened this issue 2 years ago (status: Open)
> I'm inclined to say this should be the responsibility of the deployment. On the generic level this library usually operates, I consider this low level configuration rather hard to maintain

cgroups should be a pretty stable API at this point. If we were just talking about `resource.setrlimit`, would you feel the same way? Is it just that cgroups sound too low-level/complex? Because it sounds like cgroups is just the modern equivalent of `ulimit -m` / `resource.setrlimit`.
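Purely as illustration of how close the two are in spirit, a minimal sketch (the cgroup name, path, and limit value are made up, and the cgroup would have to exist already and be writeable):

```python
import resource

limit = 4 * 2**30  # 4 GiB, purely illustrative

# ulimit -m / setrlimit style. Note: RLIMIT_RSS is accepted but not enforced
# on modern Linux kernels, which is part of the problem discussed below.
resource.setrlimit(resource.RLIMIT_RSS, (limit, limit))

# cgroup v1 style: write the same number into the memory controller's file.
with open("/sys/fs/cgroup/memory/dask-worker/memory.limit_in_bytes", "w") as f:
    f.write(str(limit))
```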
> I'm inclined to say this should be the responsibility of the deployment

Fair, but if it's the deployment's responsibility, then I think we shouldn't have the memory limit feature in the nanny at all. The way it's implemented isn't reliable enough.

To me, it's both simple to implement and quite useful, so I think it's reasonable for it to be the nanny's responsibility. But I'd be fine with removing the limit too.
+1. I like the idea of using the OS if possible and only falling back on the nanny polling system where that isn't available.
I am open to this if it actually solves a problem. I am used to having a resource/cluster manager around that kills misbehaving pods, so I am a bit biased. If this would be helpful for most users I am open to it, but I would like to get some feedback from people who are actually working with deployments.

@dchudz @jacobtomlinson any opinions? Would this be helpful? Would you prefer implementing this as part of the deployment, or should dask do this?

Just a bunch of questions in the meantime.
Update here: unsurprisingly, you can't normally use cgroups if you're already inside a Docker container.

I think we should still try to do it in dask (for non-containerized workloads, it would be helpful) and fall back on polling if `/sys/fs/cgroup/<cgroup-name>` isn't writeable, as sketched below.

But as @fjetter said, if the deployment system is using containers, it will need to either set the limit itself, or give dask permission to access cgroups.

https://stackoverflow.com/questions/32534203/mounting-cgroups-inside-a-docker-container
https://groups.google.com/g/kubernetes-dev/c/TBNzAxNXPOA
(I also tested this, SSHing into a Coiled worker.)
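Roughly the kind of check I mean (just a sketch; it assumes the cgroup v1 layout):

```python
import os

def can_use_cgroups() -> bool:
    """Rough check: can we create/modify memory cgroups, or must we fall back to polling?"""
    base = "/sys/fs/cgroup/memory"  # cgroup v1 layout; v2 uses a unified hierarchy
    # Inside an unprivileged container this mount is typically read-only,
    # so the check fails and we'd keep the existing Nanny polling behaviour.
    return os.path.isdir(base) and os.access(base, os.W_OK | os.X_OK)
```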
This feels like a duplicate of #4558, and the case you describe could be useful there. I generally agree with @fjetter that this should be the responsibility of the deployment tooling or the OOM killer. I don't think Dask itself should be tinkering with cgroups, especially if it requires elevated privileges in a container environment.

I wonder if there is an alternative where we could just get things to trigger the OOM killer as expected.
Forgot to write this, but for posterity: I get the sentiment that deployment tooling should be responsible for setting memory limits if desired, but that's not quite the model that dask offers.
The Nanny is, in effect, a deployment tool offered by dask. Its job is to manage a Worker subprocess, kill it if it uses too much memory, and restart it if it dies. So I'd argue it's entirely within scope for the Nanny to enforce memory limits at a system level, since it's a deployment tool and that's exactly what it promises to do when `distributed.worker.memory.terminate` is set.

> I wonder if there is an alternative where we could just get things to trigger the OOM as expected
That's worth some research. Based on my understanding of the problem, though (https://github.com/dask/distributed/issues/6110#issuecomment-1109052228), it would basically involve disabling the disk cache, which is of course not acceptable. My guess is that anything we could do here would be more intrusive and fiddly than using cgroups.
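For reference, the knob in question is the existing config value, which can be set like this (0.95 is the current default fraction; nothing new here, just the existing setting):

```python
import dask

# Fraction of the memory limit at which the Nanny kills the worker process;
# setting it to False disables the Nanny-level terminate behaviour entirely.
dask.config.set({"distributed.worker.memory.terminate": 0.95})
```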
> If we don't want to implement it in a way that actually works (cgroups)

The implication that polling "does not actually work" feels very drastic to me. It works fine if there is a (small) swap file mounted. It breaks specifically when the OS starts deallocating the executable's memory, which only happens after the swap file is full.

I can think of ways to reduce the problem. For example:

- We could have a dynamic polling interval which automatically drops to as little as 1ms when you approach the 95% threshold (sketched below).
- We could be a lot more conservative in setting the automatic memory limit. E.g. we can easily detect with psutil if there's a swap file and take an educated guess that …
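A rough sketch of what I mean by the first point (thresholds and intervals are made up, and `memory_limit` is whatever limit the nanny is already enforcing):

```python
import psutil

def next_poll_interval(proc: psutil.Process, memory_limit: int) -> float:
    """Shrink the nanny's polling interval as the worker approaches its memory limit."""
    frac = proc.memory_info().rss / memory_limit
    if frac > 0.90:
        return 0.001  # 1ms once we're close to the terminate threshold
    if frac > 0.70:
        return 0.02
    return 0.1        # relaxed cadence while far from the limit

# Input for a more conservative automatic memory limit:
has_swap = psutil.swap_memory().total > 0
```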
A couple of interesting things I've found on the topic. Not solutions dask would implement, but useful for learning more.

It could be worth playing with various `/proc/sys/vm` settings, like disabling overcommitting. Dask would not set these directly, but they could be things we'd recommend in the docs, and that deployment systems like dask-cloudprovider, dask-k8s, coiled, etc. might be able to set.

For example, `/proc/sys/vm/overcommit_memory = 2`, maybe with `/proc/sys/vm/admin_reserve_kbytes`.
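Not something dask would set itself, but for the record this is all it takes on a box where you're root (values illustrative; normally you'd do it via sysctl or the deployment tooling):

```python
from pathlib import Path

# Requires root. Strict overcommit accounting instead of the default heuristics:
Path("/proc/sys/vm/overcommit_memory").write_text("2\n")
# Keep some memory reserved for root/admin so the box stays reachable (~256 MiB here):
Path("/proc/sys/vm/admin_reserve_kbytes").write_text("262144\n")
```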
My reading of all the above comments is that this only applies to Linux workers that do not have swap configured and are not running in containers.

I would be curious to know what percentage of workers that is, as I think most are already enforcing cgroups at the resource manager level. Basically, every way I deploy Dask these days is either inside a container or on an HPC system.
- `LocalCluster` is likely on a machine with swap
- `SSHCluster` is likely also on machines with swap
- `KubeCluster` uses Kubernetes
- `EC2Cluster`/`AzureVMCluster`/`GCPCluster` all use containers inside the VMs
- `ECSCluster` uses containers
- `SLURMCluster`/`PBSCluster`/etc. use HPC resource managers that generally enforce cgroups
Who are the users affected by this problem?
> My reading of all the above comments is that this only applies to Linux workers that do not have swap configured and are not running in containers.

Correct for the absence of a swap file. I think, however, that the spilling of the executable's memory is something that @gjoseph92 observed on Coiled, e.g. in Docker images. Do I understand correctly?
I'm not sure if this comment belongs here or in a new issue.

cgroups have two limits, a "hard" limit and a "soft" limit. For v2 (and I think v1) the cgroup docs state:

> If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure

The v1 docs are a bit less clear, but I suspect the same mechanism kicks in. It might explain the heavy swapping without an OOM kill happening that @gjoseph92 is talking about.

I think I'd strongly prefer that dask attempts to stay under the soft limit, and can automatically detect and use that limit. Without doing so it's just going to either end up in swap hell or get OOMKilled with no warning.

However dask achieves this, it should be comfortably compatible with an unprivileged container and a non-root user.
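For cgroup v2, the detection could look roughly like this (paths follow the standard unified hierarchy; treat it as a sketch, not as what any existing PR implements):

```python
from pathlib import Path

def cgroup_v2_memory_limits(base: str = "/sys/fs/cgroup") -> tuple:
    """Return (soft, hard) memory limits in bytes; None means unlimited or not available."""
    limits = []
    for name in ("memory.high", "memory.max"):  # soft ("high") and hard ("max") limits
        path = Path(base) / name
        if not path.exists():
            limits.append(None)  # not cgroup v2, or no memory controller delegated here
            continue
        raw = path.read_text().strip()
        limits.append(None if raw == "max" else int(raw))
    return tuple(limits)

# A worker could then cap its memory_limit at the soft limit whenever one is set.
```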
I've created an MR to move this forward a bit with respect to detecting and obeying existing cgroup limits: https://github.com/dask/distributed/pull/7051
> I think I'd strongly prefer that dask attempts to stay under the soft limit, and can automatically detect and use that limit. Without doing so it's just going to either end up in swap hell or get OOMKilled with no warning.

The thing is, the whole system is designed to be resilient to an abrupt OOM kill. What it's not resilient to is a worker becoming sluggish (but not completely unresponsive) due to swap file thrashing. So an OOM kill is always preferable.

Both are failure states though; an OOM kill is just the more recoverable one. Neither is desirable.
In https://github.com/dask/distributed/issues/6110#issuecomment-1105837219, we found that workers were running themselves out of memory to the point where the machines became unresponsive. Because the memory limit in the Nanny is implemented at the application level, and in a periodic callback no less, there's nothing stopping workers from successfully allocating more memory than they're allowed to, as long as the Nanny doesn't catch them.
And as it turns out, if you allocate enough memory that you start heavily swapping (my guess, unconfirmed), but not so much that you get OOMkilled by the OS, it seems that you can effectively lock up the Nanny (and worker) Python processes, so the bad worker never gets caught, and everything just hangs. Memory limits are an important failsafe for stability, to un-stick this sort of situation.
A less brittle solution than this periodic callback might be to use the OS to enforce hard limits.
The logical approach would just be `resource.setrlimit(resource.RLIMIT_RSS, (memory_limit_in_bytes, memory_limit_in_bytes))`. However, it turns out that `RLIMIT_RSS` is not supported on newer Linux kernels. The solution nowadays appears to be cgroups.

Also relevant: https://jvns.ca/blog/2017/02/17/mystery-swap, https://unix.stackexchange.com/a/621576.
We could use `memory.memsw.limit_in_bytes` to limit total RAM+swap usage, or `memory.limit_in_bytes` to limit just RAM usage, or some smart combo of both. (Allowing a little swap might still be good for unmanaged memory.)

Obviously, this whole discussion is Linux-specific. I haven't found (or tried that hard to find) macOS and Windows approaches; I think there might be something for Windows, but it sounds like probably not for macOS. We can always keep the current periodic callback behavior around for them, though.
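A sketch of what the cgroup v1 version might look like for a single worker (the cgroup name and swap headroom are made up, error handling is omitted, and it needs write access to the cgroup hierarchy plus swap accounting enabled):

```python
import os

def limit_worker_memory(pid: int, limit: int, swap_headroom: int = 512 * 2**20) -> None:
    """Put a worker process in its own cgroup, cap RAM, and allow a little swap on top."""
    cg = "/sys/fs/cgroup/memory/dask-worker"  # illustrative cgroup name
    os.makedirs(cg, exist_ok=True)
    with open(f"{cg}/memory.limit_in_bytes", "w") as f:
        f.write(str(limit))                   # RAM-only limit
    with open(f"{cg}/memory.memsw.limit_in_bytes", "w") as f:
        f.write(str(limit + swap_headroom))   # RAM + swap, so a little spill is allowed
    with open(f"{cg}/cgroup.procs", "w") as f:
        f.write(str(pid))                     # move the worker process into the cgroup
```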
cc @fjetter