Closed: fjetter closed this issue 2 years ago
FYI, you're allocating 7.4GiB (=7.9GB) on a machine that has 8GiB. If you want to allocate in GB, use 1000**3.
> FYI, you're allocating 7.4GiB (=7.9GB) on a machine that has 8GiB. If you want to allocate in GB, use 1000**3.
I have a tendency to mix the two up. I think I'm using it right, am I not? GiB uses base 2, and AWS t3.large instances are listed as 8GiB, i.e. 8 * 1024**3 B? https://aws.amazon.com/ec2/instance-types/t3/
Anyhow, this is a problem regardless of units.
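For the record, a quick plain-Python sketch of the two conventions, matching the numbers above:

```python
GB = 1000**3   # gigabyte (decimal, SI)
GiB = 1024**3  # gibibyte (binary, IEC)

nbytes = int(7.4 * GiB)
print(f"{nbytes / GiB:.1f} GiB == {nbytes / GB:.1f} GB")  # 7.4 GiB == 7.9 GB
print(f"t3.large RAM: {8 * GiB / GB:.2f} GB")             # 8 GiB == 8.59 GB
```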
Does dask protect itself from using all available memory? We can use cgroup limits to reserve a small amount for the core system processes, but my understanding is that dask has some internal mechanism that should prevent it from just consuming everything and getting itself OOM-killed.
If that mechanism does not take into account other processes using memory, then maybe this is more a dask fix than a platform issue? If it does, and the mechanism is broken just on the platform, we can definitely figure it out and fix it.
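For reference, a minimal sketch of reading the cgroup memory limit a process actually runs under (assuming cgroup v2; the path differs under v1):

```python
from pathlib import Path

# cgroup v2: "max" means unlimited, otherwise the value is in bytes.
raw = Path("/sys/fs/cgroup/memory.max").read_text().strip()
limit = None if raw == "max" else int(raw)
print("cgroup memory limit:", "unlimited" if limit is None else f"{limit / 1024**3:.1f} GiB")
```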
> Does dask protect itself from using all available memory?
It has a mechanism, but that only works if the memory_limit is properly configured. You need to tell dask how much memory it is allowed to use, and this configuration is far from reality. See https://github.com/coiled/feedback/issues/185
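For illustration, a minimal sketch of configuring that limit explicitly; LocalCluster stands in for the Coiled deployment here, and the 6GiB figure is made up to leave headroom on an 8GiB machine:

```python
from distributed import Client, LocalCluster

# memory_limit="auto" (the default) divides total physical memory among
# workers; setting it explicitly leaves headroom for the OS, snapd, etc.
cluster = LocalCluster(n_workers=1, memory_limit="6GiB")
client = Client(cluster)
```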
> Does dask protect itself from using all available memory?
@shughes-uk assume that it does not: https://github.com/dask/distributed/issues/6177. It has a memory_limit parameter that will sometimes kill the worker if it uses too much memory, but it only works if the limit is well under the amount of available memory and the allocations are small. There's nothing dask does to prevent a single large allocation (exactly like the one @fjetter is doing here) from exceeding that limit. Using cgroups is what that issue proposes, though consensus is against doing that. So let's assume that dask will not enforce any sort of memory limit, and that this is expected to be the job of the deployment system (Coiled in this case).
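For context, dask's worker memory manager acts at fractions of memory_limit; these are the documented defaults, set here explicitly only for illustration:

```python
import dask

# Fractions of the worker's memory_limit at which each action kicks in.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny kills the worker
})
```

The monitor polls process memory on an interval, which is why a single allocation can jump straight past the terminate threshold without ever being observed.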
And no, the limit also does not take into account how much memory is used by other things on the system, just how much total physical memory is available, or what the cgroup limit is.
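To make that distinction concrete, here's a minimal sketch using psutil (which is, as far as I know, also what distributed consults for the machine total); the gap between the two numbers is what snapd and the rest of the system occupy:

```python
import psutil

vm = psutil.virtual_memory()
print(f"total:     {vm.total / 1024**3:.1f} GiB")      # what memory_limit='auto' is derived from
print(f"available: {vm.available / 1024**3:.1f} GiB")  # what is actually free right now
```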
I'm not familiar with snapd. These are the snap packages I see on an EC2 worker instance:
ubuntu@ip-10-6-63-8:~$ snap list
Name              Version        Rev    Tracking         Publisher   Notes
amazon-ssm-agent  3.1.1188.0     5656   latest/stable/…  aws✓        classic
core18            20220706       2538   latest/stable    canonical✓  base
core20            20220805       1611   latest/stable    canonical✓  base
lxd               4.0.9-8e2046b  22753  4.0/stable/…     canonical✓  -
snapd             2.56.2         16292  latest/stable    canonical✓  snapd
(I'll continue digging, using this issue as a notepad as I find things.)
I suggest that snapd being terminated by the systemd watchdog is a symptom of the whole system becoming unresponsive rather than the cause.
If this issue is triggered by a process coming very close to its memory limit, this may already be fixed by https://github.com/coiled/feedback/issues/185, since we likely kill the process before it gets to this stage.
Closing. I think this has been resolved together with https://github.com/coiled/feedback/issues/185
When operating close to the VM's memory limit (but still, subjectively, far enough from it that this shouldn't cause problems), the dask worker process appears to freeze for up to a couple of minutes.
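For context, a guess at the shape of the workload in question (the 7.4GiB figure comes from the discussion below; Client() is a stand-in for connecting to the real cluster):

```python
import numpy as np
from distributed import Client

def allocate_gib(n_gib: float) -> int:
    # A single large allocation: float64 is 8 bytes per element.
    a = np.ones(int(n_gib * 1024**3) // 8, dtype=np.float64)
    return a.nbytes

client = Client()  # stand-in for the real cluster connection
print(client.submit(allocate_gib, 7.4).result())
```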
The worker logs reveal a bit of what's going on.
Note: these logs were generated by running a lambda function prior to the above task, hence the logs.
The below is an excerpt of the logs. There appears to be some kind of 5 min watchdog timeout being triggered. Whatever happens during these five minutes apparently also freezes the python process, since we're receiving warnings about the event loop being stuck.
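Those event-loop warnings come from distributed's tick monitor; for reference, the relevant settings, shown here with their documented defaults:

```python
import dask

dask.config.set({
    "distributed.admin.tick.interval": "20ms",  # how often the event loop is probed
    "distributed.admin.tick.limit": "3s",       # a gap longer than this logs a warning
})
```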
Full snapd logs
https://cloud.coiled.io/dask-engineering/clusters/49105/2/details
Related: https://github.com/coiled/feedback/issues/185