coiled / feedback

A place to provide Coiled feedback

Snapd / watchdog timeout while python process is frozen if process operates close to memory limit #186

Closed fjetter closed 2 years ago

fjetter commented 2 years ago

When operating close to the VM's memory limit (but still, subjectively, with enough headroom that it shouldn't cause problems), the dask worker process appears to freeze for up to a couple of minutes.

import math
import time

import psutil
from coiled import Cluster
from dask.distributed import Client

cluster = Cluster(
    worker_vm_types=['t3.large'],
    n_workers=1,
)
client = Client(cluster)  # connect a client to the cluster


def allocate_some_memory(num_gib):
    num_bytes = int(math.floor(1024**3 * num_gib))
    mem = b" " * num_bytes  # a single large allocation held in memory
    rss = psutil.Process().memory_info().rss
    time.sleep(5)  # keep the allocation alive for a few seconds
    return rss


client.submit(allocate_some_memory, 6.9).result()

The worker logs reveal a bit of what's going on

Note: a lambda function was run out-of-band prior to the task above, which is why it shows up in the first log line below.

Below is an excerpt of the logs. There appears to be some kind of 5 min watchdog timeout being triggered. Whatever happens during these 5 minutes apparently also freezes the Python process, since we're receiving warnings about the event loop being stuck for roughly the same duration.

Aug 12 10:44:10 ip-10-0-13-102 cloud-init[1231]: 2022-08-12 10:44:10,089 - distributed.worker - INFO - Run out-of-band function 'lambda'
Aug 12 10:49:07 ip-10-0-13-102 systemd[1]: snapd.service: Watchdog timeout (limit 5min)!
Aug 12 10:49:07 ip-10-0-13-102 cloud-init[1231]: 2022-08-12 10:49:07,554 - distributed.core - INFO - Event loop was unresponsive in Nanny for 292.68s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Aug 12 10:49:07 ip-10-0-13-102 cloud-init[1231]: 2022-08-12 10:49:07,651 - distributed.core - INFO - Event loop was unresponsive in Worker for 297.42s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Aug 12 10:49:07 ip-10-0-13-102 snapd[545]: SIGABRT: abort
...... # Many network related logs of snapd
Aug 12 10:49:07 ip-10-0-13-102 systemd[1]: snapd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 12 10:49:07 ip-10-0-13-102 systemd[1]: snapd.service: Failed with result 'watchdog'.
Aug 12 10:49:08 ip-10-0-13-102 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 1.
Aug 12 10:49:08 ip-10-0-13-102 systemd[1]: Stopped Snap Daemon.
Aug 12 10:49:08 ip-10-0-13-102 systemd[1]: Starting Snap Daemon...
Aug 12 10:49:08 ip-10-0-13-102 snapd[2041]: AppArmor status: apparmor is enabled and all features are available
Aug 12 10:49:09 ip-10-0-13-102 snapd[2041]: AppArmor status: apparmor is enabled and all features are available
Aug 12 10:49:09 ip-10-0-13-102 snapd[2041]: overlord.go:263: Acquiring state lock file
Aug 12 10:49:09 ip-10-0-13-102 snapd[2041]: overlord.go:268: Acquired state lock file
Aug 12 10:49:09 ip-10-0-13-102 snapd[2041]: daemon.go:247: started snapd/2.56.2 (series 16; classic) ubuntu/20.04 (amd64) linux/5.13.0-1031-aws.
Aug 12 10:49:09 ip-10-0-13-102 kernel: [  683.168564] loop7: detected capacity change from 0 to 8
Aug 12 10:49:09 ip-10-0-13-102 systemd[1225]: tmp-syscheck\x2dmountpoint\x2d104205793.mount: Succeeded.
Aug 12 10:49:09 ip-10-0-13-102 systemd[1]: tmp-syscheck\x2dmountpoint\x2d104205793.mount: Succeeded.
Aug 12 10:49:09 ip-10-0-13-102 snapd[2041]: daemon.go:340: adjusting startup timeout by 55s (pessimistic estimate of 30s plus 5s per snap)
Aug 12 10:49:09 ip-10-0-13-102 systemd[1]: Started Snap Daemon.
Aug 12 10:49:09 ip-10-0-13-102 dbus-daemon[527]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.62' (uid=0 pid=2041 comm="/usr/lib/snapd/snapd " label="unconfined")
Aug 12 10:49:09 ip-10-0-13-102 systemd[1]: Starting Time & Date Service...
Aug 12 10:49:09 ip-10-0-13-102 dbus-daemon[527]: [system] Successfully activated service 'org.freedesktop.timedate1'
Aug 12 10:49:09 ip-10-0-13-102 systemd[1]: Started Time & Date Service.
Full snapd logs

https://cloud.coiled.io/dask-engineering/clusters/49105/2/details

Related https://github.com/coiled/feedback/issues/185

ntabris commented 2 years ago

FYI, you're allocating 6.9 GiB (≈7.4 GB) on a machine that has 8 GiB. If you meant to allocate 6.9 GB, use 1000**3.

fjetter commented 2 years ago

FYI, you're allocating 6.9 GiB (≈7.4 GB) on a machine that has 8 GiB. If you meant to allocate 6.9 GB, use 1000**3.

I have a tendency to mix the two up, but I think I'm using it right, am I not? GiB uses base 2, and AWS t3.large instances are listed with 8 GiB, i.e. 8 * 1024**3 bytes: https://aws.amazon.com/ec2/instance-types/t3/
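
For reference, a quick sanity check of the two conventions (illustrative only):

GiB = 1024**3  # gibibyte, base 2
GB = 1000**3   # gigabyte, base 10

allocated = 6.9 * GiB               # what the reproducer above allocates
print(allocated / GB)               # ~7.41, i.e. about 7.4 GB
print((8 * GiB - allocated) / GiB)  # ~1.1 GiB of headroom on an 8 GiB t3.large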

Anyhow, this is a problem regardless of units.

shughes-uk commented 2 years ago

Does dask protect itself from using all available memory? We can use cgroup limits to reserve a small amount for the core system processes, but my understanding is that dask has some internal mechanism that should prevent it from just consuming everything and getting itself OOM-killed.

If that mechanism does not take into account memory used by other processes, then maybe this is more of a dask fix than a platform issue? If it does, and the mechanism is broken only on the platform, we can definitely figure it out and fix it.
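
(As an aside, such a cgroup limit is visible from inside the VM. A minimal sketch of the kind of lookup involved, assuming a cgroup v1 layout; cgroup v2 exposes /sys/fs/cgroup/memory.max instead, and the function name here is only illustrative:)

import psutil

def effective_memory_limit():
    # Use the cgroup limit if one is set and smaller than physical memory.
    total = psutil.virtual_memory().total
    try:
        with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
            cgroup_limit = int(f.read())
    except (OSError, ValueError):
        return total
    return min(total, cgroup_limit)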

fjetter commented 2 years ago

Does dask protect itself from using all available memory?

It has a mechanism, but that only works if memory_limit is properly configured. You need to tell dask how much memory it is allowed to use, and right now this configuration is far from reality. See https://github.com/coiled/feedback/issues/185
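
As a minimal local sketch (not the Coiled cluster above), explicitly telling dask its allowance looks like this; the 6 GiB value is just an illustration that leaves headroom on an 8 GiB machine:

from dask.distributed import Client, LocalCluster

# One worker that dask treats as having 6 GiB, leaving ~2 GiB for everything else.
cluster = LocalCluster(n_workers=1, threads_per_worker=2, memory_limit="6GiB")
client = Client(cluster)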

gjoseph92 commented 2 years ago

Does dask protect itself from using all available memory?

@shughes-uk assume that it does not: https://github.com/dask/distributed/issues/6177. It has a memory_limit parameter that will sometimes kill the worker if it uses too much memory, but that only works if the limit is far under the amount of available memory and the allocations are small. There's nothing dask does to prevent a single large allocation (exactly like the one @fjetter is doing here) from exceeding that limit. Using cgroups is what that issue proposes; consensus is against doing that, though. So let's assume that dask will not enforce any sort of memory limit, and that doing so is expected to be the job of the deployment system (Coiled in this case).

And no, the limit also does not take into account how much memory is used by other things on the system, just how much total physical memory is available, or what the cgroup limit is.
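
For reference, the mechanism in question is the set of worker memory thresholds, expressed as fractions of memory_limit (values shown are roughly the defaults; they only help when memory_limit reflects reality and allocations are incremental):

import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # stop executing new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny kills the worker
})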

ntabris commented 2 years ago

I'm not familiar with snapd. These are the snap packages I see on an EC2 worker instance:

ubuntu@ip-10-6-63-8:~$ snap list
Name              Version        Rev    Tracking         Publisher   Notes
amazon-ssm-agent  3.1.1188.0     5656   latest/stable/…  aws✓        classic
core18            20220706       2538   latest/stable    canonical✓  base
core20            20220805       1611   latest/stable    canonical✓  base
lxd               4.0.9-8e2046b  22753  4.0/stable/…     canonical✓  -
snapd             2.56.2         16292  latest/stable    canonical✓  snapd

(I'll continue to do some more digging, using this issue as a notepad as I find things.)

shughes-uk commented 2 years ago

I suggest that snapd being terminated by the systemd watchdog is a symptom of the whole system becoming unresponsive rather than the cause.

fjetter commented 2 years ago

If this issue is triggered by a process coming very close to its memory limit, it may already be fixed by https://github.com/coiled/feedback/issues/185, since we will likely kill the process before it gets to this stage.

fjetter commented 2 years ago

Closing. I think this has been resolved together with https://github.com/coiled/feedback/issues/185