CyberDem0n / bg_mon

Background worker for monitoring PostgreSQL

bg_mon process stuck in uninterruptible sleep #63

Open bo0ts opened 5 months ago

bo0ts commented 5 months ago

We are looking at an issue where some of our postgres instances running with the Zalando Operator (version 1.10.1, PVCs on NFS, Spilo image spilo-15:3.0-p1) sometimes get stuck during restart. The instances cannot be stopped with pg_ctl -m immediate stop, and after pg_ctl kill KILL, usually a single process stuck in uninterruptible sleep remains:

1000710+ 1191929  0.0  0.0 630156 26628 ?        Ds   Jan18   2:05 postgres: our-postgres: bg_mon

Unfortunately I missed capturing the stack of that process before finally rebooting the machine (I will certainly do so next time), but I would like to start a discussion here about whether there are any known issues around this, or whether you have any suspicions about what could cause it.
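For next time, two standard ways to capture the kernel stack of a D-state process (both need root; the PID is the one from the ps output above):

# Read the kernel stack of the stuck process directly:
cat /proc/1191929/stack

# Or, if sysrq is enabled, dump all blocked (D-state) tasks to the kernel log:
echo w > /proc/sysrq-trigger
dmesg | tail -n 80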

bo0ts commented 4 months ago

And this time I actually managed to capture a stack trace of the problem:

Feb 14 07:33:13 myhost.intern kernel: sysrq: Show Blocked State
Feb 14 07:33:13 myhost.intern kernel: task:postgres        state:D stack:0     pid:2581549 ppid:2581545 flags:0x00000006
Feb 14 07:33:13 myhost.intern kernel: Call Trace:
Feb 14 07:33:13 myhost.intern kernel:  <TASK>
Feb 14 07:33:13 myhost.intern kernel:  __schedule+0x35f/0x1360
Feb 14 07:33:13 myhost.intern kernel:  ? avc_has_perm_noaudit+0x8b/0xf0
Feb 14 07:33:13 myhost.intern kernel:  ? preempt_count_add+0x6a/0xa0
Feb 14 07:33:13 myhost.intern kernel:  ? nfs_do_lookup_revalidate+0x260/0x260 [nfs]
Feb 14 07:33:13 myhost.intern kernel:  schedule+0x5d/0xe0
Feb 14 07:33:13 myhost.intern kernel:  __nfs_lookup_revalidate+0xdd/0x120 [nfs]
Feb 14 07:33:13 myhost.intern kernel:  ? psi_poll_worker+0x4f0/0x4f0
Feb 14 07:33:13 myhost.intern kernel:  lookup_fast+0x74/0xe0
Feb 14 07:33:13 myhost.intern kernel:  path_openat+0xf6/0x1230
Feb 14 07:33:13 myhost.intern kernel:  do_filp_open+0x9e/0x130
Feb 14 07:33:13 myhost.intern kernel:  do_sys_openat2+0x96/0x150
Feb 14 07:33:13 myhost.intern kernel:  __x64_sys_openat+0x5c/0x80
Feb 14 07:33:13 myhost.intern kernel:  do_syscall_64+0x5b/0x80
Feb 14 07:33:13 myhost.intern kernel:  ? fpregs_restore_userregs+0x56/0xe0
Feb 14 07:33:13 myhost.intern kernel:  ? exit_to_user_mode_prepare+0x18f/0x1f0
Feb 14 07:33:13 myhost.intern kernel:  ? syscall_exit_to_user_mode+0x17/0x40
Feb 14 07:33:13 myhost.intern kernel:  ? do_syscall_64+0x67/0x80
Feb 14 07:33:13 myhost.intern kernel:  ? syscall_exit_to_user_mode_prepare+0x18e/0x1c0
Feb 14 07:33:13 myhost.intern kernel:  ? syscall_exit_to_user_mode+0x17/0x40
Feb 14 07:33:13 myhost.intern kernel:  ? do_syscall_64+0x67/0x80
Feb 14 07:33:13 myhost.intern kernel:  ? do_syscall_64+0x67/0x80
Feb 14 07:33:13 myhost.intern kernel:  ? do_syscall_64+0x67/0x80
Feb 14 07:33:13 myhost.intern kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd

CyberDem0n commented 1 month ago

It could get into such a state in a couple of places, where it:

  1. traverses procfs to collect statistics (CPU/memory/state) for postgres processes
  2. traverses $PGDATA to calculate its size by summing up the sizes of all files.

Looking at the stack trace, I can see a call to nfs_do_lookup_revalidate, which makes me think you are hitting the second case. In other words, you keep $PGDATA on NFS and it becomes unresponsive.
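The same hang can be reproduced without bg_mon: any plain directory walk over a hard-mounted NFS export blocks once the server stops responding, and the blocked open()/stat() calls can end up in uninterruptible sleep, which would match SIGKILL having no effect above. A quick check from the host, assuming $PGDATA points at the NFS-backed data directory:

# Does a simple size calculation hang in the same way?
du -sb "$PGDATA"

# Is the export mounted "hard" (blocking retries) or "soft"?
grep ' nfs' /proc/mounts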

bo0ts commented 1 month ago

Yes, we hit the second case. The odd thing is that no other application has trouble with NFS at that time, and it only happens during pod termination. There are no visible issues while the container is running.

CyberDem0n commented 1 month ago

it only happens during pod termination

It looks like a race condition: for some reason K8s decides to unmount the NFS volume before the Pod/Container is fully stopped.
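One way to verify that on the node would be to watch the NFS mounts and the remaining postgres processes while the Pod terminates, for example:

# Hypothetical node-side check: does the NFS mount disappear while
# the bg_mon process is still running?
watch -n1 'grep " nfs" /proc/mounts; pgrep -af "postgres: .*bg_mon"'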