staticfloat opened this issue 1 year ago
Looks like if the average CPU utilization is above 20% for many sequential intervals, then they need a reboot. FWIW, in my experience the only way to build fully reliable systems is to detect bad states and restart.
Also, in this case nginx is not something we really need to be in the business of ensuring the reliability of...
Should we switch from Nginx to something else?
I don't think this is nginx's fault; I'm pretty sure this is a kernel bug. Looking in `dmesg` after an `echo w > /proc/sysrq-trigger` showed that they were stuck in the kernel during an `fsync`.
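For reference, the capture procedure was roughly this (a sketch, assuming root and that sysrq is enabled; the `grep` pattern is just illustrative):

```
sysctl kernel.sysrq              # check which sysrq functions are enabled
echo w > /proc/sysrq-trigger     # 'w' = dump backtraces of blocked (D-state) tasks
dmesg -T | grep -A 40 'task:nginx'   # pull the nginx backtraces out of the kernel log
```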
Do you have a full backtrace, and also what's the filesystem?
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189515] task:nginx state:D stack: 0 pid:1248331 ppid:1248210 flags:0x00000000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189518] Call Trace:
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189519] <TASK>
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189520] __schedule+0x2cd/0x8a0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189523] ? crypto_shash_update+0x1e/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189527] schedule+0x4e/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189530] schedule_timeout+0x18f/0x2d0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189533] ? _raw_spin_unlock_irqrestore+0xe/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189536] ? del_timer_sync+0x40/0x40
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189543] io_schedule_timeout+0x1e/0x50
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189545] balance_dirty_pages+0x35a/0xe60
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189549] balance_dirty_pages_ratelimited+0x311/0x3e0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189552] generic_perform_write+0x138/0x1f0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189556] ext4_buffered_write_iter+0x98/0x160
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189561] ext4_file_write_iter+0x53/0x230
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189565] ? common_file_perm+0x72/0x180
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189567] do_iter_readv_writev+0x152/0x1b0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189570] do_iter_write+0x8c/0x1d0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189573] vfs_iter_write+0x19/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189577] ovl_write_iter+0x2d3/0x490 [overlay]
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189586] do_iter_readv_writev+0x152/0x1b0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189588] do_iter_write+0x8c/0x1d0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189590] vfs_writev+0x83/0x150
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189593] do_pwritev+0x97/0xe0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189595] __x64_sys_pwritev+0x21/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189597] do_syscall_64+0x5c/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189601] ? __x64_sys_pwritev+0x21/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189602] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189604] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189607] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189609] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189613] RIP: 0033:0x7fa00ea0c829
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189616] RSP: 002b:00007ffde9500e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000128
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189618] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa00ea0c829
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189620] RDX: 0000000000000002 RSI: 00007ffde9500e40 RDI: 00000000000001da
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189622] RBP: 000055e6541b7058 R08: 0000000000000000 R09: 0000000000002000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189623] R10: 0000000015338000 R11: 0000000000000246 R12: 0000000000000000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189625] R13: 0000000015338000 R14: 0000000000000000 R15: 00007ffde9500e20
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189627] </TASK>
The root filesystem is `ext4` (which is reinforced by the stack trace above), but since it's running inside of Docker, `overlayfs` is likely involved as well. I unfortunately do not have multiple stack traces to verify at which point in the stack above it was stuck.
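To double-check what the write path looks like, something like this should tell us the backing filesystem and the storage driver (a sketch; `/var/lib/docker` is the default data root and may differ here):

```
findmnt -T /var/lib/docker             # which filesystem backs the Docker data root
docker info --format '{{.Driver}}'     # which storage driver is in use, e.g. overlay2
```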
I thought you said it was in `fsync`?
Indeed I did, but I must have crossed some wires, because it was actually a `dpkg` task that was stuck in `fsync`:
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189635] task:dpkg state:D stack: 0 pid:1251567 ppid:1250276 flags:0x00004000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189637] Call Trace:
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189638] <TASK>
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189639] __schedule+0x2cd/0x8a0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189642] schedule+0x4e/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189644] io_schedule+0x16/0x40
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189647] wait_on_page_bit_common+0x16e/0x3b0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189651] ? filemap_invalidate_unlock_two+0x50/0x50
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189654] wait_on_page_bit+0x3f/0x50
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189657] wait_on_page_writeback+0x26/0x80
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189659] __filemap_fdatawait_range+0x9f/0x120
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189663] ? do_writepages+0xc5/0x1c0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189665] ? __ext4_mark_inode_dirty+0x200/0x200
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189668] ? do_writepages+0xc5/0x1c0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189671] file_write_and_wait_range+0xd3/0xf0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189675] ext4_sync_file+0xf6/0x320
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189677] vfs_fsync_range+0x49/0x90
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189679] do_fsync+0x3d/0x80
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189681] ? __x64_sys_ftruncate+0x1b/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189686] __x64_sys_fsync+0x14/0x20
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189688] do_syscall_64+0x5c/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189690] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189692] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189695] RIP: 0033:0x7f123a0781a7
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189697] RSP: 002b:00007ffd029d10b8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189699] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f123a0781a7
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189700] RDX: 0000000000000000 RSI: 00000000000001ed RDI: 0000000000000004
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189701] RBP: 000056408be47fa0 R08: 0000000000000000 R09: 0000000000002800
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189702] R10: 000056408b9677b6 R11: 0000000000000246 R12: 000056408bde52a0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189704] R13: 00000000000001ed R14: 000056408bdf6580 R15: 0000000000000000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189706] </TASK>
OK, there are a couple of kernel bugs with backtraces that look like this, but nothing particularly recent. I would recommend a kernel upgrade, and if it still happens, let me know next time and I can try to diagnose on a live system.
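For concreteness, on an Ubuntu AWS image the upgrade would look roughly like this (a sketch, assuming the stock `linux-aws` metapackage is installed):

```
uname -r                                          # current kernel, e.g. 5.15.0-1022-aws
sudo apt-get update
sudo apt-get install --only-upgrade linux-aws     # pull in the latest -aws kernel
sudo reboot                                       # new kernel takes effect after reboot
```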
We upgraded from `5.15.0-1022-aws` (the version where the bad stuff happened) to `5.15.0-1041-aws` (the latest as of yesterday). If it happens again, I'll let you know.
In my ideal world we would have monitoring that detects this, records whatever `dmesg` info it needs to diagnose this, notifies us (on #infrastructure?), and automatically reboots the machine. Or would it be better to keep this machine around for debugging and replace it with a new machine?
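A minimal sketch of what that watchdog could look like (the webhook URL, threshold, and log path are placeholders, and the actual reboot is deliberately left commented out):

```
#!/bin/bash
# Hypothetical watchdog: if nginx workers sit in uninterruptible sleep (D state)
# for several consecutive checks, capture diagnostics, notify, and (optionally) reboot.
THRESHOLD=5                                            # consecutive bad intervals before acting
WEBHOOK="https://hooks.slack.com/services/CHANGE_ME"   # placeholder

bad=0
while sleep 60; do
    # Any nginx process currently in D state?
    if ps -eo stat=,comm= | awk '$1 ~ /^D/ && $2 == "nginx"' | grep -q .; then
        bad=$((bad + 1))
    else
        bad=0
    fi

    if [ "$bad" -ge "$THRESHOLD" ]; then
        echo w > /proc/sysrq-trigger                   # dump blocked-task backtraces
        dmesg -T > "/var/tmp/dstate-$(date +%s).log"   # record them for later diagnosis
        curl -s -X POST -H 'Content-Type: application/json' \
             -d '{"text":"nginx stuck in D state on loadbalancer-us-east"}' "$WEBHOOK"
        # systemctl reboot                             # uncomment once we trust this
        bad=0
    fi
done
```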
We had a discussion about the `us-east` loadbalancer getting slow. Initial inspection showed that the network interface was rarely achieving more than 180Mbps out. Diving deeper, it was found that some `nginx` processes had become stuck in "uninterruptible sleep" (`D` in `ps` output). Looking in `dmesg` after an `echo w > /proc/sysrq-trigger` showed that they were stuck in the kernel during an `fsync`.

This is a fairly pathological failure, but there are a few things we could do to ameliorate it:

- More `nginx` processes. Right now we only have 2 processes on the loadbalancer; we could probably double this and go up to 2x per core (so 4 processes total) without any harmful effects, which would at least delay the problem in the future. This needs a templating step on the "optimized" nginx config to insert `$(($(nproc) * 2))` into the `worker_processes` directive (see the sketch at the end of this comment).

I am loath to do something drastic like auto-reboot the loadbalancer, because it is supposed to be the piece that you don't have to reboot. If this happens again, I'll consider it.
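A sketch of the templating step mentioned above (the template filename and the `__WORKER_PROCESSES__` marker are hypothetical; the substitution itself just computes 2x `nproc`):

```
# Hypothetical template: nginx.conf.template contains the line
#   worker_processes __WORKER_PROCESSES__;
# Render it at deploy time with 2 workers per core, then validate and reload.
workers=$(($(nproc) * 2))
sed "s/__WORKER_PROCESSES__/${workers}/" nginx.conf.template > /etc/nginx/nginx.conf
nginx -t && systemctl reload nginx
```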