staticfloat opened this issue 1 year ago
Looks like if the average CPU utilization is above 20% for many sequential intervals, then they need a reboot. FWIW, in my experience the only way to build fully reliable systems is to detect bad states and restart.
Also, in this case nginx is not something we really need to be in the business of ensuring the reliability of...
Should we switch from Nginx to something else?
I don't think this is nginx's fault; I'm pretty sure this is a kernel bug. Looking in `dmesg` after an `echo w > /proc/sysrq-trigger` showed that they were stuck in the kernel during an `fsync`.
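For reference, the capture procedure was roughly this (a sketch, assuming root and that sysrq is enabled; the `grep` pattern is just illustrative):

```
sysctl kernel.sysrq              # check which sysrq functions are enabled
echo w > /proc/sysrq-trigger     # 'w' = dump backtraces of blocked (D-state) tasks
dmesg -T | grep -A 40 'task:nginx'   # pull the nginx backtraces out of the kernel log
```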
Do you have a full backtrace, and also what's the filesystem?
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189515] task:nginx state:D stack: 0 pid:1248331 ppid:1248210 flags:0x00000000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189518] Call Trace:
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189519] <TASK>
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189520] __schedule+0x2cd/0x8a0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189523] ? crypto_shash_update+0x1e/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189527] schedule+0x4e/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189530] schedule_timeout+0x18f/0x2d0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189533] ? _raw_spin_unlock_irqrestore+0xe/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189536] ? del_timer_sync+0x40/0x40
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189543] io_schedule_timeout+0x1e/0x50
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189545] balance_dirty_pages+0x35a/0xe60
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189549] balance_dirty_pages_ratelimited+0x311/0x3e0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189552] generic_perform_write+0x138/0x1f0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189556] ext4_buffered_write_iter+0x98/0x160
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189561] ext4_file_write_iter+0x53/0x230
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189565] ? common_file_perm+0x72/0x180
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189567] do_iter_readv_writev+0x152/0x1b0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189570] do_iter_write+0x8c/0x1d0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189573] vfs_iter_write+0x19/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189577] ovl_write_iter+0x2d3/0x490 [overlay]
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189586] do_iter_readv_writev+0x152/0x1b0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189588] do_iter_write+0x8c/0x1d0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189590] vfs_writev+0x83/0x150
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189593] do_pwritev+0x97/0xe0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189595] __x64_sys_pwritev+0x21/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189597] do_syscall_64+0x5c/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189601] ? __x64_sys_pwritev+0x21/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189602] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189604] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189607] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189609] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189613] RIP: 0033:0x7fa00ea0c829
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189616] RSP: 002b:00007ffde9500e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000128
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189618] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa00ea0c829
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189620] RDX: 0000000000000002 RSI: 00007ffde9500e40 RDI: 00000000000001da
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189622] RBP: 000055e6541b7058 R08: 0000000000000000 R09: 0000000000002000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189623] R10: 0000000015338000 R11: 0000000000000246 R12: 0000000000000000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189625] R13: 0000000015338000 R14: 0000000000000000 R15: 00007ffde9500e20
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189627] </TASK>
The root filesystem is `ext4` (which is reinforced by the stack trace above), but since it's running inside of Docker, `overlayfs` is likely involved as well. I unfortunately do not have multiple stack traces to verify at which point in the stack above it was stuck.
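To double-check what the write path looks like, something like this should tell us the backing filesystem and the storage driver (a sketch; `/var/lib/docker` is the default data root and may differ here):

```
findmnt -T /var/lib/docker             # which filesystem backs the Docker data root
docker info --format '{{.Driver}}'     # which storage driver is in use, e.g. overlay2
```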
I thought you said it was in `fsync`?
Indeed I did, but I must have crossed some wires, because it was actually a `dpkg` task that was stuck in `fsync`:
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189635] task:dpkg state:D stack: 0 pid:1251567 ppid:1250276 flags:0x00004000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189637] Call Trace:
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189638] <TASK>
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189639] __schedule+0x2cd/0x8a0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189642] schedule+0x4e/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189644] io_schedule+0x16/0x40
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189647] wait_on_page_bit_common+0x16e/0x3b0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189651] ? filemap_invalidate_unlock_two+0x50/0x50
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189654] wait_on_page_bit+0x3f/0x50
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189657] wait_on_page_writeback+0x26/0x80
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189659] __filemap_fdatawait_range+0x9f/0x120
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189663] ? do_writepages+0xc5/0x1c0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189665] ? __ext4_mark_inode_dirty+0x200/0x200
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189668] ? do_writepages+0xc5/0x1c0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189671] file_write_and_wait_range+0xd3/0xf0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189675] ext4_sync_file+0xf6/0x320
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189677] vfs_fsync_range+0x49/0x90
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189679] do_fsync+0x3d/0x80
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189681] ? __x64_sys_ftruncate+0x1b/0x30
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189686] __x64_sys_fsync+0x14/0x20
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189688] do_syscall_64+0x5c/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189690] ? do_syscall_64+0x69/0xc0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189692] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189695] RIP: 0033:0x7f123a0781a7
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189697] RSP: 002b:00007ffd029d10b8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189699] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f123a0781a7
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189700] RDX: 0000000000000000 RSI: 00000000000001ed RDI: 0000000000000004
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189701] RBP: 000056408be47fa0 R08: 0000000000000000 R09: 0000000000002800
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189702] R10: 000056408b9677b6 R11: 0000000000000246 R12: 000056408bde52a0
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189704] R13: 00000000000001ed R14: 000056408bdf6580 R15: 0000000000000000
Aug 28 18:32:38 loadbalancer-us-east kernel: [24799645.189706] </TASK>
OK, there are a couple of kernel bugs with backtraces that look like this, but nothing particularly recent. I would recommend a kernel upgrade, and if it still happens, let me know next time and I can try to diagnose on a live system.
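For concreteness, on an Ubuntu AWS image the upgrade would look roughly like this (a sketch, assuming the stock `linux-aws` metapackage is installed):

```
uname -r                                          # current kernel, e.g. 5.15.0-1022-aws
sudo apt-get update
sudo apt-get install --only-upgrade linux-aws     # pull in the latest -aws kernel
sudo reboot                                       # new kernel takes effect after reboot
```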
We upgraded from `5.15.0-1022-aws` (the version where the bad stuff happened) to `5.15.0-1041-aws` (the latest as of yesterday). If it happens again, I'll let you know.
In my ideal world we would have monitoring that detects this, records whatever `dmesg` info it needs to diagnose this, notifies us (on #infrastructure?), and automatically reboots the machine. Or would it be better to keep this machine around for debugging and replace it with a new machine?
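A minimal sketch of what that watchdog could look like (the webhook URL, threshold, and log path are placeholders, and the actual reboot is deliberately left commented out):

```
#!/bin/bash
# Hypothetical watchdog: if nginx workers sit in uninterruptible sleep (D state)
# for several consecutive checks, capture diagnostics, notify, and (optionally) reboot.
THRESHOLD=5                                            # consecutive bad intervals before acting
WEBHOOK="https://hooks.slack.com/services/CHANGE_ME"   # placeholder

bad=0
while sleep 60; do
    # Any nginx process currently in D state?
    if ps -eo stat=,comm= | awk '$1 ~ /^D/ && $2 == "nginx"' | grep -q .; then
        bad=$((bad + 1))
    else
        bad=0
    fi

    if [ "$bad" -ge "$THRESHOLD" ]; then
        echo w > /proc/sysrq-trigger                   # dump blocked-task backtraces
        dmesg -T > "/var/tmp/dstate-$(date +%s).log"   # record them for later diagnosis
        curl -s -X POST -H 'Content-Type: application/json' \
             -d '{"text":"nginx stuck in D state on loadbalancer-us-east"}' "$WEBHOOK"
        # systemctl reboot                             # uncomment once we trust this
        bad=0
    fi
done
```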
We had a discussion about the `us-east` loadbalancer getting slow. Initial inspection showed that the network interface was rarely achieving more than 180Mbps out. Diving deeper, it was found that some `nginx` processes had become stuck in "uninterruptible sleep" (`D` in `ps` output). Looking in `dmesg` after an `echo w > /proc/sysrq-trigger` showed that they were stuck in the kernel during an `fsync`.

This is a fairly pathological failure, but there are a few things we could do to ameliorate it:

- More `nginx` processes. Right now we only have 2 processes on the loadbalancer; we could probably double this and go up to 2x per core (so 4 processes total) without any harmful effects, which would at least delay the problem in the future. This needs a templating step on the "optimized" nginx config to insert `$(($(nproc) * 2))` into the `worker_processes` directive (see the sketch at the end of this comment).

I am loath to do something drastic like auto-reboot the loadbalancer, because it is supposed to be the piece that you don't have to reboot. If this happens again, I'll consider it.
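A sketch of the templating step mentioned above (the template filename and the `__WORKER_PROCESSES__` marker are hypothetical; the substitution itself just computes 2x `nproc`):

```
# Hypothetical template: nginx.conf.template contains the line
#   worker_processes __WORKER_PROCESSES__;
# Render it at deploy time with 2 workers per core, then validate and reload.
workers=$(($(nproc) * 2))
sed "s/__WORKER_PROCESSES__/${workers}/" nginx.conf.template > /etc/nginx/nginx.conf
nginx -t && systemctl reload nginx
```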