kakra opened this issue 2 years ago
It doesn't look stuck, but it might be looping in LOGICAL_INO long enough to mess with RCU (the kernel thread will more or less stop dead).
Does bees report detecting a toxic extent at the same time?
Yes, and it had already been doing that for hours before. If you need a bees log from right before the crash, I could send it to you, but I don't want to disclose log information from the server here.
It might just be a normally long LOGICAL_INO call time then.
Well, it stalled the whole server. Even ssh into the system did not work, although none of the main system is on btrfs (it is on xfs instead), nothing touches btrfs on login, and all components that use btrfs run inside containers. Sending ctrl+alt+del on the console also did not work, so we had to hard-reboot the system. A few days ago I took the chance to update the host system to the latest packages and the latest 5.10 LTS kernel.
That seems a little extreme for LOGICAL_INO, especially on post-5.7 kernels. Also, when I hit the LOGICAL_INO slow cases, I don't get complaints from RCU.
Maybe it's a separate RCU bug?
Got the same thing (I assume) on 5.14.17-gentoo. Below is the last thing systemd logged before freezing. bees was running in a terminal, and yes, opening a Chrome tab seems to have been the last thing the system did.
Nov 30 21:48:18 powertux kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Nov 30 21:48:18 powertux kernel: rcu: 0-....: (20998 ticks this GP) idle=406/1/0x4000000000000000 softirq=4720614/4720614 fqs=5055
Nov 30 21:48:18 powertux kernel: (t=21001 jiffies g=10395189 q=475674)
Nov 30 21:48:18 powertux kernel: NMI backtrace for cpu 0
Nov 30 21:48:18 powertux kernel: CPU: 0 PID: 72955 Comm: chrome Tainted: P O 5.14.17-gentoo #1
Nov 30 21:48:18 powertux kernel: Hardware name: ASUS System Product Name/TUF GAMING H570-PRO WIFI, BIOS 0811 04/06/2021
Nov 30 21:48:18 powertux kernel: Call Trace:
Nov 30 21:48:18 powertux kernel: <IRQ>
Nov 30 21:48:18 powertux kernel: dump_stack_lvl+0x34/0x44
Nov 30 21:48:18 powertux kernel: nmi_cpu_backtrace.cold+0x32/0x69
Nov 30 21:48:18 powertux kernel: ? lapic_can_unplug_cpu+0x70/0x70
Nov 30 21:48:18 powertux kernel: nmi_trigger_cpumask_backtrace+0x7b/0x90
Nov 30 21:48:18 powertux kernel: rcu_dump_cpu_stacks+0xb0/0xde
Nov 30 21:48:18 powertux kernel: rcu_sched_clock_irq.cold+0xc4/0x1e6
Nov 30 21:48:18 powertux kernel: update_process_times+0x88/0xc0
Nov 30 21:48:18 powertux kernel: tick_sched_handle+0x2f/0x40
Nov 30 21:48:18 powertux kernel: tick_sched_timer+0x75/0xa0
Nov 30 21:48:18 powertux kernel: ? can_stop_idle_tick+0x80/0x80
Nov 30 21:48:18 powertux kernel: __hrtimer_run_queues+0x11b/0x250
Nov 30 21:48:18 powertux kernel: hrtimer_interrupt+0x10a/0x2b0
Nov 30 21:48:18 powertux kernel: __sysvec_apic_timer_interrupt+0x54/0xd0
Nov 30 21:48:18 powertux kernel: sysvec_apic_timer_interrupt+0x6d/0x90
Nov 30 21:48:18 powertux kernel: </IRQ>
Nov 30 21:48:18 powertux kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Nov 30 21:48:18 powertux kernel: RIP: 0010:_nv028370rm+0x39/0x4e0 [nvidia]
Nov 30 21:48:18 powertux kernel: Code: 48 85 d2 74 2e 48 8b 4f 08 31 c0 48 85 c9 74 0d 48 63 41 14 48 89 d6 48 29 c6 48 89 f0 48 3b 57 18 48 89 07 74 1b 48 8b 42 08 <48> 89 47 10 b8 01 00 00 00 48 83 c4 08 c3 66 0f 1f 84 00 00 00 00
Nov 30 21:48:18 powertux kernel: RSP: 0018:ffffa3f282ff3ba8 EFLAGS: 00000287
Nov 30 21:48:18 powertux kernel: RAX: ffffe00cc64dddc8 RBX: ffff984b4d6d5830 RCX: ffff984ae8f48d80
Nov 30 21:48:18 powertux kernel: RDX: ffffe00cc665cb88 RSI: ffffe00cc665eb7c RDI: ffff984b1ef82d00
Nov 30 21:48:18 powertux kernel: RBP: ffff984b1ef82d00 R08: 0000000000000020 R09: ffff984b1ef82d08
Nov 30 21:48:18 powertux kernel: R10: ffff984b05b14008 R11: ffff98521bf27000 R12: ffff984ae0b1aa38
Nov 30 21:48:18 powertux kernel: R13: ffffe00ccbd3d5f4 R14: ffff984ae0b1aa38 R15: ffff9849c7917410
Nov 30 21:48:18 powertux kernel: ? _nv034902rm+0xa8/0xe0 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv014538rm+0x31b/0x7f0 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv035204rm+0xac/0xe0 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv036727rm+0xb0/0x140 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv036726rm+0x30f/0x4f0 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv036721rm+0x60/0x70 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv036722rm+0x7b/0xb0 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv035112rm+0x40/0xe0 [nvidia]
Nov 30 21:48:18 powertux kernel: ? _nv000693rm+0x68/0x80 [nvidia]
Nov 30 21:48:18 powertux kernel: ? rm_cleanup_file_private+0xea/0x170 [nvidia]
Nov 30 21:48:18 powertux kernel: ? nvidia_dev_put+0xa9b/0xc50 [nvidia]
Nov 30 21:48:18 powertux kernel: ? nvidia_frontend_close+0x23/0x40 [nvidia]
Nov 30 21:48:18 powertux kernel: ? __fput+0x84/0x230
Nov 30 21:48:18 powertux kernel: ? task_work_run+0x54/0x90
Nov 30 21:48:18 powertux kernel: ? do_exit+0x33a/0xa00
Nov 30 21:48:18 powertux kernel: ? do_group_exit+0x2e/0x90
Nov 30 21:48:18 powertux kernel: ? __x64_sys_exit_group+0xf/0x10
Nov 30 21:48:18 powertux kernel: ? do_syscall_64+0x38/0x90
Nov 30 21:48:18 powertux kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
Nov 30 21:49:21 powertux kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Nov 30 21:49:21 powertux kernel: rcu: 0-....: (84001 ticks this GP) idle=406/1/0x4000000000000000 softirq=4720614/4720614 fqs=20332
Nov 30 21:49:21 powertux kernel: (t=84003 jiffies g=10395189 q=811405)
Nov 30 21:49:21 powertux kernel: NMI backtrace for cpu 0
Nov 30 21:49:21 powertux kernel: CPU: 0 PID: 72955 Comm: chrome Tainted: P O 5.14.17-gentoo #1
Nov 30 21:49:21 powertux kernel: Hardware name: ASUS System Product Name/TUF GAMING H570-PRO WIFI, BIOS 0811 04/06/2021
Nov 30 21:49:21 powertux kernel: Call Trace:
Nov 30 21:49:21 powertux kernel: <IRQ>
Nov 30 21:49:21 powertux kernel: dump_stack_lvl+0x34/0x44
Nov 30 21:49:21 powertux kernel: nmi_cpu_backtrace.cold+0x32/0x69
I'm giving this the kernel-bug label because it obviously is one. Keep the stack traces coming; maybe we can hit something a kernel dev can use. Alternatively, send complete blocked-task traces (e.g. via alt-sysrq-w) to the linux-btrfs mailing list directly.
In the last log: nvidia... Do you use multiple monitors?
No. Single Monitor.
Happened on our server last night; the bees crawler is involved. Known bug? (I know this isn't the latest LTS kernel version, but I won't update it before next weekend.)
If you need more information, I can send the full kernel logs via PM. This was just the first of many incidents.