Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0
630 stars 57 forks source link

rcu_sched self-detected stall on CPU (bees thread "task:crawl_340") #191

Open kakra opened 2 years ago

kakra commented 2 years ago

Happened on our server this night, the bees crawler is involved. Known bug? (I know this isn't the latest LTS kernel version, but I won't update that before next weekend)

Sep 06 04:04:34 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Sep 06 04:04:34 vch01 kernel: rcu:         2-....: (2100 ticks this GP) idle=d0a/1/0x4000000000000000 softirq=636803192/636803192 fqs=903 
Sep 06 04:04:34 vch01 kernel:         (t=2144 jiffies g=1622072605 q=1390315)
Sep 06 04:04:34 vch01 kernel: NMI backtrace for cpu 2
Sep 06 04:04:34 vch01 kernel: CPU: 2 PID: 212331 Comm: crawl_340 Tainted: G        W         5.10.27-gentoo #1
Sep 06 04:04:34 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Sep 06 04:04:34 vch01 kernel: Call Trace:
Sep 06 04:04:34 vch01 kernel:  <IRQ>
Sep 06 04:04:34 vch01 kernel:  dump_stack+0x58/0x6b
Sep 06 04:04:34 vch01 kernel:  nmi_cpu_backtrace.cold+0x32/0x69
Sep 06 04:04:34 vch01 kernel:  ? lapic_can_unplug_cpu+0x80/0x80
Sep 06 04:04:34 vch01 kernel:  nmi_trigger_cpumask_backtrace+0x7c/0x90
Sep 06 04:04:34 vch01 kernel:  rcu_dump_cpu_stacks+0x92/0xc0
Sep 06 04:04:34 vch01 kernel:  rcu_sched_clock_irq.cold+0x18b/0x3c7
Sep 06 04:04:34 vch01 kernel:  update_process_times+0x87/0xc0
Sep 06 04:04:34 vch01 kernel:  tick_sched_timer+0xa5/0xf0
Sep 06 04:04:34 vch01 kernel:  ? tick_nohz_handler+0xf0/0xf0
Sep 06 04:04:34 vch01 kernel:  __hrtimer_run_queues+0x10b/0x1b0
Sep 06 04:04:34 vch01 kernel:  hrtimer_interrupt+0x10b/0x2c0
Sep 06 04:04:34 vch01 kernel:  __sysvec_apic_timer_interrupt+0x47/0x60
Sep 06 04:04:34 vch01 kernel:  asm_call_irq_on_stack+0xf/0x20
Sep 06 04:04:34 vch01 kernel:  </IRQ>
Sep 06 04:04:34 vch01 kernel:  sysvec_apic_timer_interrupt+0x6a/0x80
Sep 06 04:04:34 vch01 kernel:  asm_sysvec_apic_timer_interrupt+0x12/0x20
Sep 06 04:04:34 vch01 kernel: RIP: 0010:generic_bin_search.constprop.0+0xe7/0x110
Sep 06 04:04:34 vch01 kernel: Code: 5c 41 5d 41 5e 41 5f c3 41 89 dc 45 39 e5 0f 8c 61 ff ff ff 48 8b 04 24 44 89 28 48 83 c4 20 b8 01 00 00 00 5b 5d 41 5c 41 5d <41> 5e 41 5f c3 44 8d 6b 01 eb d6 48 8d 74 24 0f b9 11 00 00 00 48
Sep 06 04:04:34 vch01 kernel: RSP: 0018:ffffa33741bb3b38 EFLAGS: 00000286
Sep 06 04:04:34 vch01 kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000e24
Sep 06 04:04:34 vch01 kernel: RDX: 0000000000000003 RSI: 00000000008978da RDI: 00000003d9e22000
Sep 06 04:04:34 vch01 kernel: RBP: 0000000000000001 R08: ffffa33741bb3b94 R09: 20006c0000000000
Sep 06 04:04:34 vch01 kernel: R10: 00000000000fad3e R11: ad3e0000003cf2a6 R12: ffff8887da16a620
Sep 06 04:04:34 vch01 kernel: R13: ffffa33741bb3bf7 R14: ffffa33741bb3bf7 R15: 0000000000000021
Sep 06 04:04:34 vch01 kernel:  btrfs_search_old_slot+0x371/0x860
Sep 06 04:04:34 vch01 kernel:  btrfs_next_old_leaf+0xaf/0x3e0
Sep 06 04:04:34 vch01 kernel:  find_parent_nodes+0x182e/0x1b20
Sep 06 04:04:34 vch01 kernel:  btrfs_find_all_leafs+0x5a/0xa0
Sep 06 04:04:34 vch01 kernel:  iterate_extent_inodes+0x9f/0x210
Sep 06 04:04:34 vch01 kernel:  ? btrfs_inode_flags_to_xflags+0x50/0x50
Sep 06 04:04:34 vch01 kernel:  ? btrfs_get_64+0x5a/0x100
Sep 06 04:04:34 vch01 kernel:  iterate_inodes_from_logical+0x89/0xb0
Sep 06 04:04:34 vch01 kernel:  ? btrfs_inode_flags_to_xflags+0x50/0x50
Sep 06 04:04:34 vch01 kernel:  btrfs_ioctl_logical_to_ino+0xaf/0x180
Sep 06 04:04:34 vch01 kernel:  __x64_sys_ioctl+0x7b/0xb0
Sep 06 04:04:34 vch01 kernel:  do_syscall_64+0x2d/0x40
Sep 06 04:04:34 vch01 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 06 04:04:34 vch01 kernel: RIP: 0033:0x7fdc0a481347
Sep 06 04:04:34 vch01 kernel: Code: 89 d8 48 f7 d8 49 39 c4 72 b9 e8 24 ff ff ff 85 c0 78 be 5b 4c 89 e0 5d 41 5c c3 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f1 4a 0c 00 f7 d8 64 89 01 48
Sep 06 04:04:34 vch01 kernel: RSP: 002b:00007fdc09386968 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 06 04:04:34 vch01 kernel: RAX: ffffffffffffffda RBX: 000055d18044a650 RCX: 00007fdc0a481347
Sep 06 04:04:34 vch01 kernel: RDX: 00007fdc09386d48 RSI: 00000000c038943b RDI: 0000000000000003
Sep 06 04:04:34 vch01 kernel: RBP: 00007fdc09386d40 R08: 00007fdba04d5610 R09: 00007fdc09386e08
Sep 06 04:04:34 vch01 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
Sep 06 04:04:34 vch01 kernel: R13: 00007fdc09386d48 R14: 00007fdc09386c60 R15: 00007fdc09386b98
Sep 06 04:04:35 vch01 kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 2-... } 2150 jiffies s: 744661 root: 0x4/.
Sep 06 04:04:35 vch01 kernel: rcu: blocking rcu_node structures:
Sep 06 04:04:35 vch01 kernel: Task dump for CPU 2:
Sep 06 04:04:35 vch01 kernel: task:crawl_340       state:R  running task     stack:    0 pid:212331 ppid:     1 flags:0x00004008
Sep 06 04:04:35 vch01 kernel: Call Trace:
Sep 06 04:04:35 vch01 kernel:  ? read_extent_buffer+0x84/0xa0
Sep 06 04:04:35 vch01 kernel:  ? __alloc_pages_nodemask+0x126/0x260
Sep 06 04:04:35 vch01 kernel:  ? btrfs_set_64+0x5c/0x100
Sep 06 04:04:35 vch01 kernel:  ? write_extent_buffer+0x86/0xc0
Sep 06 04:04:35 vch01 kernel:  ? __tree_mod_log_rewind+0xd9/0x1f0
Sep 06 04:04:35 vch01 kernel:  ? btrfs_search_old_slot+0x2d4/0x860
Sep 06 04:04:35 vch01 kernel:  ? btrfs_next_old_leaf+0xaf/0x3e0
Sep 06 04:04:35 vch01 kernel:  ? find_parent_nodes+0x182e/0x1b20
Sep 06 04:04:35 vch01 kernel:  ? btrfs_find_all_leafs+0x5a/0xa0
Sep 06 04:04:35 vch01 kernel:  ? iterate_extent_inodes+0x9f/0x210
Sep 06 04:04:35 vch01 kernel:  ? btrfs_inode_flags_to_xflags+0x50/0x50
Sep 06 04:04:35 vch01 kernel:  ? btrfs_get_64+0x5a/0x100
Sep 06 04:04:35 vch01 kernel:  ? iterate_inodes_from_logical+0x89/0xb0
Sep 06 04:04:35 vch01 kernel:  ? btrfs_inode_flags_to_xflags+0x50/0x50
Sep 06 04:04:35 vch01 kernel:  ? btrfs_ioctl_logical_to_ino+0xaf/0x180
Sep 06 04:04:35 vch01 kernel:  ? __x64_sys_ioctl+0x7b/0xb0
Sep 06 04:04:35 vch01 kernel:  ? do_syscall_64+0x2d/0x40
Sep 06 04:04:35 vch01 kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

If you need more information, I can send the full kernel logs pm. This was just the first incident of many.

Zygo commented 2 years ago

It doesn't look stuck, but it might be looping in LOGICAL_INO long enough to mess with RCU (the kernel thread will more or less stop dead).

Does bees report detecting a toxic extent at the same time?

kakra commented 2 years ago

Does bees report detecting a toxic extent at the same time?

Yes, and it already did that since hours before. If you'd need a log of bees from right before the crash, I could send it to you. But I don't want to disclose log information from the server here.

Zygo commented 2 years ago

It might be a normal LOGICAL_INO long call time then.

kakra commented 2 years ago

Well, it stalled the whole server. Even ssh into the system did not work although none of the main system is on btrfs but xfs instead, nothing even touches btrfs on login, all components using btrfs are running inside containers. Sending ctrl+alt+del to the console also didn't work, we had to hard-reboot the system. I took the chance a few days ago updating the host system to latest packages and latest 5.10 LTS.

Zygo commented 2 years ago

That seems a little extreme for LOGICAL_INO, especially on post-5.7 kernels. Also, when I hit the LOGICAL_INO slow cases I don't get complaints from RCU.

Maybe it's a separate RCU bug?

daugustin commented 2 years ago

Got the same thing (I assume) on 5.14.17-gentoo

Last thing systemd logged before freezing

Bees was running in terminal, and yes, it seemed opening a chrome tab was the last thing the system did.

Nov 30 21:48:18 powertux kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Nov 30 21:48:18 powertux kernel: rcu:         0-....: (20998 ticks this GP) idle=406/1/0x4000000000000000 softirq=4720614/4720614 fqs=5055 
Nov 30 21:48:18 powertux kernel:         (t=21001 jiffies g=10395189 q=475674)
Nov 30 21:48:18 powertux kernel: NMI backtrace for cpu 0
Nov 30 21:48:18 powertux kernel: CPU: 0 PID: 72955 Comm: chrome Tainted: P           O      5.14.17-gentoo #1
Nov 30 21:48:18 powertux kernel: Hardware name: ASUS System Product Name/TUF GAMING H570-PRO WIFI, BIOS 0811 04/06/2021
Nov 30 21:48:18 powertux kernel: Call Trace:
Nov 30 21:48:18 powertux kernel:  <IRQ>
Nov 30 21:48:18 powertux kernel:  dump_stack_lvl+0x34/0x44
Nov 30 21:48:18 powertux kernel:  nmi_cpu_backtrace.cold+0x32/0x69
Nov 30 21:48:18 powertux kernel:  ? lapic_can_unplug_cpu+0x70/0x70
Nov 30 21:48:18 powertux kernel:  nmi_trigger_cpumask_backtrace+0x7b/0x90
Nov 30 21:48:18 powertux kernel:  rcu_dump_cpu_stacks+0xb0/0xde
Nov 30 21:48:18 powertux kernel:  rcu_sched_clock_irq.cold+0xc4/0x1e6
Nov 30 21:48:18 powertux kernel:  update_process_times+0x88/0xc0
Nov 30 21:48:18 powertux kernel:  tick_sched_handle+0x2f/0x40
Nov 30 21:48:18 powertux kernel:  tick_sched_timer+0x75/0xa0
Nov 30 21:48:18 powertux kernel:  ? can_stop_idle_tick+0x80/0x80
Nov 30 21:48:18 powertux kernel:  __hrtimer_run_queues+0x11b/0x250
Nov 30 21:48:18 powertux kernel:  hrtimer_interrupt+0x10a/0x2b0
Nov 30 21:48:18 powertux kernel:  __sysvec_apic_timer_interrupt+0x54/0xd0
Nov 30 21:48:18 powertux kernel:  sysvec_apic_timer_interrupt+0x6d/0x90
Nov 30 21:48:18 powertux kernel:  </IRQ>
Nov 30 21:48:18 powertux kernel:  asm_sysvec_apic_timer_interrupt+0x12/0x20
Nov 30 21:48:18 powertux kernel: RIP: 0010:_nv028370rm+0x39/0x4e0 [nvidia]
Nov 30 21:48:18 powertux kernel: Code: 48 85 d2 74 2e 48 8b 4f 08 31 c0 48 85 c9 74 0d 48 63 41 14 48 89 d6 48 29 c6 48 89 f0 48 3b 57 18 48 89 07 74 1b 48 8b 42 08 <48> 89 47 10 b8 01 00 00 00 48 83 c4 08 c3 66 0f 1f 84 00 00 00 00
Nov 30 21:48:18 powertux kernel: RSP: 0018:ffffa3f282ff3ba8 EFLAGS: 00000287
Nov 30 21:48:18 powertux kernel: RAX: ffffe00cc64dddc8 RBX: ffff984b4d6d5830 RCX: ffff984ae8f48d80
Nov 30 21:48:18 powertux kernel: RDX: ffffe00cc665cb88 RSI: ffffe00cc665eb7c RDI: ffff984b1ef82d00
Nov 30 21:48:18 powertux kernel: RBP: ffff984b1ef82d00 R08: 0000000000000020 R09: ffff984b1ef82d08
Nov 30 21:48:18 powertux kernel: R10: ffff984b05b14008 R11: ffff98521bf27000 R12: ffff984ae0b1aa38
Nov 30 21:48:18 powertux kernel: R13: ffffe00ccbd3d5f4 R14: ffff984ae0b1aa38 R15: ffff9849c7917410
Nov 30 21:48:18 powertux kernel:  ? _nv034902rm+0xa8/0xe0 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv014538rm+0x31b/0x7f0 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv035204rm+0xac/0xe0 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv036727rm+0xb0/0x140 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv036726rm+0x30f/0x4f0 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv036721rm+0x60/0x70 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv036722rm+0x7b/0xb0 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv035112rm+0x40/0xe0 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? _nv000693rm+0x68/0x80 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? rm_cleanup_file_private+0xea/0x170 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? nvidia_dev_put+0xa9b/0xc50 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? nvidia_frontend_close+0x23/0x40 [nvidia]
Nov 30 21:48:18 powertux kernel:  ? __fput+0x84/0x230
Nov 30 21:48:18 powertux kernel:  ? task_work_run+0x54/0x90
Nov 30 21:48:18 powertux kernel:  ? do_exit+0x33a/0xa00
Nov 30 21:48:18 powertux kernel:  ? do_group_exit+0x2e/0x90
Nov 30 21:48:18 powertux kernel:  ? __x64_sys_exit_group+0xf/0x10
Nov 30 21:48:18 powertux kernel:  ? do_syscall_64+0x38/0x90
Nov 30 21:48:18 powertux kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
Nov 30 21:49:21 powertux kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Nov 30 21:49:21 powertux kernel: rcu:         0-....: (84001 ticks this GP) idle=406/1/0x4000000000000000 softirq=4720614/4720614 fqs=20332 
Nov 30 21:49:21 powertux kernel:         (t=84003 jiffies g=10395189 q=811405)
Nov 30 21:49:21 powertux kernel: NMI backtrace for cpu 0
Nov 30 21:49:21 powertux kernel: CPU: 0 PID: 72955 Comm: chrome Tainted: P           O      5.14.17-gentoo #1
Nov 30 21:49:21 powertux kernel: Hardware name: ASUS System Product Name/TUF GAMING H570-PRO WIFI, BIOS 0811 04/06/2021
Nov 30 21:49:21 powertux kernel: Call Trace:
Nov 30 21:49:21 powertux kernel:  <IRQ>
Nov 30 21:49:21 powertux kernel:  dump_stack_lvl+0x34/0x44
Nov 30 21:49:21 powertux kernel:  nmi_cpu_backtrace.cold+0x32/0x69
Zygo commented 2 years ago

I'm giving this the kernel bug label because it obviously is one. Keep the stack traces coming, maybe we can hit something a kernel dev can use. Alternatively, send complete blocked-task traces (e.g. by alt-sysrq-w) to the linux-btrfs mailing list directly.

kakra commented 2 years ago

In the last log: nvidia... Do you use multiple monitors?

daugustin commented 2 years ago

No. Single Monitor.