Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

bees stuck with 100% CPU usage, system needs sysrq+reisub #215

Open kakra opened 2 years ago

kakra commented 2 years ago

Last night we encountered the following problem and found bees stuck, using 100% of one core:

Feb 17 03:29:48 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Feb 17 03:29:48 vch01 kernel: rcu:         6-....: (2100 ticks this GP) idle=8e5/1/0x4000000000000000 softirq=32979221/32979221 fqs=898
Feb 17 03:29:48 vch01 kernel:         (t=2102 jiffies g=91829349 q=5073689)
Feb 17 03:29:48 vch01 kernel: NMI backtrace for cpu 6
Feb 17 03:29:48 vch01 kernel: CPU: 6 PID: 3265668 Comm: crawl_340 Not tainted 5.15.11-gentoo #1
Feb 17 03:29:48 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Feb 17 03:29:48 vch01 kernel: Call Trace:
Feb 17 03:29:48 vch01 kernel:  <IRQ>
Feb 17 03:29:48 vch01 kernel:  dump_stack_lvl+0x34/0x44
Feb 17 03:29:48 vch01 kernel:  nmi_cpu_backtrace.cold+0x30/0x70
Feb 17 03:29:48 vch01 kernel:  ? lapic_can_unplug_cpu+0x80/0x80
Feb 17 03:29:48 vch01 kernel:  nmi_trigger_cpumask_backtrace+0x7c/0x90
Feb 17 03:29:48 vch01 kernel:  rcu_dump_cpu_stacks+0xb7/0xe5
Feb 17 03:29:48 vch01 kernel:  rcu_sched_clock_irq.cold+0x281/0x442
Feb 17 03:29:48 vch01 kernel:  ? cgroup_rstat_updated+0x39/0xc0
Feb 17 03:29:48 vch01 kernel:  update_process_times+0x87/0xc0
Feb 17 03:29:48 vch01 kernel:  tick_sched_timer+0xa3/0xd0
Feb 17 03:29:48 vch01 kernel:  ? tick_nohz_handler+0xe0/0xe0
Feb 17 03:29:48 vch01 kernel:  __hrtimer_run_queues+0x10b/0x1b0
Feb 17 03:29:48 vch01 kernel:  hrtimer_interrupt+0x109/0x230
Feb 17 03:29:48 vch01 kernel:  __sysvec_apic_timer_interrupt+0x47/0x60
Feb 17 03:29:48 vch01 kernel:  sysvec_apic_timer_interrupt+0x65/0x90
Feb 17 03:29:48 vch01 kernel:  </IRQ>
Feb 17 03:29:48 vch01 kernel:  <TASK>
Feb 17 03:29:48 vch01 kernel:  asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 17 03:29:48 vch01 kernel: RIP: 0010:find_parent_nodes+0x1836/0x1b30
Feb 17 03:29:48 vch01 kernel: Code: 48 89 df e8 7c 5c fb ff ba 15 00 00 00 48 89 df 41 89 c2 4d 8d 72 65 4c 89 f6 e8 05 61 fb ff ba 25 00 00 00 4c 89 f6 48 89 df <48> 89 44 24 48 e8 f0 60 fb ff 48 89 c2 48 8b 44 24 48 48 39 44 24
Feb 17 03:29:48 vch01 kernel: RSP: 0018:ffffad3b83163c20 EFLAGS: 00000282
Feb 17 03:29:48 vch01 kernel: RAX: 0000003f17d05000 RBX: ffff9c5b8cf1cf00 RCX: 0000000000000008
Feb 17 03:29:48 vch01 kernel: RDX: 0000000000000025 RSI: 0000000000003822 RDI: ffff9c5b8cf1cf00
Feb 17 03:29:48 vch01 kernel: RBP: ffff9c5bcfeec380 R08: 00000000000000c1 R09: 0000000003cf23dc
Feb 17 03:29:48 vch01 kernel: R10: 0000000000003822 R11: 0000000000000000 R12: ffff9c5b1d5bac60
Feb 17 03:29:48 vch01 kernel: R13: ffff9c6068acd268 R14: 0000000000003822 R15: ffff9c5ad6408800
Feb 17 03:29:48 vch01 kernel:  btrfs_find_all_leafs+0x5a/0xa0
Feb 17 03:29:48 vch01 kernel:  iterate_extent_inodes+0xac/0x250
Feb 17 03:29:48 vch01 kernel:  ? btrfs_flush_workqueue+0x30/0x30
Feb 17 03:29:48 vch01 kernel:  ? extent_from_logical+0x123/0x170
Feb 17 03:29:48 vch01 kernel:  ? iterate_inodes_from_logical+0x89/0xb0
Feb 17 03:29:48 vch01 kernel:  iterate_inodes_from_logical+0x89/0xb0
Feb 17 03:29:48 vch01 kernel:  ? btrfs_flush_workqueue+0x30/0x30
Feb 17 03:29:48 vch01 kernel:  btrfs_ioctl_logical_to_ino+0xab/0x170
Feb 17 03:29:48 vch01 kernel:  __x64_sys_ioctl+0x7a/0xb0
Feb 17 03:29:48 vch01 kernel:  do_syscall_64+0x35/0x80
Feb 17 03:29:48 vch01 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Feb 17 03:29:48 vch01 kernel: RIP: 0033:0x7f69d8b5f857
Feb 17 03:29:48 vch01 kernel: Code: 3c 1c e8 2c ff ff ff 85 c0 79 97 49 c7 c4 ff ff ff ff 5b 4c 89 e0 5d 41 5c c3 66 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 45 0c 00 f7 d8 64 89 01 48
Feb 17 03:29:48 vch01 kernel: RSP: 002b:00007f69d3a5a5d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Feb 17 03:29:48 vch01 kernel: RAX: ffffffffffffffda RBX: 0000560014a44a70 RCX: 00007f69d8b5f857
Feb 17 03:29:48 vch01 kernel: RDX: 00007f69d3a5a9b8 RSI: 00000000c038943b RDI: 0000000000000003
Feb 17 03:29:48 vch01 kernel: RBP: 00007f69d3a5a9b0 R08: 0000000000000000 R09: 00007f69d8f9ff00
Feb 17 03:29:48 vch01 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
Feb 17 03:29:48 vch01 kernel: R13: 00007f69d3a5a9b8 R14: 00007f69d3a5a9b0 R15: 00007f69d3a5ac98
Feb 17 03:29:48 vch01 kernel:  </TASK>

Tasks could not be killed, so I used s,u,b on /proc/sysrq-trigger to remotely reboot the system.
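
For reference, the s,u,b sequence corresponds roughly to the following (assuming the magic sysrq key is enabled, e.g. kernel.sysrq=1, and run as root):

echo s > /proc/sysrq-trigger   # emergency sync of all mounted filesystems
echo u > /proc/sysrq-trigger   # remount all filesystems read-only
echo b > /proc/sysrq-trigger   # immediate reboot, no clean shutdown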

The system is now running kernel 5.15.23; the error occurred on 5.15.11.

Do you know if this particular problem is fixed in the kernel? Otherwise, I'll leave this here for reference until we encounter it again.

kakra commented 2 years ago

It happened again today with kernel 5.15.23 but this time we could not capture a full backtrace:

Mär 23 07:32:34 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mär 23 07:32:34 vch01 kernel: rcu:         9-....: (27316 ticks this GP) idle=48d/1/0x4000000000000000 softirq=131340407/131340407 fqs=11357
Mär 23 07:32:34 vch01 kernel:         (t=27317 jiffies g=368663889 q=20046816)
Mär 23 07:32:34 vch01 kernel: NMI backtrace for cpu 9
Mär 23 07:32:34 vch01 kernel: CPU: 9 PID: 2237171 Comm: crawl_340 Not tainted 5.15.23-gentoo #1
Mär 23 07:32:34 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Mär 23 07:32:34 vch01 kernel: Call Trace:
Mär 23 07:32:34 vch01 kernel:  <IRQ>
[log ends here]
Zygo commented 2 years ago

The stack traces don't always appear. I have to run things like

 while :; do cat /proc/*/task/2237171/stack; done

to eventually get traces. It looks like it's looping almost all the way back to userspace, but instead of returning, it goes back in for another loop.
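
A slightly expanded sketch of the same sampling idea (the TID is just an example; point it at the stuck crawl_* thread, and run as root since reading kernel stacks requires it):

TID=2237171                       # example: thread ID of the stuck crawl thread
while :; do
    date                          # timestamp each sample
    cat /proc/*/task/"$TID"/stack # dump the thread's current kernel stack
    sleep 0.1
done | tee bees-stall-stacks.log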

kakra commented 2 years ago

I won't be able to get that because the system had been rebooted before I knew about the problem. But a reliable reproducer seems to be to copy 100+ GB of new data to the server; after a few hours of bees crunching through that, it will eventually hit an RCU stall. Besides that, the server carries a light to medium web server load (PHP applications). The problem has happened both times during the backup window (borg backup), and snapper is taking hourly snapshots with a retention policy (keeping around 35 snapshots):

# limits for timeline cleanup
TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="11"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="5"
TIMELINE_LIMIT_MONTHLY="3"
TIMELINE_LIMIT_YEARLY="0"

(this is mainly to prevent accidental deletes by our web developers; they can easily recover files from snapshots, and it only snapshots the mostly static web site storage)

I'm queuing an update to kernel 5.15.26 now.

Zygo commented 2 years ago

It doesn't seem to be a new problem--I'm able to reproduce it on 5.9 and later kernels if I increase the worker thread count to 30 or so.

Reducing the worker thread count to 1 seems to avoid the problem (or at least dramatically reduce the incidence rate, since any other write could be triggering the same kernel bug).
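
A minimal sketch of that workaround (the mount path and the long-option spelling are assumptions; check bees --help and your own setup):

# run bees with a single worker thread against the mounted filesystem root
bees -c1 /mnt/btrfs-root
# assumed long form of the same option:
bees --thread-count 1 /mnt/btrfs-root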

Elkropac commented 2 years ago

Hi, is this somehow related to the issue I reported here? https://lore.kernel.org/linux-btrfs/c9f1640177563f545ef70eb6ec1560faa1bb1bd7.camel@bcom.cz/

If so, could it be mitigated by running 1 thread?

Zygo commented 2 years ago

bees -c1 (run only one worker thread) seems to be an effective workaround. I've applied it on some busy fileservers and they have been running uninterrupted for 3+ months.

It is likely the same issue. The specific symptoms are: if you run top in 'threads' mode (press shift+H), there should be only one thread, pegged at 99-100% kernel ('sys') CPU. The process cannot be terminated by SIGKILL, and the filesystem will be locked up, i.e. any write will hang.

Rarely, there may be 2-4 threads all running at 100% in the kernel, instead of just one. So far I have no reason to believe that this case is a different bug, but it's important to match the symptoms exactly in order to differentiate between distinct issues.
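
One quick way to check for these symptoms (assuming a single bees instance whose process name is simply bees; reading the kernel stacks requires root):

# per-thread CPU view; a stuck thread stays pegged at ~100%, mostly system time
top -H -p "$(pidof bees)"
# inspect the kernel stacks of all bees threads
cat /proc/"$(pidof bees)"/task/*/stack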

I've been searching backwards through kernel revisions. So far I have these test results:

Elkropac commented 2 years ago

Hi, thanks for your explanation and continued research.

I will try running beesd with one thread at one customer for testing purposes.

Elkropac commented 2 years ago

Hi, it seems to be working; it's been up for 4 days and still going.

It seems to be a little slow though; CPU usage is around 33% and disk read speed is around 10 MB/s. I know there can be 300-400 MB/s reads when the backup application does its job.

Is there a way to speed it up?

Zygo commented 2 years ago

Increasing the number of threads will make the current code run faster, but then of course you hit the kernel lockup bug faster too.

Elkropac commented 2 years ago

I think I will stick with slower, but not locked up ;)

I compiled the latest version; I can see periods of time when it runs at about 40 MB/s. Still, we have around 26 TB of (zstd-compressed) data on disk...

imsodin commented 1 year ago

Afaik the following recent doc updates indicate this -c 1 workaround to avoid the freezes isn't needed any more - can you confirm that?
https://github.com/Zygo/bees/commit/3d5ebe4d4094955b4eee767f415f27d81adbc4b7

I am just recovering (hopefully recovered) from some elusive hardware issues resulting in freezes, so I'll wait for a week or two of no freezes before trying out higher parallelism :)

Zygo commented 1 year ago

> Afaik the following recent doc updates indicate this -c 1 workaround to avoid the freezes isn't needed any more - can you confirm that?

Yes. Commit a2e1887c525c3c2ef3e8daeb787ccb21f255eff7 prevents bees from triggering the bug, regardless of the number of threads bees runs.

The kernel bug still exists, but -c1 and the new workaround in the code have the same (low) risk of triggering it.