kakra opened this issue 2 years ago
It happened again today with kernel 5.15.23, but this time we could not capture a full backtrace:
Mar 23 07:32:34 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 23 07:32:34 vch01 kernel: rcu: 9-....: (27316 ticks this GP) idle=48d/1/0x4000000000000000 softirq=131340407/131340407 fqs=11357
Mar 23 07:32:34 vch01 kernel: (t=27317 jiffies g=368663889 q=20046816)
Mar 23 07:32:34 vch01 kernel: NMI backtrace for cpu 9
Mar 23 07:32:34 vch01 kernel: CPU: 9 PID: 2237171 Comm: crawl_340 Not tainted 5.15.23-gentoo #1
Mar 23 07:32:34 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Mar 23 07:32:34 vch01 kernel: Call Trace:
Mar 23 07:32:34 vch01 kernel: <IRQ>
[log ends here]
The stack traces don't always appear. I have to run things like
while :; do cat /proc/*/task/2237171/stack; done
to eventually get traces. It looks like the code loops almost all the way back to userspace, but instead of returning, it goes back in for another loop.
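A slightly more structured sampling loop makes it easier to correlate the traces with the stall timestamps in the log (a minimal sketch; the TID is the thread from the report above, and the one-second interval is arbitrary):

tid=2237171                                  # thread ID from the stall report
while :; do
    date                                     # timestamp each sample
    cat /proc/*/task/$tid/stack 2>/dev/null  # dump the thread's kernel stack
    sleep 1
done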
I won't be able to get that because the system had been rebooted before I knew about the problem. But a reliable reproducer seems to be to copy 100+ GB of new data to the server; after a few hours of bees crunching through that, it will eventually RCU-stall. Besides that, the server carries a light to medium web server load (PHP applications). The problem has happened both times during the backup window (borg backup), while snapper is taking hourly snapshots with a retention policy that keeps around 35 snapshots:
# limits for timeline cleanup
TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="11"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="5"
TIMELINE_LIMIT_MONTHLY="3"
TIMELINE_LIMIT_YEARLY="0"
(This is mainly to prevent accidental deletes by our web developers; they can easily recover files from snapshots. It only snapshots the mostly static web site storage.)
I'm queuing an update to kernel 5.15.26 now.
It doesn't seem to be a new problem; I'm able to reproduce it on 5.9 and later kernels if I increase the worker thread count to 30 or so.
Reducing the worker thread count to 1 seems to avoid the problem (or at least dramatically reduce the incidence rate, since any other write could be triggering the same kernel bug).
Hi, is this somehow related to the issue I reported here? https://lore.kernel.org/linux-btrfs/c9f1640177563f545ef70eb6ec1560faa1bb1bd7.camel@bcom.cz/
If so, could it be mitigated by running 1 thread?
bees -c1
(run only one worker thread) seems to be an effective workaround. I've applied it on some busy file servers and they have been running uninterrupted for 3+ months.
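For deployments using the beesd wrapper, the same setting can be pinned in the config file (a sketch based on scripts/beesd.conf.sample from the bees repository; the UUID value is a placeholder for your filesystem's UUID, and the OPTIONS variable is an assumption based on that sample):

# /etc/bees/<UUID>.conf
UUID=<filesystem-uuid>           # placeholder: your btrfs filesystem UUID
OPTIONS="--thread-count 1"       # long form of -c1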
It is likely the same issue. The specific symptoms are: if you run top in threads mode (press shift-H), there should be only one bees thread, pegged at 99-100% kernel ('sys') CPU. The process cannot be terminated by SIGKILL, and the filesystem will be locked up, i.e. any write will hang.
Rarely, there may be 2-4 threads all running at 100% in the kernel instead of just one. So far I have no reason to believe that this case is a different bug, but it's important to match the symptoms exactly in order to differentiate between distinct issues.
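A quick way to check for this state from a shell (a sketch; the field list and matching shown here are just one way to do it):

# list bees threads with state and CPU usage; the stuck thread stays in
# R state at ~100% CPU while spinning in the kernel
ps -eLo pid,tid,comm,stat,pcpu | grep bees
# part of the symptom is that SIGKILL has no effect:
kill -9 $(pidof bees)            # if bees is still running afterwards, it matches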
I've been searching backwards through kernel revisions. So far I have these test results:
Hi, thanks for your explanation and continued research.
I will try running beesd with one thread at one customer for testing purposes.
Hi, it seems to be working; it's been up for 4 days and still going.
It seems a little slow, though: CPU usage is around 33% and disk read speed is around 10 MB/s. I know there can be 300-400 MB/s reads when the backup application does its job.
Is there a way to speed it up?
Increasing the number of threads will make the current code run faster, but then of course you hit the kernel lockup bug faster too.
I think I will stick with slower but not locked up ;)
I compiled the latest version, and I can see periods when it runs at about 40 MB/s. Still, we have around 26 TB of (zstd-compressed) data on disk...
AFAIK the following recent doc updates indicate this -c 1 workaround to avoid the freezes isn't needed any more; can you confirm that?
https://github.com/Zygo/bees/commit/3d5ebe4d4094955b4eee767f415f27d81adbc4b7
I am just recovering (hopefully recovered) from some elusive hardware issues resulting in freezes, so I'll wait for a week or two of no freezes before trying out higher parallelism :)
AFAIK the following recent doc updates indicate this -c 1 workaround to avoid the freezes isn't needed any more; can you confirm that?
Yes. Commit a2e1887c525c3c2ef3e8daeb787ccb21f255eff7 prevents bees from triggering the bug, regardless of the number of threads bees runs.
The kernel bug still exists, but -c1 and the new workaround in the code have the same (low) risk of triggering it.
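If you build bees from source, one way to confirm your tree already contains that commit (a sketch using plain git; the checkout path is hypothetical):

cd ~/src/bees                    # hypothetical checkout location
git merge-base --is-ancestor a2e1887c525c3c2ef3e8daeb787ccb21f255eff7 HEAD \
    && echo "workaround present" || echo "workaround missing"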
Last night we encountered the following problem and found bees stuck using 100% of one core:
Tasks could not be killed, so I wrote s, u, b to /proc/sysrq-trigger to remotely reboot the system. It's now running with kernel 5.15.23; the error occurred on 5.15.11.
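For reference, that key sequence spelled out (s = sync, u = remount read-only, b = reboot; the sleeps are just breathing room, and magic SysRq must be enabled first):

echo 1 > /proc/sys/kernel/sysrq  # ensure SysRq is enabled
echo s > /proc/sysrq-trigger     # sync all filesystems
sleep 5
echo u > /proc/sysrq-trigger     # remount all filesystems read-only
sleep 5
echo b > /proc/sysrq-trigger     # immediate reboot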
Do you know if this particular problem is fixed in the kernel? Otherwise, I'll leave it for reference here until we encounter it again.