Closed: road2react closed this issue 5 months ago
Hi!
Thanks for the report.
I have just tested the above stressor with the "BORE" (linux-cachyos-bore) and "EEVDF" (linux-cachyos) schedulers, and I did not experience bad responsiveness.
Using stress-ng --cpu-method loop -c 256 was quite a bit heavier and did put the system under some strain, but it remained quite usable.
Which scheduler did you use in your tests? Only CFS?
@firelzrd Maybe you can bring your experience and knowledge in :)
Memory stress performance issues have something to do with page fault (page reclamation) handling, so they cannot be relieved by the CPU scheduler. I had my best-ever experience under memory stress with the le9 patch (plus proper configuration of it), but it doesn't seem to work with MGLRU or Maple Tree, so it is unavailable in 6.x. le9 is on a totally different level; I loved it. https://github.com/hakavlad/le9-patch
Using stress-ng --cpu-method loop -c 256 doesn't cause any lag for me. Programs still work responsively.
Which scheduler did you use in your tests? Only CFS?
I tested using linux-cachyos (EEVDF) and linux-cachyos-bore, and both lag when running the memory stress tester. Changing nice parameters or using ananicy-cpp doesn't fix the problem either.
I had my best ever experience under memory stress with the le9 patch (plus proper configuration of it)
@firelzrd According to the readme, le9-patch is designed to work in near-OOM situations. However, the memory stress tester only uses a small amount of memory (10MB per thread). Would le9-patch work with this behavior?
le9 only has an effect in near-OOM situations.
Interesting. I compiled your test code myself (rustc 1.69.0) and ran it on a self-compiled linux-6.3.0-cachyos-bore kernel for Ubuntu, on a Ryzen 7 4800U. CPU utilization instantly reaches 100%, and as you said, the test program doesn't fill all the available RAM but only seems to spin while filling a very small amount of memory. The only difference here is that I don't feel any lag or slow-down at all while the program is running. Both bore=0 and bore=3 are fine, and the system stays fluid. There may be some condition needed to reproduce the issue.
What happens if you give it more threads?
use std::{hint::black_box, thread};

fn main() {
    // Panics in debug builds as a reminder to compile with --release.
    debug_assert!(false, "compile with --release");
    thread::scope(|s| {
        for i in 0..32 {
            s.spawn(move || {
                // Each thread repeatedly overwrites its own 10 MB buffer.
                let mut v = vec![0u8; 10_000_000];
                loop {
                    v.fill(i);
                    // Keeps the compiler from optimizing the writes away.
                    black_box(&v);
                }
            });
        }
    })
}
Make sure to compile using --release
Okay, I got it lagging now with the compile options rustc -C opt-level=3 -C debug_assertions=no, and 16 threads is enough to reproduce it.
It only happens when you use black_box(&v), so I suppose it may be caused by the unoptimized code issuing tides of small memory store instructions, probably saturating some malloc-like system call or the CPU's AGU internally?
Removing black_box(&v); causes the compiler to optimize away the memory writes, so it just becomes a simple loop.
https://rust.godbolt.org/z/dMorPWeef
The resulting assembly:
example::spin:
.LBB0_1:
jmp .LBB0_1
Putting back black_box(&v); makes it call memset.
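For reference, the function behind that output is roughly a sketch like the following (the exact snippet at the Godbolt link may differ slightly):

use std::hint::black_box;

// Without black_box the writes have no observable effect, so the optimizer
// deletes the fill (and here even the allocation), leaving only the bare
// `jmp` loop shown above.
pub fn spin(i: u8) {
    let mut v = vec![0u8; 10_000_000];
    loop {
        v.fill(i);
    }
}

// With black_box the buffer counts as observed, so each iteration keeps the
// fill, which lowers to a call to memset.
pub fn spin_black_box(i: u8) {
    let mut v = vec![0u8; 10_000_000];
    loop {
        v.fill(i);
        black_box(&v);
    }
}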
the CPU's AGU internally
@firelzrd do you know of any ways I can measure the level of use of different resources like that one?
I don't have an idea right now; I'll have to look into it. We're going to have to find out why a memset flood harms system responsiveness so badly. But since it happens on such a macroscopic timescale, it's probably a software issue, not hardware.
I ran sudo perf top to get some more information. When the memory stress tester is running, there's some overhead occurring in libc, which is most likely memset. Maybe this would cause the scheduler to misjudge the workload that is being run.
But since it happens on such a macroscopic timescale, it's probably a software issue, not hardware.
Makes sense. I tested the same program on Windows and although it causes some lag, it doesn't cause as much lag as what happens on Linux.
@firelzrd I also checked it with perf top last time. For me it was "memset" as well.
Great! Thank you for the detailed investigation. As you showed, obviously memset() is causing this. Now the question is HOW it is harming the responsiveness.
A. memset() takes so much time that it cannot be interrupted by the scheduler's forced preemption, or
B. After all the CPU L1/L2/L3 cache data is pushed out by the test program, it takes a very long time for other tasks to read it all back from memory every time.
If it's case A, maybe it's worth trying the -rt patch and seeing how it behaves. If it's case B, it's a difficult problem, but increasing the scheduling time slice might be a cure.
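One rough way to gauge case A would be to time how long a single fill of the 10 MB buffer takes; if one fill finishes well within a scheduling period, a single uninterruptible memset is less likely to be the whole explanation. A minimal sketch (not part of the original test program):

use std::{hint::black_box, time::Instant};

fn main() {
    let mut v = vec![0u8; 10_000_000];
    // Touch the buffer once so page faults don't skew the measurement.
    v.fill(0);

    let iterations: u32 = 1_000;
    let start = Instant::now();
    for i in 0..iterations {
        v.fill(i as u8);
        black_box(&v);
    }
    println!("average time per 10 MB fill: {:?}", start.elapsed() / iterations);
}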
I tried the -rt patch (using Arch linux-rt) and it doesn't fix the problem.
Okay, then I suppose that the glibc memset implementation itself (or those built-in functions in general) has some interactivity problem.
As expected, a CPU cache pollution stress test such as stress-ng -C didn't harm the responsiveness the way your program does.
Although issuing tons of memset calls like this may be a very rare case, I dislike seeing this type of responsiveness problem, and whenever I see one, I want to get rid of it.
But unfortunately, this one doesn't seem to be specifically a CachyOS kernel problem, so maybe let's discuss it elsewhere.
I highly appreciate your interest in this issue.
Here's a Rust program that continuously writes to memory:
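(The listing below is the same stress program reproduced earlier in this thread; the exact thread count in the original report may have differed.)

use std::{hint::black_box, thread};

fn main() {
    // Panics in debug builds as a reminder to compile with --release.
    debug_assert!(false, "compile with --release");
    thread::scope(|s| {
        for i in 0..32 {
            s.spawn(move || {
                // Each thread repeatedly overwrites its own 10 MB buffer.
                let mut v = vec![0u8; 10_000_000];
                loop {
                    v.fill(i);
                    // Keeps the compiler from optimizing the writes away.
                    black_box(&v);
                }
            });
        }
    })
}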
Keeping this running seems to make the desktop very unresponsive (over 1 second when switching windows). Changing scheduler parameters (nice levels, SCHED_FIFO) doesn't seem to fix the problem. This also occurs on all other kernels I have tested (including vanilla, zen, tkg, and cachyos).

This is different from other stress test programs that stress the CPU, since other stress testers (such as stress-ng) or CPU-intensive workloads (such as compiling a large program) do not cause a major effect like this one does. This memory stress tester is closer to workloads such as Stockfish (with a large Hash size) that use more memory.

If it's relevant, the hardware I'm using is an ASUS G513QY (AMD Ryzen 9 5900HX).