CachyOS / linux-cachyos

Arch Linux kernel based on different schedulers and some other performance improvements.
https://cachyos.org
GNU General Public License v3.0

Lag occurs when running a memory stress tester #102

Closed: road2react closed this issue 5 months ago

road2react commented 1 year ago

Here's a Rust program that continuously writes to memory:

use std::{hint::black_box, thread};

fn main() {
    thread::scope(|s| {
        // Spawn 16 writer threads, each owning its own 10 MB buffer.
        for i in 0..16 {
            s.spawn(move || {
                let mut v = vec![0u8; 10_000_000];
                loop {
                    // Overwrite the whole buffer; black_box keeps the
                    // compiler from optimizing the writes away.
                    v.fill(i);
                    black_box(&v);
                }
            });
        }
    })
}

Keeping this running seems to make the desktop very unresponsive (over 1 second when switching windows). Changing scheduler parameters (nice levels, SCHED_FIFO) doesn't seem to fix the problem. This also occurs on all other kernels I have tested (including vanilla, zen, tkg, and cachyos).
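
(For reference, the scheduler experiments were along these lines; ./memstress is a hypothetical name for the compiled binary:)

# run the stressor at the lowest nice priority:
nice -n 19 ./memstress
# or under the realtime SCHED_FIFO policy (needs root):
sudo chrt --fifo 1 ./memstress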

This is different from stress tests that load only the CPU: tools such as stress-ng, or CPU-intensive workloads such as compiling a large program, do not cause an effect of this magnitude. The memory stress tester is closer to memory-heavy workloads such as Stockfish (with a large Hash size).

If it's relevant, the hardware I'm using is an ASUS G513QY (AMD Ryzen 9 5900HX).

ptr1337 commented 1 year ago

Hi!

Thanks for the report. I have just tested the above stressor with the "BORE" (linux-cachyos-bore) and "EEVDF" (linux-cachyos) schedulers, and I did not see bad responsiveness. Using stress-ng --cpu-method loop -c 256 was a good deal heavier and did put the system under some strain, but it remained quite usable.

Which scheduler did you use in your testing? Only CFS?

@firelzrd Maybe you can bring your experience and knowledge in :)

firelzrd commented 1 year ago

Memory stress performance issues have to do with page fault (page reclamation) handling, so they cannot be relieved by the CPU scheduler. I had my best-ever experience under memory stress with the le9 patch (plus proper configuration of it), but it doesn't seem to work with MGLRU or Maple Tree, so it is unavailable on 6.x. le9 is on a totally different level; I loved it. https://github.com/hakavlad/le9-patch
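
(Side note: whether MGLRU is active on a given kernel can be checked via sysfs; this assumes a 6.1+ kernel built with CONFIG_LRU_GEN:)

# a nonzero bitmask (e.g. 0x0007) means MGLRU is enabled:
cat /sys/kernel/mm/lru_gen/enabled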

road2react commented 1 year ago

Using stress-ng --cpu-method loop -c 256 doesn't cause any lag for me; programs remain responsive.

Which scheduler did you use in your testing? Only CFS?

I tested linux-cachyos (EEVDF) and linux-cachyos-bore, and both lag under the memory stress tester. Changing nice parameters or using ananicy-cpp doesn't fix the problem either.

I had my best ever experience under memory stress with the le9 patch (plus proper configuration of it)

@firelzrd According to the readme, le9-patch is designed for near-OOM situations. However, the memory stress tester only uses a small amount of memory (10 MB per thread). Would le9-patch help with this behavior?

firelzrd commented 1 year ago

le9 only takes effect in near-OOM situations.

Interesting. I compiled your test code myself (rustc 1.69.0) and ran it on a self-compiled linux-6.3.0-cachyos-bore kernel for Ubuntu, on a Ryzen 7 4800U. CPU utilization instantly reaches 100%, and as you said, the test program doesn't fill all the available RAM but only seems to spin while filling a very small amount of memory. The only difference here is that I don't feel any lag or slowdown at all while the program is running. Both bore=0 and bore=3 are fine, and the system stays fluid. There may be some additional condition needed to reproduce the issue.

road2react commented 1 year ago

What happens if you give it more threads?

use std::{hint::black_box, thread};

fn main() {
    // Guard: panics immediately in debug builds, where the writes are
    // too slow to reproduce the issue.
    debug_assert!(false, "compile with --release");
    thread::scope(|s| {
        // 32 writer threads this time, each with its own 10 MB buffer.
        for i in 0..32 {
            s.spawn(move || {
                let mut v = vec![0u8; 10_000_000];
                loop {
                    v.fill(i);
                    black_box(&v);
                }
            });
        }
    })
}

Make sure to compile using --release
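
For example, either of these should work (assuming the file is main.rs, or a Cargo project for the second command):

# single file, optimizations on and debug assertions off:
rustc -C opt-level=3 -C debug-assertions=no main.rs && ./main
# or inside a Cargo project:
cargo run --release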

firelzrd commented 1 year ago

Okay, I got it lagging now with the compile options rustc -C opt-level=3 -C debug_assertions=no; 16 threads is enough to reproduce it.

It only happens when you use black_box(&v);, so I suppose it may be caused by the code issuing tides of small memory store instructions, probably saturating some malloc-like system call or the CPU's AGU internally?

road2react commented 1 year ago

Removing black_box(&v); causes the compiler to optimize away the memory writes, so it just becomes a simple loop.

https://rust.godbolt.org/z/dMorPWeef

The resulting assembly:

example::spin:
.LBB0_1:
        jmp     .LBB0_1

Putting back black_box(&v); makes it call memset.

road2react commented 1 year ago

the CPU's AGU internally

@firelzrd do you know of any way I can measure how heavily resources like that are being used?
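
For instance, would hardware performance counters show at least the cache side of it? A sketch of what I have in mind (assuming perf is installed and the CPU exposes these generic events):

# count cache-related hardware events system-wide for 10 s while the stressor runs:
sudo perf stat -a -e cycles,instructions,cache-references,cache-misses sleep 10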

firelzrd commented 1 year ago

I don't have an idea right now; I'll have to look into it. We need to find out why a flood of memset calls harms system responsiveness so badly. But since it happens on such a macroscopic timescale, it's probably a software issue, not hardware.

road2react commented 1 year ago

I ran sudo perf top to get some more information.

[screenshot: perf top output showing overhead in libc]

When the memory stress tester is running, there's significant overhead in libc, most likely from memset. Maybe this causes the scheduler to misjudge the workload.
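
A sketch of a follow-up that could attribute the overhead more precisely than perf top (assuming perf with call-graph support):

# sample all CPUs with call graphs for 10 s while the stressor runs, then inspect:
sudo perf record -a -g -- sleep 10
sudo perf report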

But since it happens on such a macroscopic timescale, it's probably a software issue, not hardware.

Makes sense. I tested the same program on Windows, and although it causes some lag there, it's nowhere near as severe as on Linux.

@firelzrd

ptr1337 commented 1 year ago

Last time I also checked it with perf top; for me it was memset as well.

firelzrd commented 1 year ago

Great! Thank you for the detailed investigation. As you showed, obviously memset() is causing this. Now the question is HOW it is harming the responsiveness.

A. memset() takes so long that it cannot be interrupted by the scheduler's forced preemption, or
B. after the test program has pushed all the data out of the CPU's L1/L2/L3 caches, it takes other tasks a very long time to read it all back from memory every time.

If it's case A, it may be worth trying the -rt patch and seeing how it behaves. If it's case B, it's a difficult problem, but increasing the scheduling time slice might be a cure.
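
For the case-B hypothesis, a quick experiment could look like this (a sketch; paths assume a pre-EEVDF CFS kernel built with CONFIG_SCHED_DEBUG, where these tunables live in debugfs):

# inspect the current minimum slice:
sudo cat /sys/kernel/debug/sched/min_granularity_ns
# try a larger value, e.g. 10 ms:
echo 10000000 | sudo tee /sys/kernel/debug/sched/min_granularity_ns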

road2react commented 1 year ago

I tried the -rt patch (using Arch linux-rt) and it doesn't fix the problem.

firelzrd commented 1 year ago

Okay, then I suppose the glibc memset implementation (or such built-in functions in general) itself has an interactivity problem. As expected, a CPU cache pollution stress test such as stress-ng -C didn't harm responsiveness nearly as much as your program does. Issuing tons of memset calls like this may be a very rare case, but I dislike seeing this type of responsiveness problem, and whenever I see one, I want to get rid of it. Unfortunately, this doesn't seem to be a problem specific to the CachyOS kernel, so let's perhaps continue the discussion outside of here. I highly appreciate your interest in this issue.