firelzrd / bore-scheduler

BORE (Burst-Oriented Response Enhancer) CPU Scheduler
GNU General Public License v2.0

Incomplete suspend with 6.6.34 #41

Closed ghibo closed 2 months ago

ghibo commented 3 months ago

Hi. I wonder if anyone else is seeing a weird problem like this. Using the latest 6.6.34 (but it was also there on previous releases) with bore 5.1.0, system suspend does not complete: for instance, running 'systemctl suspend' suspends the system but not the fans, which keep spinning. The same problem does not happen with the same kernel built with BORE disabled (i.e. a kernel where the only difference in the setup is CONFIG_BORE [not set] in the .config file). This is weird and even difficult to track down.
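For reference, a quick way to confirm the two builds really differ only in that one option (the config paths here are hypothetical; adjust to wherever your kernels install their configs):

  # hypothetical paths; the only expected delta is the CONFIG_BORE line
  diff /boot/config-6.6.34 /boot/config-6.6.34-bore
  # < # CONFIG_BORE is not set
  # > CONFIG_BORE=y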

xalt7x commented 3 months ago

Does disabling BORE with sysctl help here? sudo sysctl -w kernel.sched_bore=0 (you can also auto-apply it on boot with an override in /etc/sysctl.d)
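For instance, a minimal sketch of that boot-time override (assuming a systemd-based distro that reads /etc/sysctl.d at boot; the file name is hypothetical):

  # /etc/sysctl.d/99-disable-bore.conf
  kernel.sched_bore = 0

  # apply immediately without rebooting:
  sudo sysctl --system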

ptr1337 commented 3 months ago

Does disabling BORE with sysctl help here? sudo sysctl -w kernel.sched_bore=0 (you can also auto-apply it on boot with an override in /etc/sysctl.d)

It cannot be disabled via sysctl anymore.

firelzrd commented 3 months ago

Thank you for the report. That sounds very strange, because BORE has nothing to do with ACPI or power management.

ghibo commented 3 months ago

Thank you for the report. That sounds very strange, because BORE has nothing to do with ACPI or power management.

Yep, that's weird, because ACPI/PM code is not involved in the BORE code, but maybe it could be a side-effect of some suspend process being scheduled differently internally, or skipped, or not completed in some way?

I also tried playing with the parameters kernel.sched_burst_cache_lifetime=2000000000, kernel.sched_burst_penalty_offset=0, kernel.sched_burst_penalty_scale=(0 or 4095), and kernel.sched_burst_smoothness_long=0, but none of that changed anything regarding this problem.
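For reproducibility, the combination above can be applied at runtime in one go (sysctl accepts multiple variable=value pairs; scale shown as 0 here, 4095 was tried the same way):

  sudo sysctl -w kernel.sched_burst_cache_lifetime=2000000000 \
              kernel.sched_burst_penalty_offset=0 \
              kernel.sched_burst_penalty_scale=0 \
              kernel.sched_burst_smoothness_long=0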

firelzrd commented 3 months ago

Yeah, that's possible. FYI, with sched_burst_penalty_scale=0 the scheduler works almost the same as plain EEVDF, except that tasks with >0 burst scores may still be delayed based on their already-set burst scores until they next get dequeued. After setting sched_burst_penalty_scale=0, the burst scores of dispatched tasks are reset to 0 at every opportunity from that moment on, rapidly leaving no prioritization to be done, especially if your test runs multiple times. So if you still observe the problem after that, it is strong evidence that it is not coming from the prioritization itself.
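Based on that explanation, a hedged sketch of how one might rule prioritization out (not an official test procedure): zero the scale, give running tasks a moment to be dequeued so their burst scores drain, then retry the suspend:

  sudo sysctl -w kernel.sched_burst_penalty_scale=0
  sleep 10           # let dispatched tasks cycle so already-set burst scores reset
  systemctl suspend  # if the hang still occurs, prioritization is likely not the cause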

I'm trying to reproduce your issue so that I can start analyzing it, but so far I have had no luck. Do you have any suggestions for how to reproduce it?

ghibo commented 3 months ago

I think I've found the reason for the side-effects. It seems that when using "systemctl suspend", the system somehow relies on the kernel log to operate, but sometimes, during suspend, systemd fails with errors like "systemd-journald[...]: /dev/kmsg buffer overrun, some messages lost.". Apparently this happens only with BORE, because BORE generates more kernel log messages than standard EEVDF and thus triggers those errors, while plain EEVDF, which seems to have fewer logging calls in its code, doesn't. Increasing the log buffer to a higher value, e.g. by adding "log_buf_len=4M" to the boot command line, or by changing the default value of CONFIG_LOG_BUF_SHIFT to 22 in the kernel's .config, lets BORE work correctly with "systemctl suspend".
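For anyone wanting to try the same workaround, a sketch of both options (GRUB assumed here; adjust for your bootloader, and note that 2^22 bytes = 4 MiB):

  # option 1: boot-time parameter, e.g. in /etc/default/grub:
  #   GRUB_CMDLINE_LINUX_DEFAULT="... log_buf_len=4M"
  # then regenerate the GRUB config and reboot

  # option 2: build-time default in the kernel .config:
  #   CONFIG_LOG_BUF_SHIFT=22

  # after resume, check whether journald still reports overruns:
  journalctl -b | grep -i 'buffer overrun'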

firelzrd commented 3 months ago

Great! Thank you for the testing. So you mean the BORE kernel somehow generates more log messages than EEVDF kernels upon suspend events, that this passively causes buffer overruns, and that this is why your suspends have been failing. I don't know exactly how BORE would generate more messages, but that's a good hint for me to start investigating what's really going on with BORE. I'll be back with more updates later. And I appreciate your kind cooperation, ghibo!

firelzrd commented 3 months ago

Unfortunately, so far I haven't been able to see any difference in the generated log message count between EEVDF and EEVDF-BORE when going into the suspend state.

suspend-in-eevdf.log suspend-in-bore.log

Could there be something specific about your particular setup...?

ghibo commented 3 months ago

What I noticed is that it tends to occur on laptops with CPUs with fewer cores (e.g. 2-4) rather than on desktops with CPUs with more cores (e.g. 8-16). A hypothesis could be that with more cores you get a larger ring buffer log size by default, because the buffer is also sized according to the parameter CONFIG_LOG_CPU_MAX_BUF_SHIFT, which is applied on a per-core basis. A tip could be to "shrink" the log buffer size at boot instead of enlarging it, to trigger the problem more easily, e.g. by passing something like log_buf_len=64k or log_buf_len=8k on the boot command line and seeing what happens (see the sketch below).
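To test that hypothesis, a quick sketch (the kernel prints its effective log buffer size at boot when it enlarges or overrides it, so dmesg may show it):

  nproc                          # core count of the machine under test
  dmesg | grep -i 'log_buf_len'  # effective ring buffer size, if printed
  # then boot once with a deliberately small buffer on the kernel command
  # line and see whether the incomplete suspend reproduces:
  #   log_buf_len=64k   (or even log_buf_len=8k)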