deinstapel / cpupower

Manage the frequency scaling driver of your CPU (Intel Core and AMD Ryzen processors supported)
GNU General Public License v3.0
534 stars 71 forks source link

cpufreqctl: avoid boosting when measuring pstate frequency #202

Open NicolasDerumigny opened 2 years ago

NicolasDerumigny commented 2 years ago

Insert a sleep to avoid artificial high CPU frequency (boosting) when using cpufreqctl info current on a i7-1065G7 (Lenovo Yoga C940)

NicolasDerumigny commented 2 years ago

Rebased onto master, also testes functional on the latest version of Arch (GNOME 43).

fin-ger commented 2 years ago

So, the main problem is the API with which the extension is interfacing with the cpufreqctl tool:

# cpufreqctl --format json info current 
{
  "min": 2095908,
  "max": 4192231,
  "avg": 2916799,
  "rnd": 4192206
}

On my AMD Ryzen this creates no problems, but does create problems when

  1. Your CPU has a lot of cores
  2. Some of your cores are sleeping and get waked by this call as this always iterates over all cores
  3. The kernel implementation for (mostly very new CPUs) is still rough leading to a couple of seconds where the system looks frozen (as e.g. seen when Ryzen 5000 launched and Intel 12000)

This API has 1 positive thing and 1 negative thing. The positive thing is that it provides 4 different ways to gather the CPU frequency with the rnd method choosing a random core and not waking all the cores for getting "a" current frequency. The downside is (you might have already guessed) every method is called every time a frequency is fetched. So regardless which frequency-gathering method you choose in the preferences of the extension, all methods are queried every time a frequency is fetched. This is bad.

So, the first thing would be to redesign the API to something like this:

# cpufreqctl --format json info current min
1863003
# cpufreqctl --format json info current max
4192190
# cpufreqctl --format json info current avg
2597208
# cpufreqctl --format json info current rnd
2104293

Thereby, all the frequency method can be queried individually. This is the first step. The next step would be to make min, max, avg and rnd aware of sleeping cores. I figure this one is really hard, also considering the whole thing with performance and efficiency cores in the Intel world. I do not know if this is even possible. But, it is crucial for this extension to not wake cores which are already sleeping for a stupid frequency display. This is where I think the main problem lies when people report this extension making battery life of their laptops worse.

So why was this API designed like this in the first place?

Because I wanted to be able to easily add new frequency gathering methods as e.g. the min and max methods have clear disadvantages for efficiency/performance core architectures as not all cores share the same frequency spectrum. But I think the approach described above would be flexible enough to allow other frequency gathering methods to be added quite easily.

Apart from changing the cpufreqctl interface, all uses of the old interface in the extension have to be adjusted to only gather the single selected method. I think it is straight forward to integrate the new interface, but maybe there are some gotchas.

I am unsure whether something like:

# cpufreqctl --format json info current list
[
  "min",
  "max",
  "avg",
  "rnd"
]

is necessary to query for the available gathering methods. Maybe hard-coding this into the extension is fine as all the methods require extension-specific UI elements like translations and explanations texts anyway.

So, I think the boosting and high CPU load are a side-effect of this utterly flawed implementation for gathering CPU frequencies. If adjusted to only query 1 of the active cores and not waking any core that is sleeping this problem would be solved for not only the pstate driver but for all drivers.

NicolasDerumigny commented 2 years ago

Thanks for your point of view!

I am unsure what you mean by "waking up sleeping core", as sleeping core in the sense of actual sleep state exists on Android, not x86. On Intel / AMD CPU, the "sleep state" is just a throttling of the frequency when the OS has nothing to execute on the CPU. Therefore, there is no way (as far as I know) to avoid this boosting, except (which is what I wackly propose in this PR) putting sleep all around your program to limit it load.

Optimizing for battery life / not boosting can consist in rewriting the cpufreqctl script in C so that it does not need all the overhead of being interpreted, and so that is scans only once the frequency for all cores, then uses the info to compute min/max/avg/rnd.

If the only call the extension is doing is cpufreqctl --format json info current and a few others (I saw some backends), I can rewrite it. Though, this will cause dependencies with libc and building hell for new releases, unless we build it statically (but I have no idea of the compliance of that with distro-packaging rules).

Edit: Typo

fin-ger commented 2 years ago

A rewrite in C is not possible as this would violate the Gnome Extension Review Guidelines.

Extensions MUST NOT include binary executables or libraries

We are already stretching the guidelines as we are not providing the cpufreqctl tool as a GJS script but as a Bash script:

Scripts must be written in GJS, unless absolutely necessary

Reviewing Python modules, HTML, and web JavaScript dependencies is out of scope for extensions.gnome.org. Unless required functionality is only available in another scripting language, scripts must be written in GJS.

Regarding your technical question:

I am unsure what you mean by "waking up sleeping core", as sleeping core in the sense of actual sleep state exists on Android, not x86.

Yes, that's ofc correct. But I thought this is generally available for ARM devices under Linux. Is this behavior Android specific? Nevertheless, for x86 this means not using a core unless it is absolutely necessary. The cpufreqctl script includes the cpufreq backend which is a generic driver, also for ARM CPUs.

On Intel / AMD GPU CPU (?), the "sleep state" is just a throttling of the frequency when the OS has nothing to execute on the CPU.

Yes, so I would like to avoid this tool from pushing the frequency of a "low-frequency", NOP'ing core by querying its frequency. I am not really sure how to implement this without waking (e.g. pushing the frequency) the core, but the rnd method is a first step where it will only potentially wake a single core.

But I'm open for any other suggestion solving this problem. However, I am not a big fan of the sleep approach as we are already using GJS timeouts for scheduling the frequency gathering and this is already a sleep. Just throwing in more sleeps will just slow down the extension and make it less responsive. I think we should avoid unnecessary frequency lookups in the first place.

Edit: add link to Gnome Extension Review Guidelines

NicolasDerumigny commented 2 years ago

I see the problem. I don't think probing the cores is the main behavior that lead to boosting. However, multiple calls to awk may be. Can we imagine relying on another back-end than cpufreqctl to read frequency, such as a (separately installed) binary, with a fallback on cpufrectl if the former is not present?

I will think about it, then come back if I have more ideas.

fin-ger commented 2 years ago

Hmm, I think our cpufreqctl which ships with the extension ensures great compatibility regardless of distro and distro-release. I really prefer the "one click and go" approach of this and would not want users to not be able to use this extension after installation because something on their host system must be installed first (also in the correct version, etc.). But yes, the path for gathering the CPU frequency must be as slim as possible. Things like this really need cleanup. But it should be possible by putting each method into its own function. I think awk then is not needed anymore.

NicolasDerumigny commented 2 years ago

Is a GJS implementation out of question (for frequency reading)? This would avoid both the need of a script and the calls to awk.

Another possibility is to send directly the readings in the JSON stdout return, the compute the aggregation method (min/max/avg/rnd) in the GJS extension, avoiding awk in the read_current code path?

fin-ger commented 2 years ago

Is a GJS implementation out of question (for frequency reading)? This would avoid both the need of a script and the calls to awk.

No, this is mostly for historic reasons as it started out as a small bash script and got bigger and bigger. The GJS requirement for scripts got added to the Guidelines after we already implemented the bash scripts, if I remember correctly.

If I remember correctly, reading and writing files has included a lot of boilerplate in GJS, whereas in bash it's just echo and cat for setting and reading config of the CPU. But I don't know if that is still the case. Feel free to investigate further into getting rid of bash (would also be a good thing from a security perspective :sweat_smile:)

Another possibility is to send directly the readings in the JSON stdout return, the compute the aggregation method (min/max/avg/rnd) in the GJS extension

This would still require some interface for reading the frequency of one core, for only performance cores, etc. I think it is simpler to make the cut for API at the min, max, avg, rnd border and not communicate all the "let's select a core for frequency reading" stuff. But if you come up with an elegant solution, I'm up for it. The main downside would be that you could not as easily replicate the readings displayed in the extension from the command line. I would say that's an inconvenience. However, I fear that a lot of details of the individual CPU backends would "bleed" through, so that the extension code would have to deal with pstate, cpufreq, etc.

On another note:

Can you write some test snippets where you infinitely read without a sleep from /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq and then from /sys/devices/system/cpu/cpufreq/policy{1,2}/scaling_cur_freq and then 4, then 8 etc. and see how this affects your performance? I am unsure whether fast reads for a single core are the problem or concurrent reads to multiple cores. I suspected the latter, but maybe I'm wrong.

NicolasDerumigny commented 2 years ago

A first simple fix is to run renice -n 19 0 >/dev/null at the very start of the program to decrease its priority. On my machine (Core i7-1065G7), idle frequency is supposed to turn around 1.2 GHz (which is correctly detected with the sleep method). With the mainline detection, the CPU boost over 2 GHz. With the renice one, the impact is limited to a 1.6-1.9 boost.

I analyzed with perf 1000 executions of cpufreq --format json info current, and found that most of the time is spent in:

However, looking at distribution of time across PID shows only 5 % of the time is actually spent in cpufreqctl, and... 45 % on all the awk calls. Then, 10 % roughly is spend in sort. I thing the overhead of the actual call of cmd utilities is the main bottleneck (which is why libc / ld account for most of the time). The only solution for that is to switch to GJS and limits call as much as possible. Do you agree?

For you experiments, I blocked my frequency to 1.4 GHz, no renice and:

I don't really know what to think about that, except that both options look usable.

EDIT: I am printing to /dev/null on those experiments to avoid shell I/O bottleneck

fin-ger commented 2 years ago

However, looking at distribution of time across PID shows only 5 % of the time is actually spent in cpufreqctl, and... 45 % on all the awk calls.

I think this is because awk is reading from a pipe where it has to wait for the frequency to be read by cat (and the kernel). I still don't think this issue is caused by a specific tool, but by the overall design of the API. For min, max, and avg it is necessary to read the frequency of all cores, but not for rnd. But yes, your results show that it might still be beneficial to do a "cooldown" with a short sleep before reading the frequencies. However, if we do this in the pstate backend, we should do this in all backends.

I don't really know what to think about that, except that both options look usable.

Damn it :D Looks like your CPU is not affected by freezing and/or system lock ups. Especially server CPUs like AMD Threadripper and Intel Xeon seem to have difficulty keeping the system responsive when excessively reading CPU frequencies. I expected the parallel approach to have a way higher CPU load and introduce stutter, but this does not seem to be the case. I originally opted for the parallel read approach as especially older AMD CPUs take ages to read a single CPU frequency (1-2 seconds) and a 16 core CPU would then take 30+ seconds for a sequential read of all the frequencies. So although your results clearly show that an API change is not really necessary (your performance numbers look fine), I still think that the API should be changed so that people having CPUs that cannot handle reading the frequencies of all cores can switch to the rnd method and only read the frequency of a single core. Also, this is good for people having higher power consumption due to this extension always polling the frequency of all cores (currently I tell them to completely disable the frequency display).

Do you agree?

You can try writing a proof-of-concept in GJS for just reading the current frequency with the 4 methods and see how the performance numbers change. For your use case this might be beneficial, for others I think an API change is more beneficial. (e.g. #184, #138, #52)

NicolasDerumigny commented 2 years ago

I rewrote it in pure bash (as I don't know JS well, so I'll see later for a pure GJS version) to remove the need of awk and sorton reading. Results:

I tested this (sequential version) on a i9-7980XE (18 cores/36 threads) locked at 1.2 GHz and:

So I cannot reproduce the hang. Maybe it's arch, maybe Intel. I would argue that the nice should be enough to avoid system hangs, but I cannot be sure (GNOME may still hangs while the extension is waiting for the frequency, right? Then the script call should be async with a timeout).

NicolasDerumigny commented 2 years ago

With the latest changes, perf reports 45% of the time in bash, 41 % in libc (I had a look, it is mainly malloc / "cfree") and 0.5 % in cat ; that look better in my opinion.

NicolasDerumigny commented 2 years ago

Let's got with one last try for today. Removing the array did not change the execution time on the 1065G7, but tripled it on the 7980XE (10,5 sec wall time). I'll test as it, then will head for the GJS implementation later.