Dr-Noob / cpufetch

Simple yet fancy CPU architecture fetching tool
GNU General Public License v2.0

Use `/proc/cpuinfo` for frequency measurement #266

Open Dr-Noob opened 2 months ago

Dr-Noob commented 2 months ago

On some systems, measuring the frequency is not possible because perf_event_open is unavailable (as in #260). A fallback that relies on /proc/cpuinfo would therefore improve the resiliency of the method.
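
A minimal sketch of what such a fallback could look like, assuming the x86-style `cpu MHz` field that Linux prints in /proc/cpuinfo (other architectures may not expose it at all, in which case the list simply comes back empty). The names `cpuinfo_mhz` and `read_cpuinfo_mhz` are made up for illustration and are not part of cpufetch:

```python
import re

# Matches lines such as "cpu MHz\t\t: 2400.000" (x86 Linux layout, assumed here).
_CPU_MHZ = re.compile(r"^cpu MHz\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE)


def cpuinfo_mhz(text):
    """Return the per-CPU 'cpu MHz' values found in /proc/cpuinfo-style text."""
    return [float(value) for value in _CPU_MHZ.findall(text)]


def read_cpuinfo_mhz(path="/proc/cpuinfo"):
    """Read the file and parse it; an unreadable file yields an empty list."""
    try:
        with open(path) as f:
            return cpuinfo_mhz(f.read())
    except OSError:
        return []
```

An empty result would tell the caller that this fallback is not usable on the current system either.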

ThomasKaiser commented 2 months ago

Relying on /proc/cpuinfo to determine cpufreqs can be misleading (at least) on some ARM SoCs since some vendors were/are notorious cheaters, e.g. Amlogic, Allwinner and Raspberry Pi Ltd.

Here is an example for a VIM2S relying on Amlogic S912: https://github.com/ThomasKaiser/sbc-bench/blob/master/results/1iJ7.txt#L51

With the boot BLOB Khadas got, the cpufreqs of the 'bigger' cluster will be faked as 1.5 GHz while it's 1.4 GHz in reality. Android TV boxes relying on this SoC will often fake 2 GHz, and with the S905W, for example, it's even funnier since this SoC fakes 2.0 GHz too but only runs at 1.2 GHz in reality.

Amlogic/Allwinner cheating was the main reason to integrate https://github.com/wtarreau/mhz in sbc-bench to spot such differences between advertised and real cpufreqs.
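
The comparison itself is simple. A sketch using the S912 and S905W figures quoted above; the 2% tolerance and the helper name `is_misreported` are arbitrary choices for illustration, not what sbc-bench actually does:

```python
def is_misreported(advertised_mhz, measured_mhz, tolerance=0.02):
    """True if the measured clock falls short of the advertised one by more than `tolerance`."""
    if advertised_mhz <= 0:
        raise ValueError("advertised frequency must be positive")
    shortfall = (advertised_mhz - measured_mhz) / advertised_mhz
    return shortfall > tolerance


# Advertised vs. real clocks mentioned in this thread (MHz).
examples = {
    "Amlogic S912": (1500, 1400),
    "Amlogic S905W": (2000, 1200),
}
for name, (advertised, measured) in examples.items():
    verdict = "misreported" if is_misreported(advertised, measured) else "ok"
    print(name, verdict)
```

The tolerance only exists to absorb measurement noise; a real tool would have to pick it based on how accurate its measured value is.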

Dr-Noob commented 2 months ago

This is really interesting! Thanks for sharing.

ThomasKaiser commented 2 months ago

> I wonder what is Cpufreq OPP in your sbc-bench?

Just walking through /sys/devices/system/cpu/cpufreq/policy?/scaling_available_frequencies to check each individually.
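
That walk can be sketched roughly as follows. This is only an illustration, not sbc-bench's actual code; `parse_opp_line`, `list_opps` and the `root` parameter are hypothetical names. The sysfs files hold space-separated values in kHz:

```python
import glob
import os


def parse_opp_line(line):
    """Turn '408000 600000 1008000' (kHz) into a sorted list of MHz values."""
    return sorted(int(khz) / 1000 for khz in line.split())


def list_opps(root="/sys/devices/system/cpu/cpufreq"):
    """Map each cpufreq policy directory name to its available frequencies in MHz."""
    opps = {}
    # "policy*" rather than "policy?" so that policy10 and above are covered too.
    pattern = os.path.join(root, "policy*", "scaling_available_frequencies")
    for path in sorted(glob.glob(pattern)):
        policy = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                opps[policy] = parse_opp_line(f.read())
        except OSError:
            continue  # file unreadable or policy vanished; skip it
    return opps
```

On a system without cpufreq support the result is just an empty dict.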

> rdtsc, which is x86 specific. What is your fallback for ARM then?

I have no fallback, am just building it (sbc-bench simply clones the repo followed by make and that's it). So far it's working fine on ARM, RISC-V and x86. BTW: @wtarreau is the one who spotted Allwinner, Amlogic and Rockchip cheating and actually suggested relying on mhz in sbc-bench :)

Dr-Noob commented 2 months ago

Okay, it does not use rdtsc for the max frequency calculation but something platform-agnostic instead, which is why it works on non-x86. I'll definitely schedule some time to investigate this further :+1:

Dr-Noob commented 2 months ago

Related to Apple SoCs (#230), where I basically hardcode the max frequency (bad), I wonder if an approach similar to mhz might also work? Or does it not work under macOS?

wtarreau commented 2 months ago

Mhz is not OS-specific. I've used it on AIX, *BSD, Linux, OS-X, Solaris etc. It only needs to find a relatively trustable clock source (i.e. the venerable gettimeofday() which every OS has) and that's all. For the operations, they're extremely simple, it just creates a long sequence of dependent single-cycle instructions that the CPU cannot optimize away so that it's effectively able to count the time it takes to perform N operations, hence N cycles. There's a compensation for the cond jump at the end by comparing two distinct loops, but overall it's quite accurate and variations are around 1-to-2 / 1000, which is not bad.
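
The arithmetic behind the two-loop compensation can be sketched as below. The constants 50 and 250 are an assumption inferred from the `us50`/`us250` field names that mhz prints, and `estimate_mhz` is a hypothetical helper, not a function from mhz itself. Subtracting the short run from the long one cancels the per-iteration loop overhead, leaving `count` iterations of 200 dependent single-cycle operations:

```python
def estimate_mhz(count, us_short, us_long, ops_short=50, ops_long=250):
    """Estimate the clock in MHz from two timed runs of `count` iterations each.

    us_short / us_long are the wall-clock durations in microseconds of the
    loops executing ops_short / ops_long dependent operations per iteration.
    """
    diff_us = us_long - us_short             # loop overhead cancels out here
    if diff_us <= 0:
        raise ValueError("long loop must take longer than the short loop")
    cycles = count * (ops_long - ops_short)  # one cycle per dependent op (assumed)
    return cycles / diff_us                  # cycles per microsecond == MHz
```

Fed with a result line posted later in this thread (count=1263118 us50=15599 us250=78000), this gives about 4048.39, which matches the cpu_MHz value mhz printed for that run.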

As @ThomasKaiser said, it served us at a time where there was a race to the biggest liars between CPU vendors. By then I was really fed up with not knowing what I was buying and wanted to make something trustable to assess the hardware and point the finger at the liars. I didn't have much time to work on that beyond a few basic tools, and when I saw Thomas come with an already fairly complete sbc-bench, I said it was exactly what I had in mind. I humbly think that together we managed to force a little bit of cleanup in this domain so that it's now more difficult to cheat without being noticed. BTW we haven't caught amlogic nor rockchip cheating anymore after the tools became popular enough to be run by reviewers to verify they were not losing their time ;-)

ThomasKaiser commented 2 months ago

tk@mac-tk ~ % git clone https://github.com/wtarreau/mhz
Cloning into 'mhz'...
remote: Enumerating objects: 51, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 51 (delta 18), reused 50 (delta 17), pack-reused 0 (from 0)
Receiving objects: 100% (51/51), 11.53 KiB | 2.31 MiB/s, done.
Resolving deltas: 100% (18/18), done.

tk@mac-tk ~ % cd mhz 

tk@mac-tk mhz % make
gcc -O3 -Wall -fomit-frame-pointer  -o mhz.o -c mhz.c
gcc  -o mhz mhz.o

tk@mac-tk mhz % ./mhz 
count=1263118 us50=15599 us250=78000 diff=62401 cpu_MHz=4048.390

Though I've no idea how to measure the efficiency cores, since something like taskset is missing in macOS and the scheduler is in charge of sending the task to a performance core.

But there would be a different approach: Firing up something in N threads (corresponding to number of cores) and then using powermetrics to get the cpufreqs:

tk@mac-tk ~ % sudo powermetrics -s cpu_power | grep " frequency:"
E-Cluster HW active frequency: 1255 MHz
CPU 0 frequency: 1265 MHz
CPU 1 frequency: 1276 MHz
CPU 2 frequency: 1287 MHz
CPU 3 frequency: 1315 MHz
P-Cluster HW active frequency: 0 MHz
CPU 4 frequency: 1866 MHz
CPU 5 frequency: 1673 MHz
CPU 6 frequency: 1697 MHz
CPU 7 frequency: 2020 MHz
^C

But there is another problem (not tested with M3 but with M1 and M2): once there are more than N cores busy the maximum cpufreq will be decreased automagically. At least true for the performance cores. E.g. 3200 MHz became 3000 MHz on the MacBook Air M1 back then when more than 2 cores were busy though this may depend on power capping and may differ on different models.

wtarreau commented 2 months ago

Yeah I remember about this difficulty or even impossibility to bind to specific cores on this OS, it's pretty annoying. They probably consider it as a feature to prevent the user from helping the scheduler make the right decisions... I've even found a question about this which was roughly replied to as "simple, you just don't need to do that, period".

ThomasKaiser commented 2 months ago

WRT powermetrics at least with most recent macOS it kinda works:

MacBook Pro M1:

E-Cluster HW active frequency: 2064 MHz
CPU 0 frequency: 2064 MHz
CPU 1 frequency: 2064 MHz
CPU 2 frequency: 2064 MHz
CPU 3 frequency: 2064 MHz
P-Cluster HW active frequency: 2980 MHz
CPU 4 frequency: 3204 MHz
CPU 5 frequency: 3204 MHz
CPU 6 frequency: 3204 MHz
CPU 7 frequency: 3204 MHz

MacBook Air M3:

E-Cluster HW active frequency: 2746 MHz
CPU 0 frequency: 2746 MHz
CPU 1 frequency: 2746 MHz
CPU 2 frequency: 2746 MHz
CPU 3 frequency: 2746 MHz
P-Cluster HW active frequency: 3636 MHz
CPU 4 frequency: 4056 MHz
CPU 5 frequency: 4056 MHz
CPU 6 frequency: 4056 MHz
CPU 7 frequency: 4056 MHz

The 'load generator' was a silly `for i in $(seq 1 8) ; do yes >/dev/null & done`. So while the clockspeeds of the P-cores are limited with multi-threaded loads (3636 MHz on the M3) the max cpufreqs can still be read.
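
Picking the per-cluster maximum out of such output could look roughly like this. The parser only assumes the line layout shown above (an `<X>-Cluster HW active frequency:` line followed by `CPU <n> frequency:` lines); `max_freq_per_cluster` is an illustrative name, and running powermetrics itself (which needs sudo) is left out:

```python
import re

_CLUSTER = re.compile(r"^(\S+)-Cluster HW active frequency:\s*(\d+) MHz")
_CPU = re.compile(r"^CPU (\d+) frequency:\s*(\d+) MHz")


def max_freq_per_cluster(text):
    """Return {cluster name: highest per-CPU frequency in MHz} from powermetrics text."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = _CLUSTER.match(line)
        if m:
            # A cluster header: the CPU lines that follow belong to it.
            current = m.group(1)
            result.setdefault(current, 0)
            continue
        m = _CPU.match(line)
        if m and current is not None:
            result[current] = max(result[current], int(m.group(2)))
    return result
```

If the text contains several samples, the function keeps the highest value seen per cluster across all of them.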

But no idea with which macOS version this started and not able to test on anything prior to 14.6.1/23G93 (we don't do 'patch management' here but instead patch everything always immediately).

Did also a parallel run of 8 mhz instances that were almost evenly distributed between P- and E-cores at runtime:

count=837844 us50=13131 us250=65750 diff=52619 cpu_MHz=3184.568
count=990858 us50=14901 us250=78252 diff=63351 cpu_MHz=3128.153
count=1024017 us50=15093 us250=80461 diff=65368 cpu_MHz=3133.083
count=1084166 us50=16565 us250=83723 diff=67158 cpu_MHz=3228.702
count=1153449 us50=17700 us250=85609 diff=67909 cpu_MHz=3397.043
count=1161896 us50=17837 us250=86627 diff=68790 cpu_MHz=3378.096
count=1173383 us50=17768 us250=84004 diff=66236 cpu_MHz=3543.037
count=1255138 us50=18984 us250=85701 diff=66717 cpu_MHz=3762.573

Dr-Noob commented 2 months ago

It's great to have you here @wtarreau! Very nice tool. I also tried doing something similar (and I integrated it into cpufetch here), but instead of the RAW operations you are using, I'm using nops. However, the biggest challenge I found is the number of cycles. In my experiments I found that the number of cycles cannot be predicted, e.g., in my nop_function I have a loop that does INSERT_ASM_1000_TIMES 4 x iters times, so one would expect this to take (around) 1000 x 4 x iters cycles, but this is not the case. I understand that the loop would make things a bit unpredictable but, in my case, assuming the number of cycles would yield quite a high error (much bigger than 1/1000). That's why I use perf_event_open to count the number of cycles, which makes this method more accurate, but way less portable. So I wonder what is different in your approach. Are the instructions you use the key to making it more reliable, or is it how you perform the loop?

Regarding the inability to set the thread affinity in macOS, I share your annoyance. I also needed this for cpufetch and I had to implement this: make a loop and constantly check the current core until the scheduler decides to move the process to the core I want. Extremely dirty, inefficient and unpleasant, but it works for my use case. Well, I don't know what is worse, this hack or macOS not giving developers the tools needed to do proper work.

Also, I think I still don't get what you guys mean when you say vendors (amlogic, rockchip) were lying about max frequency. Does it mean that they modified the kernel to report higher frequencies than the actual ones?

wtarreau commented 2 months ago

The difference is that modern CPUs use instruction fusion in the decoding stage and will merge most NOPs and even eliminate them. 35 years ago I was using NOP on the 8088 and it was OK (it even allowed distinguishing the 8088 from the 8086 by overwriting them, since one had a 4-byte prefetch queue while the other had 6). But NOPs are totally unusable nowadays. I'm not surprised by your random measurements. They could even depend on code alignment, depending on how instructions are fetched and merged together.

Regarding vendors cheating: yes, that's it, but not just that. The Rockchip kernel in the era of the RK3288 (kernels 3.10 and 4.4 IIRC) would indeed enforce a hard limitation in the cpufreq driver to silently ignore higher frequencies. Usually the limit was set to 1.608 GHz, but unscrupulous (or sometimes unsuspecting) board vendors would advertise 1.8 GHz (probably after verifying that it still worked once changing it in the device tree, without realizing that if it was stable, it was because it wasn't 1.8 GHz). For amlogic, it was worse: they didn't modify the driver; it looks like it was an MCU inside the SoC that was enforcing a hard limit. Hardkernel used to advertise and sell their Odroid-C2 as a quad-2.0 GHz CPU except that it was a quad-1.536 one. After this was disclosed, they apologized for not noticing it. Given how they annoyed amlogic to get different blobs to try to set higher, stable frequencies, I really think they were honest and didn't notice that scam by themselves. That was too much, really. When vendors cheat at the hardware level, you have to find other ways to let users figure out by themselves what they're buying. Nowadays the situation has significantly improved on this point!