Closed ThomasKaiser closed 2 years ago
And while we're at it, let's benchmark some benchmarks. The focus here is the influence of DRAM clockspeed: how it affects memory bandwidth and latency in particular, and the scores currently used by sbc-bench + stockfish.
The values are as follows:
- DRAM: the DRAM clock in MHz, configured via the userspace DMC governor
- 7-zip multi: 7-ZIP MIPS generated with all cores (A76 at ~2360 MHz, A55 at 1840 MHz)
- 7-zip single: 7-ZIP MIPS done on an A76 at ~2360 MHz
- AES: from an A76 and always the same, since ARMv8 Crypto Extensions do the job and the score scales linearly with CPU clockspeed
- memcpy: score from an A76 reported by tinymembench
- memset: score from an A76 reported by tinymembench
- 4M ns: 'single random read' / 'dual random read' latency from an A76 with 4M block size reported by tinymembench
- 64M ns: 'single random read' / 'dual random read' latency from an A76 with 64M block size reported by tinymembench
- kH/s: cpuminer scores generated on all cores working in parallel
- stockfish: the 'Nodes per second' score generated on all cores with stockfish bench 128 8 24 default depth
DRAM | 7-zip single | 7-zip multi | AES | memcpy | memset | 4M ns | 64M ns | kH/s | stockfish |
---|---|---|---|---|---|---|---|---|---|
528 | 2587 | 13050 | 1344830 | 3570 | 8450 | 63.2/99.3 | 235.8/271.3 | 22.06 | 3238057 |
1068 | 2940 | 15120 | 1344500 | 6270 | 16950 | 46.9/73.6 | 166.3/192.2 | 22.05 | 4122771 |
1560 | 3086 | 16040 | 1344060 | 8620 | 24390 | 38.6/58.8 | 139.9/158.0 | 22.03 | 4653285 |
2112 | 3167 | 16640 | 1343220 | 10850 | 29330 | 35.7/53.7 | 123.2/139.0 | 22.03 | freeze |
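To put the table into perspective, here is a minimal sketch that turns the rows into relative gains between two DRAM clocks. The numbers are copied from the table above; the names `RESULTS` and `gain_percent` are made up for illustration and are not part of sbc-bench. The latency columns are left out since lower is better there.

```python
# Scores copied from the table above. The stockfish run at 2112 MHz froze,
# so it is recorded as None instead of a number.
RESULTS = {
    528:  {"7-zip single": 2587, "7-zip multi": 13050, "memcpy": 3570,
           "memset": 8450, "kH/s": 22.06, "stockfish": 3238057},
    1068: {"7-zip single": 2940, "7-zip multi": 15120, "memcpy": 6270,
           "memset": 16950, "kH/s": 22.05, "stockfish": 4122771},
    1560: {"7-zip single": 3086, "7-zip multi": 16040, "memcpy": 8620,
           "memset": 24390, "kH/s": 22.03, "stockfish": 4653285},
    2112: {"7-zip single": 3167, "7-zip multi": 16640, "memcpy": 10850,
           "memset": 29330, "kH/s": 22.03, "stockfish": None},
}


def gain_percent(metric, low=528, high=2112):
    """Percentage change of `metric` when going from the `low` to the `high` DRAM clock.

    Returns None if one of the two runs has no score (e.g. the board froze).
    """
    a = RESULTS[low][metric]
    b = RESULTS[high][metric]
    if a is None or b is None:
        return None
    return round((b - a) / a * 100, 1)


if __name__ == "__main__":
    for metric in RESULTS[528]:
        # Fall back to 1560 MHz where the 2112 MHz run has no result.
        high = 2112 if RESULTS[2112][metric] is not None else 1560
        print(f"{metric}: {gain_percent(metric, 528, high)}% (528 -> {high} MHz)")
```

Quadrupling the DRAM clock roughly triples memcpy throughput but lifts 7-zip single by only about 22%, while stockfish already gains about 44% between 528 and 1560 MHz.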
To interpret the results (not talking about memory bandwidth/latency since these numbers are self-explanatory):
- 7-zip single: the single-threaded score depends highly on memory latency, so a lower DRAM clock, which results in massively higher latency, negatively affects the scores. Scores generated with 7-zip v16.02 are almost the same regardless of distribution, thanks to the p7zip package on Linux being more or less unmaintained. At least Debian Stretch, Buster, Bullseye and Ubuntu Bionic, Focal, Jammy, Kinetic all ship v16.02, and 7-zip MIPS on the same hardware with otherwise identical settings have been the same for over six consecutive years now (7-zip distro packages built with GCC 6.3 up to GCC 12.2)
- 7-zip multi: the same applies as for 7-zip single, but there's a huge caveat: depending on kernel version the multi-threaded scores can differ significantly. That's not a benchmarking flaw, since it also affects real-world tasks supposed to run fully parallel (see the ODROID-XU4 example below)
- AES: from an A76 and always the same, since ARMv8 Crypto Extensions do the job and the score scales linearly with clockspeed
- kH/s: cpuminer scores are not affected by DRAM clock (the working set is so small that everything fits into CPU caches) but by compiler version and flags (see the three Rock64 1400 MHz scores in my results list that only differ by GCC 6.3 vs. 7.3 vs. 8.2, or the fact that cpuminer generates a 25.31 score when built with GCC 9.3 vs. the roughly 13% lower score of about 22.0 when built with GCC 12.2 as above. A higher compiler version number does not always result in better scores)
- stockfish: OTOH depends significantly on DRAM clock. So far no idea whether that's related to bandwidth, latency or both.
Speaking about the 7-zip multi scores... those above were all generated with the same kernel version (a smelly 5.10 Rockchip BSP kernel). But with different kernel versions multi-threaded behaviour can change significantly, as already outlined in my reasoning to use 7-zip as benchmark.
Let's have a look at the effect of the kernel version on ODROID-XU4:
Kernel / Compiler | 7-zip single | 7-zip multi | CPU utilisation compression | CPU utilisation decompression |
---|---|---|---|---|
Kernel 4.9 / GCC 6.3 | 1622 | 6370 | 64% | 78% |
Kernel 4.14 / GCC 7.3 | 1633 | 7100 | 64% | 78% |
Kernel 5.4 / GCC 9.3 | 1604 | 8980 | 94% | 84% |
The single-threaded score is nearly the same with all kernel versions (1604 to 1633), but the multi-threaded scores and the reported CPU utilization differ a lot. It's a scheduler problem, not a benchmark problem.
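One way to make the scheduler effect visible is the ratio of multi-threaded to single-threaded score per kernel. The sketch below only reuses the numbers from the ODROID-XU4 table; `ROWS` and `multi_to_single_ratio` are illustrative names, and the ratio is a rough indicator rather than a metric sbc-bench reports.

```python
# (kernel / compiler, 7-zip single, 7-zip multi) copied from the ODROID-XU4 table.
ROWS = [
    ("Kernel 4.9 / GCC 6.3", 1622, 6370),
    ("Kernel 4.14 / GCC 7.3", 1633, 7100),
    ("Kernel 5.4 / GCC 9.3", 1604, 8980),
]


def multi_to_single_ratio(single, multi):
    """How many times higher the multi-threaded score is than the single-threaded one."""
    return round(multi / single, 2)


if __name__ == "__main__":
    for label, single, multi in ROWS:
        print(f"{label}: multi/single = {multi_to_single_ratio(single, multi)}")
```

With an almost constant single-threaded score, the ratio climbs from about 3.9 on kernel 4.9 to about 5.6 on kernel 5.4, which matches the higher reported CPU utilization.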
Another suggestion from cnx-software: rule out the A55 cores:
root@rock-5b:/home/tk# echo performance >/sys/devices/platform/dmc/devfreq/dmc/governor
root@rock-5b:/home/tk# echo performance >/sys/devices/system/cpu/cpufreq/policy4/scaling_governor
root@rock-5b:/home/tk# echo performance >/sys/devices/system/cpu/cpufreq/policy6/scaling_governor
root@rock-5b:/home/tk# for i in 3 2 1 0 ; do echo 0 >/sys/devices/system/cpu/cpu${i}/online; done
root@rock-5b:/home/tk# htop (confirm that A55 cores are offline)
root@rock-5b:/home/tk# phoronix-test-suite benchmark pts/stockfish-1.4.0
...
Stockfish 15:
pts/stockfish-1.4.0 [Total Time]
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 14 Minutes [09:38 CET]
Started Run 1 @ 09:24:08
Started Run 2 @ 09:28:58
Rock 5B frozen after 4:45m. Reported consumption 'at wall': 9-10W (all measurements with active fan which contributes 700mW to measurements).
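The manual htop check in the session above could also be scripted. This is a minimal sketch, assuming the A55 cores are cpu0 to cpu3 (as in the loop above) and that the kernel exposes the standard `/sys/devices/system/cpu/online` file; the helper names are made up for illustration.

```python
from pathlib import Path

ONLINE_PATH = "/sys/devices/system/cpu/online"  # standard Linux sysfs file


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,6' into [0, 1, 2, 3, 6]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def a55_offline(online, a55=(0, 1, 2, 3)):
    """True if none of the assumed A55 core numbers appear in the online list."""
    return not any(cpu in online for cpu in a55)


if __name__ == "__main__":
    if Path(ONLINE_PATH).exists():
        online = parse_cpu_list(Path(ONLINE_PATH).read_text())
        print("online CPUs:", online)
        print("A55 cores offline:", a55_offline(online))
    else:
        print(f"{ONLINE_PATH} not found, is this a Linux system?")
```

After the `echo 0` loop, the online file would be expected to read `4-7` on this board, so the check should report the A55 cores as offline.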
First implementation done: https://github.com/ThomasKaiser/sbc-bench/commit/bddc8d44c04c744ad0c341a480e3312a8dfce24e
Armbian has updated to the latest bl31 firmware since this commit. You have to check the currently used firmware in the serial console output.
@amazingfate sbc-bench -s reliably freezes my Rock 5B at 2112 MHz DRAM clock even with the latest BLOBs.
From cnx-software.
First invocation on Rock 5B in lazy mode (phoronix-test-suite benchmark pts/stockfish-1.4.0) already ended with the board freezing at the 2nd stockfish run. Attaching the fan to power and repeating the test again led to a freeze during the 2nd stockfish bench 128 8 24 default depth run.
The general problem was already known: so far on some boards the highest DRAM clock wasn't usable and users needed to switch from 2112 MHz to 1560 MHz for stable operation.
My board hasn't seen any freezes at the highest DRAM clock so this was a surprise. By updating my Armbian image to the latest version I was hoping to get the most recent boot BLOBs as part of the u-boot package. It now reads
ii linux-u-boot-rock-5b-legacy 22.11.0-trunk.0106 arm64 Uboot loader 2017.09
but problems got even worse and now the board freezes at 2112 MHz DRAM clock already at the 1st benchmark execution. Maybe @amazingfate can comment on whether my OS image is expected to run on the latest BLOBs or not?
With lower DRAM clock everything works as expected, but at 2112 MHz DRAM clock the board freezes regardless of the A76's clockspeeds (and as such DVFS/consumption), so it looks solely related to DRAM clock:
With other CPU benchmarks I haven't seen consumption exceeding 9W on Rock 5B, so stockfish is really a potent load generator / stability tester. On top of making heavy use of SIMD extensions it is also heavy on memory access: walking through the different DRAM clockspeeds ended up with significantly different scores: https://openbenchmarking.org/result/2211099-NE-2211093NE82
A quick check on an AMD EPYC 7232P (8C/16T) machine also hints at stockfish being more demanding than both cpuminer and 7-zip:
The first chart is from a NetIO powermeter (measuring at the wall), the 2nd is the server's internal BMC showing PSU1 (PSU2 is always in standby on this machine, so the whole productive consumption is PSU1's thing), and the last two are the BMC measurements for CPU and DRAM separately (though no idea to which number the memory controller contributes):