Closed geerlingguy closed 4 months ago
pi@pi5:~/linux $ uname -a
Linux pi5 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux
Type | Run 1 | Run 2 | Run 3 | Average |
---|---|---|---|---|
Single | 802 | 799 | 801 | 801 |
Multi | 1721 | 1717 | 1731 | 1723 |
Link | result | result | result | - |
Stat | Run 1 | Run 2 | Run 3 | Average |
---|---|---|---|---|
Power (avg) | 11.5W | 11.2W | 11.1W | 11.3W |
Result | 28.839 Gflops | 28.395 Gflops | 28.545 Gflops | 28.593 Gflops |
Efficiency | 2.51 Gflops/W | 2.53 Gflops/W | 2.57 Gflops/W | 2.54 Gflops/W |
Power consumption graph showing HPL run tail end, and two of the Geekbench 6 runs. System uptime is over 24 hours.
# After a quick rebuild of the current Pi 64-bit kernel source:
pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #1 SMP PREEMPT Sat Jul 6 23:19:18 CDT 2024 aarch64 GNU/Linux
1. Download the `mbox.gz` from this patch thread.
2. In the `linux` checkout from Raspberry Pi's "build the Linux kernel" guide, run `git am PATCH-2-2-arm64-numa-Add-NUMA-emulation-for-ARM64.mbox` (skip empty messages).
3. Run `make menuconfig` (requires `libncurses-dev` to be installed via `apt`).
# After rebuilding the kernel with the NUMA patch:
pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #2 SMP PREEMPT Sun Jul 7 00:43:44 CDT 2024 aarch64 GNU/Linux
IMPORTANT NOTE: The following results were taken with the NUMA Emulation patch applied, but without adding `numa=fake=4` to `cmdline.txt`. See the follow-up comment below for results after setting that parameter.
Type | Run 1 | Run 2 | Run 3 | Average |
---|---|---|---|---|
Single | 795 | 801 | 802 | 799 |
Multi | 1636 | 1626 | 1638 | 1633 |
Link | result | result | result | - |
Single core: 0.25% slower. Multicore: 5.36% slower.
Stat | Run 1 | Run 2 | Run 3 | Average |
---|---|---|---|---|
Power (avg) | 11.4W | 11.0W | 11.1W | 11.2W |
Result | 31.348 Gflops | 30.621 Gflops | 30.958 Gflops | 30.976 Gflops |
Efficiency | 2.75 Gflops/W | 2.78 Gflops/W | 2.78 Gflops/W | 2.77 Gflops/W |
Result: 8.00% faster. Efficiency: 8.66% more efficient.
I also ran Geekbench 6 just after boot (1 min uptime) with the NUMA patch in place. Here's the result: https://browser.geekbench.com/v6/cpu/6820837 (801 / 1636).
And another Geekbench 6 run about 1 hour after boot, after a cooldown period of 10 minutes following all the previous tests: https://browser.geekbench.com/v6/cpu/6821505 (799 / 1637). So there is no noticeable difference, at least on this Pi 5 8GB running this Linux kernel, between runs immediately following boot and runs much later.
Going to move some other performance testing over to https://github.com/geerlingguy/sbc-reviews/issues/21
Hi @geerlingguy, I'm running through your steps and I think we also need to add `numa=fake=4` to the cmdline.txt.
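A minimal sketch of that step, assuming the parameter only needs to be appended once to the single-line kernel command line. The helper name `add_kernel_param` and the shortened example command line are illustrative, not taken from the thread, and the sketch works on a string rather than writing to the boot partition.

```python
def add_kernel_param(cmdline: str, param: str) -> str:
    """Return the kernel command line with `param` appended if it is missing."""
    # cmdline.txt must stay on a single line, so split on any whitespace
    # (including a trailing newline) and re-join with single spaces.
    tokens = cmdline.split()
    if param not in tokens:
        tokens.append(param)
    return " ".join(tokens)


# Shortened, illustrative command line (not the full one from the Pi).
before = "console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes rootwait\n"
after = add_kernel_param(before, "numa=fake=4")
print(after)
```

Calling the helper a second time on its own output leaves the line unchanged, so it is safe to re-run. After a reboot, `cat /proc/cmdline` shows whether the kernel actually received the parameter.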
@will127534 - heh... as I was writing up a bit of a post on this... I realized that exact step was missing. I'm going to re-test now. After adding `numa=fake=4` to `/boot/firmware/cmdline.txt` and rebooting, I now see:
pi@pi5:~ $ dmesg
...
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[ 0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
[ 0.000000] Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
[ 0.000000] Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
...
[ 0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe smsc95xx.macaddr=D8:3A:DD:84:FB:3A vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000 console=ttyAMA10,115200 console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes numa=fake=4 rootwait
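As a quick sanity check on the output above, the four "Faking a node" ranges can be converted to sizes. This is a small sketch that parses the log lines quoted in this comment; the function name `fake_node_sizes_gib` is made up for illustration, and the sizes are the spans of the address ranges as printed.

```python
import re

# The relevant dmesg lines, copied from the output above.
DMESG = """\
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[ 0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
[ 0.000000] Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
[ 0.000000] Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
"""

NODE_RE = re.compile(r"Faking a node at \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]")


def fake_node_sizes_gib(log: str) -> list[float]:
    """Return the span of each emulated node's address range in GiB."""
    sizes = []
    for start, end in NODE_RE.findall(log):
        # The ranges are inclusive, hence the +1.
        size_bytes = int(end, 16) - int(start, 16) + 1
        sizes.append(size_bytes / 2**30)
    return sizes


sizes = fake_node_sizes_gib(DMESG)
print(len(sizes), "nodes:", sizes, "total GiB:", sum(sizes))
```

Each range spans 2 GiB, so the four emulated nodes together cover the 8 GiB address range of this Pi 5. On the running system, `numactl --hardware` should list the same four nodes.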
Run with: `numactl --interleave=all ./geekbench6` (installed with `sudo apt install -y numactl`).
Type | Run 1 | Run 2 | Run 3 | Average |
---|---|---|---|---|
Single | 854 | 853 | 851 | 853 |
Multi | 1949 | 1947 | 1936 | 1944 |
Link | result | result | result | - |
Single core: 6.29% faster. Multicore: 12.05% faster.
Modified the `main.yml` playbook's `mpirun` command to prepend `numactl --interleave=all`.
Stat | Run 1 | Run 2 | Run 3 | Average |
---|---|---|---|---|
Power (avg) | 12.0W | 12.1W | 12.0W | 12.0W |
Result | 33.204 Gflops | 33.194 Gflops | 33.143 Gflops | 33.180 Gflops |
Efficiency | 2.78 Gflops/W | 2.74 Gflops/W | 2.76 Gflops/W | 2.76 Gflops/W |
Result: 14.85% faster. Efficiency: 8% more efficient.
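For anyone checking the percentages in this thread: they appear to be symmetric percent differences (the gap divided by the mean of the two averages), not the change relative to the baseline. The sketch below recomputes them from the table averages; the function names are illustrative, and the choice of formula is inferred from the numbers, not stated in the thread.

```python
def percent_difference(baseline: float, patched: float) -> float:
    """Symmetric percent difference: the gap relative to the mean of both values."""
    return abs(patched - baseline) / ((baseline + patched) / 2) * 100


def percent_change(baseline: float, patched: float) -> float:
    """Conventional relative change, measured against the baseline."""
    return (patched - baseline) / baseline * 100


# (baseline average, average with the patch and numa=fake=4) from the tables above.
results = {
    "Geekbench 6 single": (801, 853),
    "Geekbench 6 multi": (1723, 1944),
    "HPL Gflops": (28.593, 33.180),
}

for name, (base, patched) in results.items():
    print(
        f"{name}: {percent_difference(base, patched):.2f}% (symmetric), "
        f"{percent_change(base, patched):.2f}% (vs. baseline)"
    )
```

Measured against the baseline, the gains come out slightly larger than the symmetric figures: roughly 12.8% for Geekbench 6 multicore and 16.0% for HPL.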
I would like to see if the NUMA patch at https://lore.kernel.org/lkml/20240625125803.38038-1-tursulin@igalia.com/ has any bearing on HPL performance and/or efficiency scores, especially whether any effect is reproducible and significant.
The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). I'll also measure power usage to see whether the patch affects power draw positively, negatively, or not at all.
NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.