Benchmark Raspberry Pi 5 Linux kernel NUMA patch

geerlingguy commented 4 months ago

I would like to see whether the NUMA patch at https://lore.kernel.org/lkml/20240625125803.38038-1-tursulin@igalia.com/ has any bearing on HPL performance and/or efficiency scores, and especially whether any change is reproducible and significant.

The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). Also measure the power usage to see if this affects power draw positively, negatively, or not at all.

NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.

geerlingguy commented 4 months ago

Baseline

pi@pi5:~/linux $ uname -a
Linux pi5 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Geekbench 6

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 802 | 799 | 801 | 801 |
| Multi | 1721 | 1717 | 1731 | 1723 |
| Link | result | result | result | - |

HPL / Top 500

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 11.5W | 11.2W | 11.1W | 11.3W |
| Result | 28.839 Gflops | 28.395 Gflops | 28.545 Gflops | 28.593 Gflops |
| Efficiency | 2.51 Gflops/W | 2.53 Gflops/W | 2.57 Gflops/W | 2.54 Gflops/W |
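For anyone wanting to recompute the efficiency row: it is the HPL result divided by the average power draw. Here is a minimal Python sketch, assuming the function name `gflops_per_watt` (hypothetical, not part of the benchmark playbook). The wattages in the table are rounded to one decimal, so recomputing from them may differ from the posted values in the last digit.

```python
def gflops_per_watt(gflops: float, avg_watts: float) -> float:
    """Efficiency as HPL Gflops divided by average power draw in watts."""
    if avg_watts <= 0:
        raise ValueError("average power must be positive")
    return gflops / avg_watts


# Baseline runs from the table: (Gflops, average watts as displayed).
baseline_runs = [(28.839, 11.5), (28.395, 11.2), (28.545, 11.1)]

for run, (gflops, watts) in enumerate(baseline_runs, start=1):
    # Rounded to two decimals, like the Efficiency row.
    print(f"Run {run}: {gflops_per_watt(gflops, watts):.2f} Gflops/W")
```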

Representative result:

```
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             292.97             2.8839e+01
HPL_pdgesv() start time Fri Jul  5 22:52:13 2024

HPL_pdgesv() end time   Fri Jul  5 22:57:06 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```
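The Gflops figure in the result line can be cross-checked from N and the wall time. The sketch below assumes HPL's usual operation count for the LU solve, 2/3·N³ + 3/2·N², and the helper name `hpl_gflops` is made up for illustration. With N = 23314 and 292.97 s it lands within rounding of the reported 2.8839e+01.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Estimate the HPL rate from problem size N and wall time.

    Assumes the conventional operation count for solving an N x N
    dense system with LU factorization: 2/3*N^3 + 3/2*N^2.
    """
    if seconds <= 0:
        raise ValueError("wall time must be positive")
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# N and Time taken from the WR11C2R4 result line above.
print(f"{hpl_gflops(23314, 292.97):.3f} Gflops")
```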

Power consumption graph showing HPL run tail end, and two of the Geekbench 6 runs. System uptime is over 24 hours.

geerlingguy commented 4 months ago

After applying NUMA patch

# After a quick rebuild of the current Pi 64-bit kernel source:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #1 SMP PREEMPT Sat Jul  6 23:19:18 CDT 2024 aarch64 GNU/Linux

1. Download mbox.gz from the patch thread.
2. On the Pi, in the linux checkout from Raspberry Pi's "build the Linux kernel" guide, run: git am PATCH-2-2-arm64-numa-Add-NUMA-emulation-for-ARM64.mbox (skip empty messages).
3. Configure NUMA emulation with make menuconfig (requires libncurses-dev to be installed via apt):
   1. Enable "Kernel Features" > "NUMA Memory Allocation and Scheduler Support" (and enable "NUMA emulation" when it appears).
   2. Save the config and exit.
4. Rebuild the kernel and reboot.
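Before rebuilding, it can help to confirm the menuconfig change actually landed in the saved config. This is a rough Python sketch that scans kernel config text for symbols set to `y`. The function name `enabled_symbols` and the sample text are hypothetical. `CONFIG_NUMA` is the symbol behind "NUMA Memory Allocation and Scheduler Support" on arm64; the symbol added by the emulation patch is not named in this thread, so it is not checked here.

```python
def enabled_symbols(config_text: str) -> set[str]:
    """Return the names of all symbols set to 'y' in kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


# Hypothetical excerpt; in practice, read the .config file in the linux checkout.
sample_config = """\
CONFIG_ARM64=y
CONFIG_NUMA=y
# CONFIG_EXAMPLE_OPTION is not set
CONFIG_NR_CPUS=4
"""

print("CONFIG_NUMA" in enabled_symbols(sample_config))
```

To run it against the real tree, pass the contents of `.config` from the linux checkout instead of the sample string.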

# After rebuilding the kernel with the NUMA patch:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #2 SMP PREEMPT Sun Jul  7 00:43:44 CDT 2024 aarch64 GNU/Linux

IMPORTANT NOTE: The following results were taken with the NUMA Emulation patch applied, but without adding numa=fake=4 to cmdline.txt. See follow-up comment below with results after setting that parameter.

Geekbench 6

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 795 | 801 | 802 | 799 |
| Multi | 1636 | 1626 | 1638 | 1633 |
| Link | result | result | result | - |

Single core: 0.25% slower Multicore: 5.36% slower

HPL / Top 500

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 11.4W | 11.0W | 11.1W | 11.2W |
| Result | 31.348 Gflops | 30.621 Gflops | 30.958 Gflops | 30.976 Gflops |
| Efficiency | 2.75 Gflops/W | 2.78 Gflops/W | 2.78 Gflops/W | 2.77 Gflops/W |

Result: 8.00% faster Efficiency: 8.66% more efficient
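A note on how the percentages in this thread can be reproduced: they match a symmetric percent difference (the gap divided by the mean of the two values), not the change relative to the baseline. The Python sketch below shows both, with hypothetical function names. Measured against the baseline alone, the same HPL averages work out to roughly 8.33% rather than 8.00%.

```python
def pct_difference(a: float, b: float) -> float:
    """Symmetric percent difference: gap divided by the mean of both values."""
    return abs(b - a) / ((a + b) / 2) * 100


def pct_change(baseline: float, new: float) -> float:
    """Percent change relative to the baseline value (signed)."""
    return (new - baseline) / baseline * 100


# Averages from the baseline and patched tables.
hpl_baseline, hpl_patched = 28.593, 30.976
gb_multi_baseline, gb_multi_patched = 1723, 1633

print(f"HPL, symmetric difference: {pct_difference(hpl_baseline, hpl_patched):.2f}%")
print(f"HPL, change vs baseline:   {pct_change(hpl_baseline, hpl_patched):.2f}%")
print(f"Geekbench multi, symmetric difference: "
      f"{pct_difference(gb_multi_baseline, gb_multi_patched):.2f}%")
```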

Representative result:

```
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             269.52             3.1348e+01
HPL_pdgesv() start time Sun Jul  7 15:05:55 2024

HPL_pdgesv() end time   Sun Jul  7 15:10:25 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```

geerlingguy commented 4 months ago

I also ran Geekbench 6 just after boot (1 min uptime) with the NUMA patch in place. Here's the result: https://browser.geekbench.com/v6/cpu/6820837 (801 / 1636).

geerlingguy commented 4 months ago

And another Geekbench 6 run about 1 hour after boot, after a cooldown period of 10 minutes following all the previous tests: https://browser.geekbench.com/v6/cpu/6821505 (799 / 1637). So there's no noticeable difference, at least on this Pi 5 8GB running this Linux kernel, between runs immediately following boot and runs much later.

Going to move some other performance testing over to https://github.com/geerlingguy/sbc-reviews/issues/21

will127534 commented 4 months ago

Hi @geerlingguy, I'm running through your steps and I think we also need to add numa=fake=4 to the cmdline.txt.

geerlingguy commented 4 months ago

@will127534 - heh... as I was writing up a bit of a post on this... I realized that exact step was missing. I'm going to re-test now. Adding numa=fake=4 to /boot/firmware/cmdline.txt and rebooting, I now see:

pi@pi5:~ $ dmesg
...
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
[    0.000000] Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
[    0.000000] Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
...
[    0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe  smsc95xx.macaddr=D8:3A:DD:84:FB:3A vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000  console=ttyAMA10,115200 console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes numa=fake=4 rootwait

Geekbench 6

Run with: numactl --interleave=all ./geekbench6 (numactl was installed with sudo apt install -y numactl).

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 854 | 853 | 851 | 853 |
| Multi | 1949 | 1947 | 1936 | 1944 |
| Link | result | result | result | - |

Single core: 6.29% faster Multicore: 12.05% faster

HPL / Top 500

Modified the mpirun command in the main.yml playbook to prepend numactl --interleave=all.

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 12.0W | 12.1W | 12.0W | 12.0W |
| Result | 33.204 Gflops | 33.194 Gflops | 33.143 Gflops | 33.180 Gflops |
| Efficiency | 2.78 Gflops/W | 2.74 Gflops/W | 2.76 Gflops/W | 2.76 Gflops/W |

Result: 14.85% faster Efficiency: 8% more efficient
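Both benchmarks in this round were launched with the same `numactl --interleave=all` prefix. If scripting repeat runs, something like the Python sketch below could wrap any command the same way. The helper name and the `mpirun` arguments are hypothetical; the playbook's real mpirun command is not shown in this thread.

```python
import shlex
import shutil


def with_interleave(command: list[str]) -> list[str]:
    """Prefix a command so its memory is interleaved across all NUMA nodes."""
    return ["numactl", "--interleave=all", *command]


# The Geekbench invocation from this thread.
geekbench_cmd = with_interleave(["./geekbench6"])

# Hypothetical HPL invocation; the actual arguments live in main.yml.
hpl_cmd = with_interleave(["mpirun", "-np", "4", "./xhpl"])

print(shlex.join(geekbench_cmd))
print(shlex.join(hpl_cmd))

# numactl has to be installed (sudo apt install -y numactl) before running.
if shutil.which("numactl") is None:
    print("numactl not found on PATH")
```

Passing the resulting list to `subprocess.run(..., check=True)` would launch the benchmark; that call is left out here so the sketch has no side effects.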

Representative result:

```
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             254.46             3.3204e+01
HPL_pdgesv() start time Mon Jul  8 13:15:36 2024

HPL_pdgesv() end time   Mon Jul  8 13:19:51 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```

geerlingguy / top500-benchmark