geerlingguy / top500-benchmark

Automated Top500 benchmark for clusters or single nodes.
MIT License

Benchmark Raspberry Pi 5 Linux kernel NUMA patch #36

Closed geerlingguy closed 2 months ago

geerlingguy commented 3 months ago

I would like to see if the NUMA patch at https://lore.kernel.org/lkml/20240625125803.38038-1-tursulin@igalia.com/ has any bearing on HPL performance and/or efficiency scores, especially if the effect is reproducible and significant.

The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). Also measure the power usage to see if this affects power draw positively, negatively, or not at all.

NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.

geerlingguy commented 3 months ago

Baseline

pi@pi5:~/linux $ uname -a
Linux pi5 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Geekbench 6

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 802 | 799 | 801 | 801 |
| Multi | 1721 | 1717 | 1731 | 1723 |
| Link | result | result | result | - |

HPL / Top 500

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 11.5W | 11.2W | 11.1W | 11.3W |
| Result | 28.839 Gflops | 28.395 Gflops | 28.545 Gflops | 28.593 Gflops |
| Efficiency | 2.51 Gflops/W | 2.53 Gflops/W | 2.57 Gflops/W | 2.54 Gflops/W |
Click to show representative result

```
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             292.97             2.8839e+01
HPL_pdgesv() start time Fri Jul  5 22:52:13 2024
HPL_pdgesv() end time   Fri Jul  5 22:57:06 2024
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```
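For anyone repeating these runs, the Result and Efficiency rows can be derived from the HPL output rather than copied by hand. The following is only a sketch: the function names `hpl_gflops` and `hpl_efficiency` are made up for illustration, and it assumes that result rows start with `WR` in the T/V column with Gflops as the last field, as in the output above. The power figure still has to come from an external meter.

```shell
#!/bin/sh
# Print the Gflops value of each HPL result row read on stdin.
# Assumption: result rows start with "WR" and Gflops is the last field.
hpl_gflops() {
  awk '$1 ~ /^WR/ { printf "%.3f\n", $NF }'
}

# Gflops per watt from a Gflops value ($1) and an average power reading ($2).
hpl_efficiency() {
  awk -v g="$1" -v w="$2" 'BEGIN { printf "%.2f\n", g / w }'
}

# Result row and average power taken from baseline run 1 in this thread.
sample='WR11C2R4       23314   256     1     4             292.97             2.8839e+01'
gflops=$(printf '%s\n' "$sample" | hpl_gflops)
echo "Result: $gflops Gflops"
echo "Efficiency: $(hpl_efficiency "$gflops" 11.5) Gflops/W"
```

With the baseline run 1 values this reproduces the 28.839 Gflops and 2.51 Gflops/W shown in the table.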

Power consumption graph showing HPL run tail end, and two of the Geekbench 6 runs. System uptime is over 24 hours.

geerlingguy commented 3 months ago

After applying NUMA patch

# After a quick rebuild of the current Pi 64-bit kernel source:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #1 SMP PREEMPT Sat Jul  6 23:19:18 CDT 2024 aarch64 GNU/Linux
  1. Download mbox.gz from this patch thread.
  2. On the Pi, in the linux checkout created by following Raspberry Pi's "Build the Linux kernel" guide, run: git am PATCH-2-2-arm64-numa-Add-NUMA-emulation-for-ARM64.mbox (skip empty messages).
  3. Configure NUMA emulation with make menuconfig (requires libncurses-dev, installed via apt):
    1. Enable "Kernel Features" > "NUMA Memory Allocation and Scheduler Support" (and enable "NUMA emulation" when it appears)
    2. Save the config and exit.
  4. Rebuild the kernel and reboot.
# After rebuilding the kernel with the NUMA patch:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #2 SMP PREEMPT Sun Jul  7 00:43:44 CDT 2024 aarch64 GNU/Linux
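
Before spending time on the rebuild, it may be worth confirming that the menuconfig step actually stuck. The sketch below only checks for `CONFIG_NUMA=y` in a kernel config file; the helper name `numa_enabled_in_config` is hypothetical, and the symbol name for the "NUMA emulation" option added by the patch is not given in this thread, so it is deliberately not checked here.

```shell
#!/bin/sh
# Succeed if the given kernel config file has NUMA support enabled.
# Disabled options appear as "# CONFIG_NUMA is not set", so an exact
# match on "CONFIG_NUMA=y" is enough.
numa_enabled_in_config() {
  grep -q '^CONFIG_NUMA=y$' "$1"
}

# Demo against a throwaway file; in the kernel tree the argument
# would be the .config written by make menuconfig.
demo=$(mktemp)
printf 'CONFIG_ARM64=y\nCONFIG_NUMA=y\n' > "$demo"
if numa_enabled_in_config "$demo"; then
  echo "NUMA support is enabled"
else
  echo "NUMA support is NOT enabled"
fi
rm -f "$demo"
```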

IMPORTANT NOTE: The following results were taken with the NUMA Emulation patch applied, but without adding numa=fake=4 to cmdline.txt. See follow-up comment below with results after setting that parameter.


Geekbench 6

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 795 | 801 | 802 | 799 |
| Multi | 1636 | 1626 | 1638 | 1633 |
| Link | result | result | result | - |

Single core: 0.25% slower; multicore: 5.36% slower (percent difference against the baseline averages, taken relative to the mean of the two values).

HPL / Top 500

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 11.4W | 11.0W | 11.1W | 11.2W |
| Result | 31.348 Gflops | 30.621 Gflops | 30.958 Gflops | 30.976 Gflops |
| Efficiency | 2.75 Gflops/W | 2.78 Gflops/W | 2.78 Gflops/W | 2.77 Gflops/W |

Result: 8.00% faster; efficiency: 8.66% more efficient (percent difference against the baseline averages).
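
A note on the arithmetic: the percentages quoted in this thread appear to match a symmetric percent difference (the gap divided by the mean of the two averages), not the change relative to the baseline. The two measures give slightly different numbers, so here is a small sketch of both. The function names `pct_diff` and `pct_change` are illustrative only; the inputs are the averages from the tables above.

```shell
#!/bin/sh
# Symmetric percent difference: |a - b| relative to the mean of a and b.
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = a - b; if (d < 0) d = -d
    printf "%.2f\n", d / ((a + b) / 2) * 100
  }'
}

# Change of a new value ($2) relative to a baseline ($1); negative means lower.
pct_change() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (new - base) / base * 100 }'
}

# HPL result averages: baseline 28.593 Gflops, patched kernel 30.976 Gflops.
echo "HPL percent difference: $(pct_diff 28.593 30.976)%"        # 8.00
echo "HPL change vs baseline: $(pct_change 28.593 30.976)%"      # 8.33

# Geekbench 6 multicore averages: baseline 1723, patched kernel 1633.
echo "Multicore percent difference: $(pct_diff 1723 1633)%"      # 5.36
echo "Multicore change vs baseline: $(pct_change 1723 1633)%"    # -5.22
```

Either measure is fine as long as it is applied consistently; relative to the baseline, the HPL gain here works out to 8.33% and the multicore loss to 5.22%.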

Click to show representative result

```
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             269.52             3.1348e+01
HPL_pdgesv() start time Sun Jul  7 15:05:55 2024
HPL_pdgesv() end time   Sun Jul  7 15:10:25 2024
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```
geerlingguy commented 2 months ago

I also ran Geekbench 6 just after boot (1 min uptime) with the NUMA patch in place. Here's the result: https://browser.geekbench.com/v6/cpu/6820837 (801 / 1636).

geerlingguy commented 2 months ago

And another Geekbench 6 run about 1 hour after boot, after a cooldown period of 10 minutes following all the previous tests: https://browser.geekbench.com/v6/cpu/6821505 (799 / 1637). So, at least on this Pi 5 8GB running this Linux kernel, there is no noticeable difference between runs immediately following boot and runs much later.

Going to move some other performance testing over to https://github.com/geerlingguy/sbc-reviews/issues/21

will127534 commented 2 months ago

Hi @geerlingguy, I'm running through your steps and I think we also need to add numa=fake=4 to the cmdline.txt.

geerlingguy commented 2 months ago

@will127534 - heh... as I was writing up a bit of a post on this... I realized that exact step was missing. I'm going to re-test now. After adding numa=fake=4 to /boot/firmware/cmdline.txt and rebooting, I now see:

pi@pi5:~ $ dmesg
...
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
[    0.000000] Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
[    0.000000] Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
...
[    0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe  smsc95xx.macaddr=D8:3A:DD:84:FB:3A vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000  console=ttyAMA10,115200 console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes numa=fake=4 rootwait
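
Since this is the step that was missed the first time around, here is a rough sketch of adding the parameter without duplicating it on repeat runs. The helper name `add_cmdline_param` is made up for illustration, and the demo works on a temporary file instead of the real /boot/firmware/cmdline.txt. It keeps everything on the first line, on the assumption that cmdline.txt holds all parameters on a single line.

```shell
#!/bin/sh
# Append a kernel parameter ($2) to a single-line cmdline file ($1)
# unless that exact parameter is already present.
add_cmdline_param() {
  file=$1 param=$2
  # Split the line into words and look for an exact match.
  if tr ' ' '\n' < "$file" | grep -qxF -- "$param"; then
    return 0
  fi
  tmp=$(mktemp) || return 1
  # Add the parameter to the end of the first line only.
  awk -v p="$param" 'NR == 1 { print $0 " " p; next } { print }' "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demo on a throwaway file holding a shortened command line.
demo=$(mktemp)
echo 'console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes rootwait' > "$demo"
add_cmdline_param "$demo" numa=fake=4
add_cmdline_param "$demo" numa=fake=4   # second call changes nothing
cat "$demo"
rm -f "$demo"
```

On the Pi itself the file is owned by root, so the function would need to run from a root shell, and a reboot is required before the "Faking a node" lines show up in dmesg.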

Geekbench 6

Run with: numactl --interleave=all ./geekbench6 (numactl was installed with sudo apt install -y numactl).
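
Interleaving only helps if the kernel really exposes more than one node, so a quick sanity check before benchmarking can save a wasted run. This is a minimal sketch that counts the node directories under sysfs; the function name `count_numa_nodes` is illustrative, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Count NUMA nodes by looking for nodeN directories under the sysfs
# node root (default: /sys/devices/system/node).
count_numa_nodes() {
  root=${1:-/sys/devices/system/node}
  n=0
  for d in "$root"/node[0-9]*; do
    # An unmatched glob stays literal, so confirm it is a directory.
    if [ -d "$d" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

echo "NUMA nodes visible: $(count_numa_nodes)"
```

With numa=fake=4 in effect this should report 4; numactl --hardware gives a more detailed view of the same information.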

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 854 | 853 | 851 | 853 |
| Multi | 1949 | 1947 | 1936 | 1944 |
| Link | result | result | result | - |

Single core: 6.29% faster; multicore: 12.05% faster (percent difference against the baseline averages).

HPL / Top 500

Modified the mpirun command in the main.yml playbook to prepend numactl --interleave=all.

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 12.0W | 12.1W | 12.0W | 12.0W |
| Result | 33.204 Gflops | 33.194 Gflops | 33.143 Gflops | 33.180 Gflops |
| Efficiency | 2.78 Gflops/W | 2.74 Gflops/W | 2.76 Gflops/W | 2.76 Gflops/W |

Result: 14.85% faster; efficiency: 8% more efficient (percent difference against the baseline averages).

Click to show representative result

```
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             254.46             3.3204e+01
HPL_pdgesv() start time Mon Jul  8 13:15:36 2024
HPL_pdgesv() end time   Mon Jul  8 13:19:51 2024
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```