TASK [Output the results.] *********************************************************************************************
ok: [127.0.0.1] =>
mpirun_output.stdout: |-
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 23314
NB : 192
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 23314 192 2 2 80.71 1.0468e+02
HPL_pdgesv() start time Wed Nov 9 01:31:56 2022
HPL_pdgesv() end time Wed Nov 9 01:33:16 2022
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.20055447e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
That's the setup on my own M2 MacBook Air, at least (104.68 Gflops, running a bit throttled under Docker for Mac), using a new set of options for calculating values for Ns and NPs in hpl.dat.
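For anyone sizing their own runs, here is a minimal sketch of one common rule of thumb for picking N and the P x Q grid. It is not the calculation used in this thread (the N values reported here are not multiples of NB, so the actual method evidently differs). The 80% memory fraction, the function names, and the 8 GiB example are assumptions for illustration only.

```python
import math


def hpl_problem_size(total_mem_bytes, nb, mem_fraction=0.8):
    """Largest N (a multiple of nb) whose N x N float64 matrix fits in
    mem_fraction of total memory. The 0.8 default is an assumed rule of thumb."""
    usable_doubles = int(total_mem_bytes * mem_fraction) // 8  # 8 bytes per double
    n = math.isqrt(usable_doubles)
    return (n // nb) * nb  # round down to a whole number of NB-sized blocks


def process_grid(nprocs):
    """Most nearly square P x Q factorization of nprocs with P <= Q."""
    p = math.isqrt(nprocs)
    while nprocs % p:
        p -= 1
    return p, nprocs // p


if __name__ == "__main__":
    print(process_grid(4))    # (2, 2)
    print(process_grid(24))   # (4, 6)
    print(hpl_problem_size(8 * 2**30, nb=192))
```

With 4 and 24 processes this yields the 2 x 2 and 4 x 6 grids that appear in the two runs in this thread.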
I can also say the M2 SoC gets cooked when running this benchmark—on my M2 Air, it jumps to 105°C immediately and stays there while the thing throttles through the rest of the cycle.
Creating a new repo for the actual benchmarking automation: https://github.com/geerlingguy/top500-benchmark
Using that project with the following hosts.ini, I could manage the cluster after flashing Pi OS to each of the CM4s and booting off NVMe (I had to update their bootloaders to prefer NVMe):
[cluster]
node1.local
node2.local
node3.local
node4.local
node5.local
node6.local
[cluster:vars]
ansible_user=pi
Then I can do common commands with Ansible too, like:
# Ping all nodes.
ansible cluster -m ping
# Get temperature of each Pi.
ansible cluster -a "vcgencmd measure_temp"
# Shut down all the Pis.
ansible cluster -m shutdown -b
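To keep an eye on throttling during a long run, the output of the temperature command above can be parsed with a few lines of Python. This is only a sketch: it assumes Ansible's default ad-hoc output (a `host | CHANGED | rc=0 >>` header followed by the command output) and the usual `temp=48.3'C` format from `vcgencmd measure_temp`. The function names and the 80°C warning threshold are hypothetical choices, not something taken from this thread.

```python
import re

# Assumed shape of Ansible's default ad-hoc header line.
HOST_RE = re.compile(r"^(\S+) \| \w+ \| rc=\d+ >>$")
# Shape of the vcgencmd measure_temp output line.
TEMP_RE = re.compile(r"^temp=([0-9.]+)'C$")


def parse_temps(text):
    """Return {hostname: temperature in degrees C} from ad-hoc output text."""
    temps = {}
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_RE.match(line)
        if header:
            host = header.group(1)
            continue
        reading = TEMP_RE.match(line)
        if reading and host is not None:
            temps[host] = float(reading.group(1))
    return temps


def hot_nodes(temps, threshold_c=80.0):
    """Hosts at or above the (assumed) warning threshold, sorted by name."""
    return sorted(h for h, t in temps.items() if t >= threshold_c)


if __name__ == "__main__":
    sample = (
        "node1.local | CHANGED | rc=0 >>\n"
        "temp=54.5'C\n"
        "node2.local | CHANGED | rc=0 >>\n"
        "temp=85.2'C\n"
    )
    temps = parse_temps(sample)
    print(temps)
    print(hot_nodes(temps))
```

The sample text in the `__main__` block is made up for demonstration; in practice the input would be the captured output of `ansible cluster -a "vcgencmd measure_temp"`.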
First run, a bit un-optimized (I've been poring over HPL docs and blog posts to make sure I can get a good snapshot and try to generalize it for other clusters and core counts):
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 51080
NB : 192
PMAP : Row-major process mapping
P : 4
Q : 6
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 51080 192 4 6 1541.70 5.7634e+01
HPL_pdgesv() start time Wed Nov 16 11:35:23 2022
HPL_pdgesv() end time Wed Nov 16 12:01:05 2022
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 6.99743054e-04 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
57.634 Gigaflops at 1.5 GHz default clock speeds. Not bad at all, and I haven't fully tuned the parameters for HPL.dat.
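As a sanity check on the reported figure, the Gflops value can be recomputed from N and the wall time using the standard HPL operation count of (2/3)N^3 + (3/2)N^2 floating-point operations. The sketch below is illustrative (the function name is my own), and because the printed time is rounded to two decimals the result only matches to roughly the displayed precision.

```python
def hpl_gflops(n, seconds):
    """Gflops for an HPL solve of order n that took `seconds` of wall time,
    using the operation count (2/3)*n**3 + (3/2)*n**2."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # N and Time taken from the two result lines in this thread.
    print(f"{hpl_gflops(51080, 1541.70):.3f}")  # cluster run
    print(f"{hpl_gflops(23314, 80.71):.2f}")    # M2 MacBook Air run
```

Both values land within rounding of the 5.7634e+01 and 1.0468e+02 figures printed by HPL.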
- N 51080 and NB 192: 57.634 Gflops
- N 44236 and NB 192: 55.698 Gflops
- N 52133 and NB 256 on 5 nodes (since I don't have 6x 8 GB CM4s): 50.629 Gflops
- N 51080 and NB 256: 60.293 Gflops (40W, spiking to 41W)
- N 51080 and NB 256 with 2.0 GHz overclock: 70.338 Gflops (51W)

For the overclock, the first time I ran it in the unmodified Super6c enclosure, I noticed power consumption was varying between 40-53W, with occasional spikes to 64W! I then checked if the Pis were throttling, and they definitely were, hitting 85°C quite frequently. So I popped off the top cover and put on a 140mm fan and the temps stayed down in the mid-50s for the final run.
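The two runs with power figures can be compared on efficiency with some quick arithmetic. This sketch uses only the numbers quoted above (60.293 Gflops at 40W and 70.338 Gflops at 51W) and ignores the brief spike to 41W. The labels and function name are my own.

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency as Gflops per watt of measured draw."""
    return gflops / watts


# (Gflops, watts) as reported in this thread for N 51080, NB 256.
runs = {
    "NB 256": (60.293, 40.0),
    "NB 256, 2.0 GHz overclock": (70.338, 51.0),
}

if __name__ == "__main__":
    for label, (gflops, watts) in runs.items():
        print(f"{label}: {gflops_per_watt(gflops, watts):.2f} Gflops/W")

    base_gflops, base_watts = runs["NB 256"]
    oc_gflops, oc_watts = runs["NB 256, 2.0 GHz overclock"]
    print(f"speedup: {oc_gflops / base_gflops:.3f}x")
    print(f"power increase: {oc_watts / base_watts:.3f}x")
```

On these figures the overclock buys about 17% more throughput for about 28% more power, so efficiency drops from roughly 1.51 to 1.38 Gflops/W.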
I think I have the data I'm looking for ;)
I really, really wish I had a 6th 8 GB CM4, but I haven't been able to pick one up in over a year, so oh well!
As the title says, adapt it from: https://github.com/geerlingguy/turing-pi-2-cluster/tree/master/benchmark