Benchmark 6-node Pi cluster with HPL / Linpack

geerlingguy commented 1 year ago

As the title says, adapt it from: https://github.com/geerlingguy/turing-pi-2-cluster/tree/master/benchmark

geerlingguy commented 1 year ago

TASK [Output the results.] *********************************************************************************************
ok: [127.0.0.1] => 
  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================

    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.

    The following parameter values will be used:

    N      :   23314
    NB     :     192
    PMAP   : Row-major process mapping
    P      :       2
    Q      :       2
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words

    --------------------------------------------------------------------------------

    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0

    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       23314   192     2     2              80.71             1.0468e+02
    HPL_pdgesv() start time Wed Nov  9 01:31:56 2022

    HPL_pdgesv() end time   Wed Nov  9 01:33:16 2022

    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.20055447e-03 ...... PASSED
    ================================================================================

    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------

    End of Tests.
    ================================================================================

That's the setup on my own M2 MacBook Air, at least (104.68 Gflops, running a bit throttled under Docker for Mac), using a new set of options for calculating values for Ns and NPs in HPL.dat.
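For anyone sizing a run by hand: a common rule of thumb is to pick N so that the N x N matrix of 8-byte doubles fills most of the available RAM, and to pick P and Q as the most nearly square factor pair of the process count. The Python sketch below is one way to do that and is not necessarily what the playbook does. The function names, the 0.90 scale factor, and the 5 GiB memory figure for the Docker VM are assumptions, chosen because they happen to reproduce N = 23314 and the 2 x 2 grid in the output above.

```python
import math


def problem_size(total_mem_gib: float, scale: float = 0.90) -> int:
    """Side length N of a double-precision matrix sized from total RAM.

    sqrt(bytes / 8) is the N that would fill memory exactly; `scale`
    (an assumed safety factor) shrinks N to leave room for the OS.
    """
    total_bytes = total_mem_gib * 1024 ** 3
    return math.floor(math.sqrt(total_bytes / 8) * scale)


def process_grid(processes: int) -> tuple[int, int]:
    """Most nearly square (P, Q) with P <= Q and P * Q == processes."""
    p = math.isqrt(processes)
    while processes % p:
        p -= 1
    return p, processes // p


if __name__ == "__main__":
    # Assumed inputs: 5 GiB visible to the Docker VM, 4 MPI processes.
    print(problem_size(5), process_grid(4))
```

The same two functions can be pointed at a whole cluster by passing the summed memory of all nodes and the total core count.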

I can also say the M2 SoC gets cooked when running this benchmark—on my M2 Air, it jumps to 105°C immediately and stays there while the thing throttles through the rest of the cycle.

geerlingguy commented 1 year ago

Creating a new repo for the actual benchmarking automation: https://github.com/geerlingguy/top500-benchmark

geerlingguy commented 1 year ago

Using that project with the following hosts.ini, I could manage the cluster after flashing Pi OS to each of the CM4s and booting off NVMe (I had to update their bootloaders to prefer NVMe):

[cluster]
node1.local
node2.local
node3.local
node4.local
node5.local
node6.local

[cluster:vars]
ansible_user=pi

Then I can run common commands with Ansible too, like:

# Ping all nodes.
ansible cluster -m ping

# Get temperature of each Pi.
ansible cluster -a "vcgencmd measure_temp"

# Shut down all the Pis.
ansible cluster -m shutdown -b
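If the temperatures need to be logged rather than just eyeballed, the ad-hoc output can be parsed with a few lines of Python. This is a minimal sketch that assumes Ansible's default ad-hoc output layout (a `host | STATUS | rc=N >>` header followed by the command output) and the `temp=NN.N'C` format printed by `vcgencmd measure_temp`. The function name and the sample text are hypothetical and are not measurements from this cluster.

```python
import re

# Header line Ansible prints before each host's command output (assumed layout).
HOST_LINE = re.compile(r"^(\S+) \| \w+ \| rc=\d+ >>$")
# Line printed by `vcgencmd measure_temp`, e.g. temp=47.2'C
TEMP_LINE = re.compile(r"temp=(\d+(?:\.\d+)?)'C")


def parse_temps(ansible_output: str) -> dict[str, float]:
    """Map each host name to the temperature reported under its header."""
    temps: dict[str, float] = {}
    host = None
    for line in ansible_output.splitlines():
        header = HOST_LINE.match(line.strip())
        if header:
            host = header.group(1)
            continue
        reading = TEMP_LINE.search(line)
        if reading and host is not None:
            temps[host] = float(reading.group(1))
    return temps


if __name__ == "__main__":
    # Illustrative sample only; not real readings.
    sample = (
        "node1.local | CHANGED | rc=0 >>\n"
        "temp=47.2'C\n"
        "node2.local | CHANGED | rc=0 >>\n"
        "temp=51.0'C\n"
    )
    print(parse_temps(sample))
```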

geerlingguy commented 1 year ago

First run, a bit un-optimized (been poring over HPL docs and blog posts to make sure I can get a good snapshot and try to generalize it for other clusters and core counts):

    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================

    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.

    The following parameter values will be used:

    N      :   51080
    NB     :     192
    PMAP   : Row-major process mapping
    P      :       4
    Q      :       6
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words

    --------------------------------------------------------------------------------

    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0

    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       51080   192     4     6            1541.70             5.7634e+01
    HPL_pdgesv() start time Wed Nov 16 11:35:23 2022

    HPL_pdgesv() end time   Wed Nov 16 12:01:05 2022

    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   6.99743054e-04 ...... PASSED
    ================================================================================

    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------

    End of Tests.
    ================================================================================

57.634 Gigaflops at 1.5 GHz default clock speeds. Not bad at all, and I haven't fully tuned the parameters for HPL.dat.
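As a sanity check, the Gflops figure can be recomputed from N and the wall time. HPL counts 2/3 N^3 + 3/2 N^2 floating-point operations for the solve and divides by the elapsed seconds. The sketch below applies that count to the two runs in this thread; the function name is just for illustration.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Gflops for an HPL run of order n that took `seconds` of wall time."""
    flops = (2.0 / 3.0) * n ** 3 + (3.0 / 2.0) * n ** 2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # (N, Time) pairs taken from the two result lines in this thread.
    for label, n, seconds in [
        ("M2 MacBook Air", 23314, 80.71),
        ("6-node CM4 cluster", 51080, 1541.70),
    ]:
        print(f"{label}: {hpl_gflops(n, seconds):.2f} Gflops")
```

Because the printed time is rounded to two decimals, the recomputed values can differ from HPL's own figure in the last digit or so.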

geerlingguy commented 1 year ago

1st run with Ns 51080 and NB 192: 57.634 Gflops
2nd run with Ns 44236 and NB 192: 55.698 Gflops
3rd run with Ns 52133 and NB 256 on 5 nodes (since I don't have 6x 8 GB CM4s): 50.629 Gflops
4th run with Ns 51080 and NB 256: 60.293 Gflops (40W - spiking to 41W)
5th run with Ns 51080 and NB 256 with 2.0 GHz overclock: 70.338 Gflops (51W)

For the overclock, the first time I ran it in the unmodified Super6c enclosure, I noticed power consumption was varying between 40 and 53W, with occasional spikes to 64W! I then checked if the Pis were throttling, and they definitely were, hitting 85°C quite frequently. So I popped off the top cover and put on a 140mm fan and the temps stayed down in the mid-50s for the final run.
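The 4th and 5th runs in the list above can be compared on efficiency as well as raw speed. This small sketch computes Gflops per watt and how much of the clock increase showed up as extra throughput. It uses 40W as the nominal draw for the 1.5 GHz run (ignoring the 41W spikes), which is a simplifying assumption, and the variable and function names are illustrative.

```python
# (Gflops, watts, clock in GHz) from the 4th and 5th runs in the list above.
STOCK = (60.293, 40.0, 1.5)
OVERCLOCK = (70.338, 51.0, 2.0)


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency of a run."""
    return gflops / watts


def scaling_efficiency(base: tuple, faster: tuple) -> float:
    """Observed speedup divided by the clock ratio (1.0 = perfect scaling)."""
    speedup = faster[0] / base[0]
    clock_ratio = faster[2] / base[2]
    return speedup / clock_ratio


if __name__ == "__main__":
    for name, run in [("1.5 GHz", STOCK), ("2.0 GHz", OVERCLOCK)]:
        print(f"{name}: {gflops_per_watt(run[0], run[1]):.3f} Gflops/W")
    print(f"scaling efficiency: {scaling_efficiency(STOCK, OVERCLOCK):.3f}")
```

On these numbers the overclock is faster but less efficient per watt, and the throughput gain is smaller than the clock gain.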

geerlingguy commented 1 year ago

I think I have the data I'm looking for ;)

I really really wish I had a 6th 8 GB CM4, but I haven't been able to pick one up in over a year, so oh well!

geerlingguy / deskpi-super6c-cluster

Benchmark 6-node Pi cluster with HPL / Linpack #4