Benchmark Ampere Altra Developer Platform - 96 core 2.8 GHz ARM64

geerlingguy / top500-benchmark

Automated Top500 benchmark for clusters or single nodes.

MIT License

Benchmark Ampere Altra Developer Platform - 96 core 2.8 GHz ARM64 #10

Closed geerlingguy closed 1 year ago

geerlingguy commented 1 year ago

As the title says...

geerlingguy commented 1 year ago

With Ps: 1 and Qs: 96:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   70717
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :      96
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       70717   256     1    96             625.64             3.7685e+02
HPL_pdgesv() start time Mon Apr 17 09:47:19 2023

HPL_pdgesv() end time   Mon Apr 17 09:57:44 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   7.57859825e-04 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Used 220W average (with a few spikes to 236W, but only briefly, and some dips down to 216W). 1.71 Gflops/W
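The Gflops/W figure is just the reported HPL result divided by the average wall power. A minimal sketch of that arithmetic, in case anyone wants to recompute it for their own runs (the function name is made up for illustration, it is not part of the top500-benchmark repo):

```python
def gflops_per_watt(gflops: float, avg_watts: float) -> float:
    """Return energy efficiency as Gflops per watt of average wall power."""
    if avg_watts <= 0:
        raise ValueError("average power must be positive")
    return gflops / avg_watts


# Values from the 1 x 96 run above: 3.7685e+02 Gflops at a 220 W average.
print(f"{gflops_per_watt(376.85, 220):.2f} Gflops/W")
```

The result is only as good as the power average, so spikes and dips like the ones noted above matter if the measurement window is short.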

geerlingguy commented 1 year ago

With Ps: 4 and Qs: 24 (since I believe the die is subdivided into 4 quadrants):

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   70717
NB     :     256
PMAP   : Row-major process mapping
P      :       4
Q      :      24
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       70717   256     4    24             586.68             4.0188e+02
HPL_pdgesv() start time Mon Apr 17 10:14:23 2023

HPL_pdgesv() end time   Mon Apr 17 10:24:10 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   6.70141896e-04 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Used 200W average (with a few spikes to 215W, and some dips down to 196W). 2.01 Gflops/W
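Both runs so far use 96 MPI ranks (one per core), so Ps x Qs has to equal 96. The sketch below just enumerates the possible grids with P <= Q; it is a hypothetical helper, not something the benchmark playbook provides, and which grid is fastest still has to be measured on the hardware.

```python
def grid_candidates(num_ranks: int) -> list[tuple[int, int]]:
    """List every (P, Q) pair with P * Q == num_ranks and P <= Q."""
    pairs = []
    p = 1
    while p * p <= num_ranks:
        if num_ranks % p == 0:
            pairs.append((p, num_ranks // p))
        p += 1
    return pairs


for p, q in grid_candidates(96):
    print(f"Ps: {p:2d}  Qs: {q:2d}")
```

HPL's tuning notes generally favour grids that are close to square with P no larger than Q, which is consistent with 4 x 24 beating 1 x 96 here, but that is a guideline rather than a guarantee.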

geerlingguy commented 1 year ago

After some suggestions from @rbapat-ampere over in https://github.com/geerlingguy/sbc-reviews/issues/19#issuecomment-1549607679, I'm going to be re-testing with a few different parameters, to get a feel for how things change:

Test matrix:

| RAM | Ps / Qs | BLIS library | Benchmark Result | Power Consumption |
| --- | --- | --- | --- | --- |
| 64 GB (4x 16 GB) | `4` / `24` | default | 401.88 Gflops | 202W |
| 64 GB (4x 16 GB) | `8` / `12` | default | 394.15 Gflops | 200W |
| 96 GB (6x 16 GB) | `4` / `24` | default | 600.63 Gflops | 235W |
| 96 GB (6x 16 GB) | `8` / `12` | default | 582.90 Gflops | 232W |
| 96 GB (6x 16 GB) | `8` / `12` | ampere-optimized | 985.02 Gflops | 270W |

Note: For power consumption, I compared a Sonoff S31 power outlet adapter against a Kill-A-Watt power meter, re-running all tests on both. They agreed to within 2W in spot measurements, and to within 1W in averages over a one-minute period.
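To make the test matrix easier to compare at a glance, here is a small sketch that expresses each run as a speedup over the first row. The numbers are copied from the table above; the `RUNS` list and `speedups` helper are illustrative names only.

```python
# Rows copied from the test matrix above: (RAM, Ps/Qs, BLIS build, Gflops).
RUNS = [
    ("64 GB", "4/24", "default", 401.88),
    ("64 GB", "8/12", "default", 394.15),
    ("96 GB", "4/24", "default", 600.63),
    ("96 GB", "8/12", "default", 582.90),
    ("96 GB", "8/12", "ampere-optimized", 985.02),
]


def speedups(runs, baseline_index=0):
    """Return each run's Gflops divided by the baseline run's Gflops."""
    base = runs[baseline_index][3]
    return [gflops / base for *_, gflops in runs]


for (ram, grid, blis, gflops), ratio in zip(RUNS, speedups(RUNS)):
    print(f"{ram:6s} {grid:5s} {blis:17s} {gflops:8.2f} Gflops  {ratio:.2f}x")
```

Read this way, going from four to six DIMMs and switching to the Ampere-optimized BLIS each move the result far more than changing the process grid does.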

geerlingguy commented 1 year ago

Trying to get the Ampere-optimized HPL run to work, but currently running into issues: https://github.com/AmpereComputing/HPL-on-Ampere-Altra/issues/3

I was originally going to test things by swapping their library into my install, but decided instead to follow the docs in that repo end to end.

geerlingguy commented 1 year ago

Result with the ampere-optimized setup following these instructions:

root@adlink-ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun -np 96 --bind-to core --map-by core --allow-run-as-root ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  100000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8    12             676.82             9.8502e+02
HPL_pdgesv() start time Thu Jun 15 16:18:54 2023

HPL_pdgesv() end time   Thu Jun 15 16:30:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.01180641e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Power measured at average of 270W (spiking to about 272W, dipping to 245W). Efficiency: 3.64 Gflops/W
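For anyone wondering where the Gflops column comes from: HPL counts (2/3)N^3 + (3/2)N^2 floating-point operations for the solve and divides by the wall time. A quick sketch to sanity-check a reported result from N and Time (the function name is my own, and the Time value is rounded in the log, so expect small differences in the last digit):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Gflops for an HPL solve of order n that took `seconds` of wall time."""
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


# N and Time from the ampere-optimized run above.
print(f"{hpl_gflops(100_000, 676.82):.2f} Gflops")
```

Because the operation count grows with N cubed while the matrix only grows with N squared, larger N usually means a higher sustained rate, as long as the matrix still fits in RAM.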

geerlingguy commented 1 year ago

Just noting that my M1 Max results may also improve with a native BLAS library; see, for example: https://github.com/JuliaLang/julia/issues/42312

rbapat-ampere commented 1 year ago

@geerlingguy Glad to see the jump in scores. But we are still leaving some performance on the table. On the Ampere Altra Developer Platform I was able to get 1253 GFlops for HPL using the optimized BLIS. Here's a screen cap from the document:

geerlingguy commented 1 year ago

@rbapat-ampere - Interesting... I ran with 100000 Ns, and 8/12 P/Q, and that was how I got the 985.02 Gflops. Can you think of anything else I might've missed? I followed the instructions from here explicitly, and ran them all on a brand new Ubuntu 22.04 Server install.

rbapat-ampere commented 1 year ago

@geerlingguy I rebuilt and reran HPL + Optimized BLIS from scratch (using the instructions) on an AADP with a fresh Ubuntu install and got very similar scores to my previous results. Here are my current scores:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  105000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      105000   256     8    12             631.81             1.2215e+03
HPL_pdgesv() start time Fri Jun 16 14:10:59 2023

HPL_pdgesv() end time   Fri Jun 16 14:21:31 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.00850780e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Testbench Details:

OS: 22.04.2 LTS (Jammy Jellyfish), Desktop Image
Kernel: 5.19.0-42-generic
GCC Toolchain: 12.3.0
openmpi: 4.1.4
Memory used during the test: 89 gig
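The memory figure lines up with the problem size: HPL stores an N x N matrix of 8-byte doubles, so N = 105000 needs about 88.2 GB for the matrix alone. Below is a rough sizing sketch. The 80% fraction is a commonly quoted rule of thumb rather than anything from this thread, RAM is treated as decimal gigabytes for simplicity, and `suggest_n` is a hypothetical helper.

```python
import math

BYTES_PER_DOUBLE = 8


def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed to hold the N x N double-precision coefficient matrix."""
    return n * n * BYTES_PER_DOUBLE


def suggest_n(ram_bytes: int, fraction: float = 0.8, nb: int = 256) -> int:
    """Largest N, rounded down to a multiple of NB, whose matrix fits in
    `fraction` of RAM. The default fraction is an assumption, not a rule."""
    usable_doubles = int(ram_bytes * fraction) // BYTES_PER_DOUBLE
    n = math.isqrt(usable_doubles)
    return (n // nb) * nb


print(hpl_matrix_bytes(105_000) / 1e9)  # matrix size in decimal GB
print(suggest_n(96 * 10**9))            # a conservative N for 96 GB of RAM
```

N = 105000 is more aggressive than the 80% guideline would suggest for 96 GB, which is worth keeping in mind if a run starts swapping.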

geerlingguy commented 1 year ago

@rbapat-ampere - I shall reformat and run it again :)

Can you also confirm what RAM layout you're using? Is it 6x 16 GB sticks, or something else? That seems to have an outsize effect on the results.

geerlingguy commented 1 year ago

It seems like the RAM vendor is the only major difference—I'm running industrial-type Transcend RAM, and @rbapat-ampere is running Samsung... I may need to change vendors and see if that gets our numbers more in line (stranger things have happened!).

I'm also planning to re-test on a 128-core CPU soon...

geerlingguy commented 1 year ago

New result is 1188.3 Gflops at 296W, for 4.01 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  105000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      105000   256     8    12             649.46             1.1883e+03
HPL_pdgesv() start time Mon Sep 11 20:21:22 2023

HPL_pdgesv() end time   Mon Sep 11 20:32:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.00850780e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
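Since every run in this thread ends with the same one-line summary (T/V, N, NB, P, Q, Time, Gflops), it is easy to pull the numbers out of saved logs. A minimal sketch, assuming the result line looks like the ones pasted above; the regex and function name are my own and not part of HPL or the benchmark repo:

```python
import re

# Matches lines such as:
# WR11C2R4      105000   256     8    12             649.46             1.1883e+03
RESULT_RE = re.compile(
    r"^(?P<tv>W\w+)\s+(?P<n>\d+)\s+(?P<nb>\d+)\s+(?P<p>\d+)\s+(?P<q>\d+)"
    r"\s+(?P<time>[\d.]+)\s+(?P<gflops>[\d.]+(?:[eE][+-]?\d+)?)\s*$"
)


def parse_hpl_result(line: str) -> dict:
    """Parse one HPL result line into a dict of typed fields."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an HPL result line: {line!r}")
    return {
        "tv": match["tv"],
        "n": int(match["n"]),
        "nb": int(match["nb"]),
        "p": int(match["p"]),
        "q": int(match["q"]),
        "time_s": float(match["time"]),
        "gflops": float(match["gflops"]),
    }


line = "WR11C2R4      105000   256     8    12             649.46             1.1883e+03"
print(parse_hpl_result(line))
```

Lines that do not match, such as the banner and the residual check, raise `ValueError`, so a caller scanning a whole log would need to catch that and skip them.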

geerlingguy commented 1 year ago

Closing this as we have a result!