Ampere Altra Developer Platform

geerlingguy commented 1 year ago

ampere-altra-radiator-water-cooling-cpu

Basic information

(This thing is... kinda more than a 'board' — but I still want data somewhere, and this is as good a place as any!)

Board URL (official): https://amperecomputing.com/systems/altra/kraken-comhpc-WS
Board purchased from: (delivered by Ampere)
Board purchase date: April 3, 2023
Board specs (as tested): 96-core / 64 GB RAM / 128 GB SSD
Board price (as tested): Approx $5000

Linux/system information

# output of `neofetch`
jgeerling@ampere-altra:~$ neofetch
            .-/+oossssoo+/-.               jgeerling@ampere-altra 
        `:+ssssssssssssssssss+:`           ---------------------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.2 LTS aarch64 
    .ossssssssssssssssssdMMMNysssso.       Host: Ampere Altra Developer Platform ES2 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.19.0-40-generic 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 5 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 1565 (dpkg), 11 (snap) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.16 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 1920x1080 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   Terminal: /dev/pts/1 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: (96) @ 2.800GHz 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   GPU: 0004:02:00.0 ASPEED Technology, Inc. ASPEED Graphics Family 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Memory: 2102MiB / 63897MiB 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+                             
   /ssssssssssshdmNNNNmyNMMMMhssssss/                              
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.

# output of `uname -a`
Linux ampere-altra 5.19.0-40-generic #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 31 16:02:33 UTC 2 aarch64 aarch64 aarch64 GNU/Linux

Benchmark results

CPU

Configured with 96 GB RAM (6 x 16GB DDR4 ECC Registered DIMMs):

Geekbench: (800 single / 35818 multi - https://browser.geekbench.com/v5/cpu/21323770)
1,188.3 Gflops (geerlingguy/top500-benchmark - result)

Power

Idle power draw (at wall): 64.5 W (85W with 96 GB RAM)
Maximum simulated power draw (stress-ng --matrix 0): 220 W (242W with 96 GB RAM)
During Geekbench multicore benchmark: 156 W (178W with 96 GB RAM)
During top500 HPL benchmark: 296 W (4.01 Gflops/W)

Disk

Transcend 128GB PCIe Gen 3 x4 NVMe SSD (TS128GMTE652T)

Benchmark	Result
fio 1M sequential read	1245 MB/s
iozone 1M random read	1058 MB/s
iozone 1M random write	665 MB/s
iozone 4K random read	72.99 MB/s
iozone 4K random write	246.88 MB/s

curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash

Run benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading. Download the script with curl -o disk-benchmark.sh [URL_HERE] and run sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh (assuming the device is sda).

Also consider running PiBenchmarks.com script.

PiBenchmarks.com result: TODO - should be on https://pibenchmarks.com/latest/ soon

     Category                  Test                      Result      
HDParm                    Disk Read                 1533.06 MB/s             
HDParm                    Cached Disk Read          776.86 MB/s              
DD                        Disk Write                407 MB/s                 
FIO                       4k random read            94377 IOPS (377511 KB/s) 
FIO                       4k random write           74202 IOPS (296811 KB/s) 
IOZone                    4k read                   243709 KB/s              
IOZone                    4k write                  198612 KB/s              
IOZone                    4k random read            70575 KB/s               
IOZone                    4k random write           231884 KB/s              

                          Score: 45797

Network

(Everything runs as expected... this thing's a bonafide server!)

GPU

TODO: Haven't determined standardized benchmark yet. See Issue #2.

Memory

tinymembench results:

Click to expand memory benchmark result

``` tinymembench v0.4.10 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 9424.0 MB/s C copy backwards (32 byte blocks) : 9387.8 MB/s C copy backwards (64 byte blocks) : 9390.8 MB/s C copy : 9366.1 MB/s C copy prefetched (32 bytes step) : 9984.4 MB/s C copy prefetched (64 bytes step) : 9984.1 MB/s C 2-pass copy : 6391.4 MB/s C 2-pass copy prefetched (32 bytes step) : 7237.8 MB/s C 2-pass copy prefetched (64 bytes step) : 7489.6 MB/s C fill : 43884.4 MB/s C fill (shuffle within 16 byte blocks) : 43885.4 MB/s C fill (shuffle within 32 byte blocks) : 43884.2 MB/s C fill (shuffle within 64 byte blocks) : 43877.5 MB/s NEON 64x2 COPY : 9961.9 MB/s NEON 64x2x4 COPY : 10091.6 MB/s NEON 64x1x4_x2 COPY : 8171.5 MB/s NEON 64x2 COPY prefetch x2 : 11822.9 MB/s NEON 64x2x4 COPY prefetch x1 : 12123.8 MB/s NEON 64x2 COPY prefetch x1 : 11836.5 MB/s NEON 64x2x4 COPY prefetch x1 : 12122.3 MB/s --- standard memcpy : 9894.0 MB/s standard memset : 44745.2 MB/s --- NEON LDP/STP copy : 9958.0 MB/s NEON LDP/STP copy pldl2strm (32 bytes step) : 11415.6 MB/s NEON LDP/STP copy pldl2strm (64 bytes step) : 11420.5 MB/s NEON LDP/STP copy pldl1keep (32 bytes step) : 11475.2 MB/s NEON LDP/STP copy pldl1keep (64 bytes step) : 11452.9 MB/s NEON LD1/ST1 copy : 10094.8 MB/s NEON STP fill : 44744.7 MB/s NEON STNP fill : 44745.2 MB/s ARM LDP/STP copy : 10136.4 MB/s ARM STP fill : 44731.7 MB/s ARM STNP fill : 44730.0 MB/s ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read, [MADV_NOHUGEPAGE] 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 0.0 ns / 0.0 ns 131072 : 1.3 ns / 1.8 ns 262144 : 2.3 ns / 2.9 ns 524288 : 3.2 ns / 3.9 ns 1048576 : 3.6 ns / 4.2 ns 2097152 : 22.9 ns / 33.0 ns 4194304 : 32.6 ns / 40.9 ns 8388608 : 38.1 ns / 43.5 ns 16777216 : 43.2 ns / 48.6 ns 33554432 : 86.2 ns / 112.2 ns 67108864 : 109.3 ns / 135.2 ns block size : single random read / dual random read, [MADV_HUGEPAGE] 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 0.0 ns / 0.0 ns 131072 : 1.3 ns / 1.8 ns 262144 : 1.9 ns / 2.3 ns 524288 : 2.2 ns / 2.5 ns 1048576 : 2.6 ns / 2.8 ns 2097152 : 21.6 ns / 31.6 ns 4194304 : 31.1 ns / 39.4 ns 8388608 : 35.8 ns / 41.7 ns 16777216 : 38.5 ns / 43.0 ns 33554432 : 79.9 ns / 104.9 ns 67108864 : 101.1 ns / 125.4 ns ```

geerlingguy commented 9 months ago

New result is 1.188 Tflops at 296W, see: https://github.com/geerlingguy/top500-benchmark/issues/10

geerlingguy commented 9 months ago

I've also just upgraded the system to some new Samsung 64GB RAM sticks (for a total of 384 GB of RAM, sheesh), and here's the tinymembench result:

Click to show tinymembench results

``` tinymembench v0.4.10 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 11398.5 MB/s C copy backwards (32 byte blocks) : 12094.4 MB/s (2.7%) C copy backwards (64 byte blocks) : 12098.9 MB/s C copy : 12078.2 MB/s C copy prefetched (32 bytes step) : 12680.1 MB/s C copy prefetched (64 bytes step) : 12693.2 MB/s C 2-pass copy : 7745.6 MB/s C 2-pass copy prefetched (32 bytes step) : 8477.2 MB/s C 2-pass copy prefetched (64 bytes step) : 8861.6 MB/s C fill : 43883.3 MB/s C fill (shuffle within 16 byte blocks) : 43885.8 MB/s C fill (shuffle within 32 byte blocks) : 43886.0 MB/s C fill (shuffle within 64 byte blocks) : 43882.4 MB/s NEON 64x2 COPY : 13111.5 MB/s NEON 64x2x4 COPY : 13275.7 MB/s NEON 64x1x4_x2 COPY : 6742.8 MB/s (0.7%) NEON 64x2 COPY prefetch x2 : 15424.3 MB/s NEON 64x2x4 COPY prefetch x1 : 15672.9 MB/s NEON 64x2 COPY prefetch x1 : 15465.1 MB/s NEON 64x2x4 COPY prefetch x1 : 15671.8 MB/s --- standard memcpy : 13003.2 MB/s standard memset : 44746.4 MB/s --- NEON LDP/STP copy : 13108.3 MB/s NEON LDP/STP copy pldl2strm (32 bytes step) : 14612.1 MB/s NEON LDP/STP copy pldl2strm (64 bytes step) : 14662.0 MB/s NEON LDP/STP copy pldl1keep (32 bytes step) : 14886.2 MB/s NEON LDP/STP copy pldl1keep (64 bytes step) : 14860.1 MB/s NEON LD1/ST1 copy : 13285.5 MB/s NEON STP fill : 44746.8 MB/s NEON STNP fill : 44746.4 MB/s ARM LDP/STP copy : 13305.0 MB/s ARM STP fill : 44730.7 MB/s ARM STNP fill : 44726.3 MB/s ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read, [MADV_NOHUGEPAGE] 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 0.0 ns / 0.0 ns 131072 : 1.3 ns / 1.8 ns 262144 : 2.4 ns / 3.0 ns 524288 : 3.4 ns / 3.9 ns 1048576 : 7.6 ns / 10.9 ns 2097152 : 18.6 ns / 26.1 ns 4194304 : 25.6 ns / 32.3 ns 8388608 : 31.3 ns / 36.9 ns 16777216 : 41.4 ns / 51.3 ns 33554432 : 72.6 ns / 94.6 ns 67108864 : 93.0 ns / 113.4 ns block size : single random read / dual random read, [MADV_HUGEPAGE] 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 0.0 ns / 0.0 ns 131072 : 1.3 ns / 1.8 ns 262144 : 1.9 ns / 2.3 ns 524288 : 2.3 ns / 2.5 ns 1048576 : 2.5 ns / 2.7 ns 2097152 : 17.0 ns / 24.4 ns 4194304 : 24.6 ns / 30.5 ns 8388608 : 27.9 ns / 32.4 ns 16777216 : 29.8 ns / 33.7 ns 33554432 : 65.9 ns / 86.9 ns 67108864 : 84.7 ns / 103.9 ns ```

I also re-ran HPL with N=200000, and got 1265.5 Gflops (so almost 1.3 Tflops... still a couple hundred under what Ampere seems to be able to get!):

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  200000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      200000   256     8    16            4214.60             1.2655e+03
HPL_pdgesv() start time Fri Sep 15 18:56:44 2023

HPL_pdgesv() end time   Fri Sep 15 20:06:59 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.13196075e-02 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

zjlywjh001 commented 8 months ago

Hi, I'm currently test my AADK with Q80-30, but strangely my score is very low: https://browser.geekbench.com/v5/cpu/21893639. I use 4-channel 32GB 3200 DRAM, total 128GB. But seen a very huge differs from others'. Does anybody know why?

ThomasKaiser commented 8 months ago

I'm currently test my AADK with Q80-30, but strangely my score is very low

Your CPU cores only clock at 2.3 GHz, see the "processor_frequency" node: https://browser.geekbench.com/v5/cpu/21893639.gb5 (you need an user account to see the .gb5 and .gb6 raw data files)

While Jeff's testing with his 2.8 GHz SKU the cores really clocked with those 2.8 GHz: https://browser.geekbench.com/v5/cpu/21323770.gb5

I would always recommend running Geekbench from inside sbc-bench -G since my tool not only measures clockspeeds but also monitors for e.g. swap (which for obvious reasons can also ruin benchmark scores).

zjlywjh001 commented 8 months ago

I'm currently test my AADK with Q80-30, but strangely my score is very low

Your CPU cores only clock at 2.3 GHz, see the "processor_frequency" node: https://browser.geekbench.com/v5/cpu/21893639.gb5 (you need an user account to see the .gb5 and .gb6 raw data files)

While Jeff's testing with his 2.8 GHz SKU the cores really clocked with those 2.8 GHz: https://browser.geekbench.com/v5/cpu/21323770.gb5

I would always recommend running Geekbench from inside sbc-bench -G since my tool not only measures clockspeeds but also monitors for e.g. swap (which for obvious reasons can also ruin benchmark scores).

I still don't know why my CPU only runs at 2.3GHz frequency. In system it reports cpuinfo_max_freq is 3000000 and even if the scaling_cur_freq is set to 3000000, the cpuinfo_cur_freq still only 2300000 . Since My AADK is previously installed a Q32-17(stock shipped with my aadk), it is possible this module is hardware configured for fixed frequency?

geerlingguy / sbc-reviews