geerlingguy opened 1 year ago
New result is 1.188 Tflops at 296W, see: https://github.com/geerlingguy/top500-benchmark/issues/10
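The Gflops/W figure reported below follows directly from those two numbers; a quick sanity check (the helper name is mine):

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as reported on the Green500 list: Gflops / W."""
    return gflops / watts

# 1.188 Tflops = 1188 Gflops at 296 W, from the run above
print(round(gflops_per_watt(1188.0, 296.0), 2))  # -> 4.01
```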
I've also just upgraded the system to some new Samsung 64GB RAM sticks (for a total of 384 GB of RAM, sheesh), and here's the tinymembench result:
I also re-ran HPL with N=200000, and got 1265.5 Gflops (so almost 1.3 Tflops... still a couple hundred Gflops under what Ampere seems to be able to get!):
root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 200000
NB : 256
PMAP : Row-major process mapping
P : 8
Q : 16
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 200000 256 8 16 4214.60 1.2655e+03
HPL_pdgesv() start time Fri Sep 15 18:56:44 2023
HPL_pdgesv() end time Fri Sep 15 20:06:59 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.13196075e-02 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
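For context, the HPL result can be compared against theoretical peak. This sketch assumes 128 cores at 3.0 GHz and 8 FP64 FLOPs per cycle per Neoverse N1 core (two 128-bit FMA pipes); those per-core numbers are my assumptions, not from this thread:

```python
def hpl_efficiency(measured_gflops: float, cores: int, ghz: float,
                   flops_per_cycle: int = 8) -> float:
    """Fraction of theoretical FP64 peak achieved by an HPL run."""
    peak_gflops = cores * ghz * flops_per_cycle
    return measured_gflops / peak_gflops

# 1265.5 Gflops measured on 128 ranks; 3.0 GHz / 8 FLOPs-per-cycle assumed
eff = hpl_efficiency(1265.5, 128, 3.0)
print(f"peak ~{128 * 3.0 * 8:.0f} Gflops, efficiency ~{eff:.0%}")
```

At those assumptions peak is ~3072 Gflops, so 1265.5 Gflops is roughly 41% efficiency, which is why a couple hundred more Gflops still looks plausible with tuning.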
Hi, I'm currently testing my AADK with a Q80-30, but strangely my score is very low: https://browser.geekbench.com/v5/cpu/21893639. I'm using 4-channel 32GB 3200 DRAM, 128GB total, but I see a huge difference from others' results. Does anybody know why?
I'm currently testing my AADK with a Q80-30, but strangely my score is very low
Your CPU cores only clock at 2.3 GHz; see the "processor_frequency"
node: https://browser.geekbench.com/v5/cpu/21893639.gb5 (you need a user account to see the .gb5
and .gb6
raw data files).
When Jeff tested his 2.8 GHz SKU, the cores really did clock at 2.8 GHz: https://browser.geekbench.com/v5/cpu/21323770.gb5
I would always recommend running Geekbench from inside sbc-bench -G,
since my tool not only measures clockspeeds but also monitors for e.g. swapping (which for obvious reasons can also ruin benchmark scores).
I still don't know why my CPU only runs at 2.3 GHz. The system reports cpuinfo_max_freq
as 3000000, and even when scaling_cur_freq
shows 3000000, cpuinfo_cur_freq is still only 2300000. Since my AADK originally shipped with a Q32-17, is it possible this module is hardware-configured for a fixed frequency?
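One way to see the requested-vs-actual gap in one place is to read both cpufreq nodes for every policy. A minimal sketch, assuming the standard sysfs cpufreq layout (helper names are mine; cpuinfo_cur_freq usually needs root):

```python
from pathlib import Path

def khz_to_ghz(khz: int) -> float:
    """sysfs cpufreq values are reported in kHz."""
    return khz / 1_000_000

def read_khz(path: Path):
    """Return the node's value in kHz, or None if it's absent/unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

for policy in sorted(Path("/sys/devices/system/cpu/cpufreq").glob("policy*")):
    requested = read_khz(policy / "scaling_cur_freq")  # what the governor asks for
    actual = read_khz(policy / "cpuinfo_cur_freq")     # what the hardware reports
    print(policy.name, requested, actual)
```

If every policy shows scaling_cur_freq at 3000000 but cpuinfo_cur_freq pinned at 2300000, that points at a hardware/firmware cap rather than a governor setting.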
Basic information
(This thing is... kinda more than a 'board' — but I still want data somewhere, and this is as good a place as any!)
Linux/system information
Benchmark results
CPU
Configured with 96 GB RAM (6 x 16GB DDR4 ECC Registered DIMMs):
Power
- Power draw (stress-ng --matrix 0): 220 W (242 W with 96 GB RAM)
- top500 HPL benchmark: 296 W (4.01 Gflops/W)
Disk
Transcend 128GB PCIe Gen 3 x4 NVMe SSD (TS128GMTE652T)
Run benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading:
curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash
Or download the script with
curl -o disk-benchmark.sh [URL_HERE]
and run
sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh
(assuming the device is sda).
Also consider running the PiBenchmarks.com script.
PiBenchmarks.com result: TODO - should be on https://pibenchmarks.com/latest/ soon
Network
(Everything runs as expected... this thing's a bonafide server!)
GPU
Memory
tinymembench results:
```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                              :   9424.0 MB/s
 C copy backwards (32 byte blocks)             :   9387.8 MB/s
 C copy backwards (64 byte blocks)             :   9390.8 MB/s
 C copy                                        :   9366.1 MB/s
 C copy prefetched (32 bytes step)             :   9984.4 MB/s
 C copy prefetched (64 bytes step)             :   9984.1 MB/s
 C 2-pass copy                                 :   6391.4 MB/s
 C 2-pass copy prefetched (32 bytes step)      :   7237.8 MB/s
 C 2-pass copy prefetched (64 bytes step)      :   7489.6 MB/s
 C fill                                        :  43884.4 MB/s
 C fill (shuffle within 16 byte blocks)        :  43885.4 MB/s
 C fill (shuffle within 32 byte blocks)        :  43884.2 MB/s
 C fill (shuffle within 64 byte blocks)        :  43877.5 MB/s
 NEON 64x2 COPY                                :   9961.9 MB/s
 NEON 64x2x4 COPY                              :  10091.6 MB/s
 NEON 64x1x4_x2 COPY                           :   8171.5 MB/s
 NEON 64x2 COPY prefetch x2                    :  11822.9 MB/s
 NEON 64x2x4 COPY prefetch x1                  :  12123.8 MB/s
 NEON 64x2 COPY prefetch x1                    :  11836.5 MB/s
 NEON 64x2x4 COPY prefetch x1                  :  12122.3 MB/s
 ---
 standard memcpy                               :   9894.0 MB/s
 standard memset                               :  44745.2 MB/s
 ---
 NEON LDP/STP copy                             :   9958.0 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)   :  11415.6 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)   :  11420.5 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)   :  11475.2 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)   :  11452.9 MB/s
 NEON LD1/ST1 copy                             :  10094.8 MB/s
 NEON STP fill                                 :  44744.7 MB/s
 NEON STNP fill                                :  44745.2 MB/s
 ARM LDP/STP copy                              :  10136.4 MB/s
 ARM STP fill                                  :  44731.7 MB/s
 ARM STNP fill                                 :  44730.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    1.3 ns          /     1.8 ns
    262144 :    2.3 ns          /     2.9 ns
    524288 :    3.2 ns          /     3.9 ns
   1048576 :    3.6 ns          /     4.2 ns
   2097152 :   22.9 ns          /    33.0 ns
   4194304 :   32.6 ns          /    40.9 ns
   8388608 :   38.1 ns          /    43.5 ns
  16777216 :   43.2 ns          /    48.6 ns
  33554432 :   86.2 ns          /   112.2 ns
  67108864 :  109.3 ns          /   135.2 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    1.3 ns          /     1.8 ns
    262144 :    1.9 ns          /     2.3 ns
    524288 :    2.2 ns          /     2.5 ns
   1048576 :    2.6 ns          /     2.8 ns
   2097152 :   21.6 ns          /    31.6 ns
   4194304 :   31.1 ns          /    39.4 ns
   8388608 :   35.8 ns          /    41.7 ns
  16777216 :   38.5 ns          /    43.0 ns
  33554432 :   79.9 ns          /   104.9 ns
  67108864 :  101.1 ns          /   125.4 ns
```
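For scale: those single-threaded tinymembench copy numbers (~10 GB/s memcpy) sit well below the platform's aggregate bandwidth. Assuming 8 channels of DDR4-3200 with a 64-bit bus per channel (my assumption for this Altra Max board, not measured here), the theoretical peak works out as:

```python
def ddr4_peak_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical DDR bandwidth: transfers/s x 8-byte bus width per channel."""
    return channels * mt_per_s * 1e6 * bus_bytes / 1e9

# 8 channels x 3200 MT/s x 8 bytes -> GB/s (assumed configuration)
print(ddr4_peak_gbs(8, 3200))  # -> 204.8
```

A single core can't come close to saturating that, which is why tinymembench (single-threaded) and HPL (128 ranks) stress memory so differently.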