geerlingguy / sbc-reviews

Jeff Geerling's SBC review data - Raspberry Pi, Radxa, Orange Pi, etc.
MIT License
600 stars · 13 forks

Ampere Altra Developer Platform #19

Open geerlingguy opened 1 year ago

geerlingguy commented 1 year ago

(Image: ampere-altra-radiator-water-cooling-cpu)

Basic information

(This thing is... kinda more than a 'board' — but I still want data somewhere, and this is as good a place as any!)

Linux/system information

# output of `neofetch`
jgeerling@ampere-altra:~$ neofetch
            .-/+oossssoo+/-.               jgeerling@ampere-altra 
        `:+ssssssssssssssssss+:`           ---------------------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.2 LTS aarch64 
    .ossssssssssssssssssdMMMNysssso.       Host: Ampere Altra Developer Platform ES2 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.19.0-40-generic 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 5 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 1565 (dpkg), 11 (snap) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.16 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 1920x1080 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   Terminal: /dev/pts/1 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: (96) @ 2.800GHz 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   GPU: 0004:02:00.0 ASPEED Technology, Inc. ASPEED Graphics Family 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Memory: 2102MiB / 63897MiB 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+                             
   /ssssssssssshdmNNNNmyNMMMMhssssss/                              
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.

# output of `uname -a`
Linux ampere-altra 5.19.0-40-generic #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 31 16:02:33 UTC 2 aarch64 aarch64 aarch64 GNU/Linux

Benchmark results

CPU

Configured with 96 GB RAM (6 x 16GB DDR4 ECC Registered DIMMs):

Power

Disk

Transcend 128GB PCIe Gen 3 x4 NVMe SSD (TS128GMTE652T)

Benchmark                  Result
fio 1M sequential read     1245 MB/s
iozone 1M random read      1058 MB/s
iozone 1M random write     665 MB/s
iozone 4K random read      72.99 MB/s
iozone 4K random write     246.88 MB/s

curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash

Run the benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add the results under an additional heading. Download the script with curl -o disk-benchmark.sh [URL_HERE] and run sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh (assuming the device is sda).

Also consider running the PiBenchmarks.com script.

PiBenchmarks.com result: TODO - should be on https://pibenchmarks.com/latest/ soon

     Category                  Test                      Result      
HDParm                    Disk Read                 1533.06 MB/s             
HDParm                    Cached Disk Read          776.86 MB/s              
DD                        Disk Write                407 MB/s                 
FIO                       4k random read            94377 IOPS (377511 KB/s) 
FIO                       4k random write           74202 IOPS (296811 KB/s) 
IOZone                    4k read                   243709 KB/s              
IOZone                    4k write                  198612 KB/s              
IOZone                    4k random read            70575 KB/s               
IOZone                    4k random write           231884 KB/s              

                          Score: 45797  

Network

(Everything runs as expected... this thing's a bona fide server!)

GPU

Memory

tinymembench results:

Click to expand memory benchmark result

```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   9424.0 MB/s
 C copy backwards (32 byte blocks)                    :   9387.8 MB/s
 C copy backwards (64 byte blocks)                    :   9390.8 MB/s
 C copy                                               :   9366.1 MB/s
 C copy prefetched (32 bytes step)                    :   9984.4 MB/s
 C copy prefetched (64 bytes step)                    :   9984.1 MB/s
 C 2-pass copy                                        :   6391.4 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   7237.8 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   7489.6 MB/s
 C fill                                               :  43884.4 MB/s
 C fill (shuffle within 16 byte blocks)               :  43885.4 MB/s
 C fill (shuffle within 32 byte blocks)               :  43884.2 MB/s
 C fill (shuffle within 64 byte blocks)               :  43877.5 MB/s
 NEON 64x2 COPY                                       :   9961.9 MB/s
 NEON 64x2x4 COPY                                     :  10091.6 MB/s
 NEON 64x1x4_x2 COPY                                  :   8171.5 MB/s
 NEON 64x2 COPY prefetch x2                           :  11822.9 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  12123.8 MB/s
 NEON 64x2 COPY prefetch x1                           :  11836.5 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  12122.3 MB/s
 ---
 standard memcpy                                      :   9894.0 MB/s
 standard memset                                      :  44745.2 MB/s
 ---
 NEON LDP/STP copy                                    :   9958.0 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  11415.6 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  11420.5 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  11475.2 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  11452.9 MB/s
 NEON LD1/ST1 copy                                    :  10094.8 MB/s
 NEON STP fill                                        :  44744.7 MB/s
 NEON STNP fill                                       :  44745.2 MB/s
 ARM LDP/STP copy                                     :  10136.4 MB/s
 ARM STP fill                                         :  44731.7 MB/s
 ARM STNP fill                                        :  44730.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers  ==
== of different sizes. The larger is the buffer, the more significant  ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM     ==
== accesses. For extremely large buffer sizes we are expecting to see  ==
== page table walk with several requests to SDRAM for almost every     ==
== memory access (though 64MiB is not nearly large enough to           ==
== experience this effect to its fullest).                             ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    1.3 ns          /     1.8 ns
    262144 :    2.3 ns          /     2.9 ns
    524288 :    3.2 ns          /     3.9 ns
   1048576 :    3.6 ns          /     4.2 ns
   2097152 :   22.9 ns          /    33.0 ns
   4194304 :   32.6 ns          /    40.9 ns
   8388608 :   38.1 ns          /    43.5 ns
  16777216 :   43.2 ns          /    48.6 ns
  33554432 :   86.2 ns          /   112.2 ns
  67108864 :  109.3 ns          /   135.2 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    1.3 ns          /     1.8 ns
    262144 :    1.9 ns          /     2.3 ns
    524288 :    2.2 ns          /     2.5 ns
   1048576 :    2.6 ns          /     2.8 ns
   2097152 :   21.6 ns          /    31.6 ns
   4194304 :   31.1 ns          /    39.4 ns
   8388608 :   35.8 ns          /    41.7 ns
  16777216 :   38.5 ns          /    43.0 ns
  33554432 :   79.9 ns          /   104.9 ns
  67108864 :  101.1 ns          /   125.4 ns
```
geerlingguy commented 1 year ago

Also tested with Phoronix:

geerlingguy commented 1 year ago

Tried to install Windows 11 and it's been quite the saga. Haven't been able to get an ISO that boots yet:

I know at least Windows 11 Pro can be run on these chips:

I hate knowing something is possible but not being able to do it lol.

Edit: After a ton of searching I also found this article: Ampere ALTRA – Industry leading ARM64 Server — it says you can use qemu-img to convert a VHDX to a raw disk image file (qemu-img convert -O raw Windows11_InsiderPreview_Client_ARM64_en-us_25201.VHDX Windows11_InsiderPreview_Client_ARM64_en-us_25201.raw), so I may see if that's possible. If I can get that VHDX file... grr...

geerlingguy commented 1 year ago

On my Mac, I downloaded a copy of the VHDX file kindly provided by DB Tech. Then I installed qemu with brew install qemu, and ran:

qemu-img convert -O raw Windows11_InsiderPreview_Client_ARM64_en-us_25324.VHDX Windows11_InsiderPreview_Client_ARM64_en-us_25324.raw

I formatted my Corsair USB stick (128 GB) as ExFAT and copied over the .raw file (it's like 70 GB!).

I booted up Ubuntu on the Ampere Altra, and am copying the .raw file contents over to an unused NVMe SSD:

sudo dd if=/media/jgeerling/Untitled/Windows11_InsiderPreview_Client_ARM64_en-us_25324.raw of=/dev/nvme0n1

Note: on this system, the second drive has Ubuntu installed (/dev/nvme1n1). If you're coming from Google and trying to copy and paste... make sure you don't blow away an installation by copying over it on nvme0n1 lol.

The copy took about 24 minutes (yowza!) at 48.7 MB/sec. Rebooting now...
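As a sanity check on that figure (assuming decimal units, since dd reports MB/s), the transfer time follows directly from the image size and throughput:

```python
raw_size_bytes = 70e9    # the .raw image is "like 70 GB"
rate_bytes_s = 48.7e6    # observed dd throughput of 48.7 MB/sec

# time = size / rate, converted to minutes
minutes = raw_size_bytes / rate_bytes_s / 60
print(f"{minutes:.0f} minutes")  # roughly 24 minutes
```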

geerlingguy commented 1 year ago

It sounds like, according to some, scaling beyond 32 or 64 cores might break Geekbench (possibly due to some internal configuration), so another way to game the system is to lock an instance of Geekbench 5 to each core individually, e.g.:

for ((i=0; i<60; ++i)); do numactl -C $i ./geekbench5 --cpu >> "gkbch$i" 2>&1 & done

A higher or lower instance count than the 60 above might be more efficient once overhead is accounted for. Also, install numactl first with sudo apt install -y numactl.

The 30,000 score does seem low even accounting for overhead losses, compared to an 800 or so single core.
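To put those two figures in context, a quick calculation of the implied scaling efficiency (assuming the ~800 single-core and 30,000 multi-core scores quoted above):

```python
single_core = 800       # approximate single-core GB5 score
cores = 96
measured_multi = 30_000

# Perfect linear scaling would multiply the single-core score by the core count
ideal = single_core * cores
efficiency = measured_multi / ideal
print(f"ideal: {ideal}, observed efficiency: {efficiency:.0%}")
```

So the measured multi-core score is under 40% of a (never achievable, but illustrative) linear-scaling ideal.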

Running it individually on 60 cores, I ended up locking up the machine a bit, but it did seem to be progressing nicely through all the single-core tests. I'll try again later.

geerlingguy commented 1 year ago

Drat, the Windows boot results in a sad face blue screen of death with ACPI_BIOS_ERROR.


geerlingguy commented 1 year ago

To get an Nvidia Quadro RTX 8000 working, I ran sudo apt install libglvnd-dev pkg-config, then installed Nvidia's latest aarch64 driver for the Quadro RTX 8000 (running the .run file with sudo).

I can get full resolution video output through DisplayPort on the card, and Ubuntu (and neofetch) sees the card, alongside lshw. But:

$ nvidia-smi
No devices were found

$ sudo dmesg | tail
[  276.229559] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[  276.229665] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
[  280.597471] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[  280.597541] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
[  466.107317] audit: type=1107 audit(1681788653.426:93): pid=1779 uid=102 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal"  bus="system" path="/org/freedesktop/login1" interface="org.freedesktop.login1.Manager" member="SessionNew" name=":1.12" mask="receive" pid=7171 label="snap.firefox.firefox" peer_pid=1858 peer_label="unconfined"
                exe="/usr/bin/dbus-daemon" sauid=102 hostname=? addr=? terminal=?'
[  473.623874] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[  473.623963] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
[  477.995897] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[  477.995979] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0

In the Nvidia dev forums, I found this post which states:

The nvidia driver doesn’t like your board. ... This is a known and recurring bug in the nvidia driver, the known working versions list is based on reports from other users.

So... looks like it's off to the races to find a working driver!

I also opened an issue on the Ampere Developers site: GPU Support for Ampere Altra?.

geerlingguy commented 1 year ago

Er... apparently I still had the VGA plugged into my monitor, so the DisplayPort output wasn't actually working. D'oh! Well, we'll see what others say. I might have better luck compiling Linux 6.2 with AMD's latest driver support and my RX 6700 XT. We'll see!

istori1 commented 1 year ago

I have been enabling aarch64 support for NVIDIA on Flathub and some apps. When you get the NVIDIA GPU running, there are a couple of apps you can run.

https://github.com/flathub/org.freedesktop.Platform.GL.nvidia/pull/148 https://github.com/flathub/org.freedesktop.Sdk.Extension.llvm15/pull/10 https://github.com/flathub/org.kde.kdenlive/pull/207

https://github.com/flathub/org.jellyfin.JellyfinServer

https://github.com/flathub/com.dec05eba.gpu_screen_recorder

https://github.com/LizardByte/Sunshine/blob/master/packaging/linux/flatpak/dev.lizardbyte.sunshine.yml#L195

ThomasKaiser commented 1 year ago

The 30,000 score does seem low even accounting for overhead losses, compared to an 800 or so single core.

Here someone tests an 80 core Ampere Altra machine, obviously with more reliable power monitoring and generating higher GB5 scores than your machine (882/444285): https://youtu.be/m6-juFXR9c0?t=560

All those 80 core benchmarks show a higher single-threaded score than your machine but multi differs drastically (33000 - 46500): https://browser.geekbench.com/search?page=1&q=MP32-AR1-SW-HZ-001

Why do these scores differ? Obviously nobody knows and nobody cares (at least all scores exceeding 40000 have 256 GB RAM in common)

ThomasKaiser commented 1 year ago

BTW: could you please share output from this command:

cat /sys/devices/system/cpu/cpu*/topology/physical_package_id | sort | uniq -c

I'm curious since a dual-socket Ampere Altra Max system showed different package IDs for cpu0 - cpu8 but then an identical package ID for all the remaining 247 cores. Wondering whether it's the same here (this affects sbc-bench -j mode, which is designed to help explain why Geekbench scores differ).
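For reference, the same grouping can be done in Python (a sketch reading the same sysfs files; the output mirrors `sort | uniq -c`):

```python
from collections import Counter
from pathlib import Path

base = Path("/sys/devices/system/cpu")
# Count how many cpuN directories report each physical package ID
ids = Counter(
    p.read_text().strip()
    for p in base.glob("cpu[0-9]*/topology/physical_package_id")
)
for pkg, n in sorted(ids.items()):
    print(f"{n:7d} {pkg}")
```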

geerlingguy commented 1 year ago

Why do these scores differ? Obviously nobody knows and nobody cares (at least all scores exceeding 40000 have 256 GB RAM in common)

Note that I was told by the folks at Ampere I should fill all six RAM slots in the machine for better scores, as the NUMA layout on the Ampere requires all channels to be full for best performance. Whether that affects Geekbench scores in particular remains to be seen, but judging by STH's testing on Epyc and Ampere, it seems that's a reasonable outcome.

I'll soon be ordering two more 16GB sticks of RAM to match what's already installed. Right now I have the Ampere workstation back in a box because I'm trying to wrap up another project that's taking up the majority of the space in my little workshop :( — I hope to get it back out in mid May.

ThomasKaiser commented 1 year ago

as the NUMA layout on the Ampere requires all channels to be full for best performance

Oh, now with you mentioning NUMA I realize that there's no Ampere Altra with 96 cores so your box combines an AADP-64 in one CPU socket with an AADP-32 in the other? Unfortunately the 'info' collected by neofetch lacks most interesting bits.

Anyway: once you unpack your box again I would really love to see sbc-bench -G (executing Geekbench in a monitored environment), or sbc-bench -j and, after the initial measurements, executing Geekbench in another terminal. The tool has been designed to also provide hints as to why scores differ (could be as simple as 'excessive swapping', since Geekbench needs an insane amount of memory per thread).

geerlingguy commented 1 year ago

This chip is the M96-28, which is a 96-core part running at 2.8 GHz. I will try to get the machine up and running this week if I get a chance, and I'll give you more info. But I might not be able to until May.

geerlingguy commented 1 year ago

It looks like their EDK2 fork just got a fix for the Windows BSOD I was running into (apparently it should've already worked on the 32/64/80-core Altra parts, but it needed fixing on the Altra Max 96/128-core parts). Waiting for a new version of the firmware to be available and I'll test it out again!

joespeed commented 1 year ago

(apparently it should've already worked on the 32/64/80-core Altra parts, but it needed fixing on the Altra Max 96/128-core parts). Waiting for a new version of the firmware to be available and I'll test it out again!

that's what I get for thinking myself clever and having the 96-core part swapped into your PC 🙄

Bohanke commented 1 year ago

Just sent you the new release of edk2 that will enable you to run Windows 11 on the new Ampere Altra Max generation.

geerlingguy commented 1 year ago

I have Windows 11 Pro for Arm Insider Preview running now. A few stats:

geerlingguy commented 1 year ago

Yes, it can play Crysis. Depending on your definition of the word 'play' ;)

(Image: DSC01874)

farcop commented 1 year ago

I missed in the article an attempt to install ESXi-Arm on this platform and, for example, deploy k8s. Seems like a must-have.

rbapat-ampere commented 1 year ago

A few tips and tricks to boost HPL performance. (Visit Ampere Github for a detailed explanation)

  1. Change HPL.dat input values to P = 8 and Q = 12 respectively.

  2. Change HPL.dat Ns value to maximize the RAM utilization.

  3. Use the Ampere-Oracle BLIS Math libraries. Instructions on how to install/use them are as follows.

     git clone https://github.com/flame/blis.git MyBlisDir
     cd MyBlisDir

     # Switch to the new ampere branch
     git checkout ampere
     ./QuickStart.sh altramax

     # Ensure the Ampere-Oracle BLIS paths are exported to PATH and LD_LIBRARY_PATH appropriately.
     source ./blis_build_altramax.sh
     source blis_setenv.sh
     export LD_LIBRARY_PATH=<Install_Path>/MyBlisDir/lib/altramax
  4. Use the attached Makefile to build the benchmark: Make.zip (I had to zip the makefile since I could not attach it directly)

  5. Edit lines 70, 95, 96 & 97 in the makefile to match your HPL and Ampere-Oracle BLIS installation paths.

  6. Build the binary: make arch=Altramax_oracleblis -j

  7. Run benchmark as usual.

geerlingguy commented 1 year ago

I'm re-testing my system a bit using the above guidelines from @rbapat-ampere — thanks! See https://github.com/geerlingguy/top500-benchmark/issues/10

rbapat-ampere commented 1 year ago

I'm re-testing my system a bit using the above guidelines from @rbapat-ampere — thanks! See geerlingguy/top500-benchmark#10

Thanks @geerlingguy. Also, if you'd like a more detailed guide including performance expectations (i.e. the scores we are observing on our end), refer to: https://github.com/AmpereComputing/HPL-on-Ampere-Altra

geerlingguy commented 1 year ago

Interesting findings after upgrading to 96 GB of RAM (6 x 16 GB ECC DDR4 sticks)

That new Geekbench 5 multicore score bumps the system up to page 216 of the results

geerlingguy commented 1 year ago

The Geekbench 6.1.0 for Windows aarch64 preview seems to work and the result is submitted, but then the console window it's running in disappears without giving a claim key. But here is the result of one of my test runs: https://browser.geekbench.com/v6/cpu/1567571 (1008 single core / 10481 multi core).

Watching Windows' built-in CPU core monitor, it seems less than half of the available 96 cores are actually in use, so the multi core score is wildly inaccurate. See: http://support.primatelabs.com/discussions/geekbench/82502-geekbench-6-doesnt-install-correctly-under-windows-on-arm-on-ampere

geerlingguy commented 1 year ago

As an additional data point, with 96 GB of RAM, the measly 5647 Cinebench R23 score under Windows 11 for Arm bumps to a slightly-less-measly 6858 - that's a 20% speedup! Just imagine how well Cinebench would run if it weren't only optimized for X86 :)
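A quick check of that speedup from the two scores quoted above:

```python
before, after = 5647, 6858  # Cinebench R23 multi-core, 64 GB vs 96 GB of RAM

# Relative speedup from filling all six memory channels
speedup = (after - before) / before
print(f"{speedup:.1%}")  # a bit over 20%
```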

geerlingguy commented 1 year ago

Following the recommended installation for NVIDIA drivers for Linux, I am now seeing output through my 3080 Ti, and can have some more fun.

nvidia-smi output:

jgeerling@ampere-altra:~$ nvidia-smi
Mon Jun 12 09:55:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 0000000D:01:00.0  On |                  N/A |
| 48%   53C    P0   111W / 400W |    315MiB / 12288MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3631      G   /usr/lib/xorg/Xorg                 96MiB |
|    0   N/A  N/A      3805      G   /usr/bin/gnome-shell              128MiB |
|    0   N/A  N/A      5079      G   ...7/usr/lib/firefox/firefox       87MiB |
+-----------------------------------------------------------------------------+

glmark2 score:

jgeerling@ampere-altra:~$ glmark2
=======================================================
    glmark2 2021.02
=======================================================
    OpenGL Information
    GL_VENDOR:     NVIDIA Corporation
    GL_RENDERER:   NVIDIA GeForce RTX 3080 Ti/PCIe
    GL_VERSION:    4.6.0 NVIDIA 525.116.04
=======================================================
[build] use-vbo=false: FPS: 6367 FrameTime: 0.157 ms
[build] use-vbo=true: FPS: 12194 FrameTime: 0.082 ms
[texture] texture-filter=nearest: FPS: 12209 FrameTime: 0.082 ms
[texture] texture-filter=linear: FPS: 12207 FrameTime: 0.082 ms
[texture] texture-filter=mipmap: FPS: 12224 FrameTime: 0.082 ms
[shading] shading=gouraud: FPS: 12028 FrameTime: 0.083 ms
[shading] shading=blinn-phong-inf: FPS: 11940 FrameTime: 0.084 ms
[shading] shading=phong: FPS: 11976 FrameTime: 0.084 ms
[shading] shading=cel: FPS: 11990 FrameTime: 0.083 ms
[bump] bump-render=high-poly: FPS: 11369 FrameTime: 0.088 ms
[bump] bump-render=normals: FPS: 12373 FrameTime: 0.081 ms
[bump] bump-render=height: FPS: 12112 FrameTime: 0.083 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 12015 FrameTime: 0.083 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 11416 FrameTime: 0.088 ms
[pulsar] light=false:quads=5:texture=false: FPS: 12023 FrameTime: 0.083 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 5273 FrameTime: 0.190 ms
[desktop] effect=shadow:windows=4: FPS: 6034 FrameTime: 0.166 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1432 FrameTime: 0.698 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 1760 FrameTime: 0.568 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1996 FrameTime: 0.501 ms
[ideas] speed=duration: FPS: 8453 FrameTime: 0.118 ms
[jellyfish] <default>: FPS: 11187 FrameTime: 0.089 ms
[terrain] <default>: FPS: 1677 FrameTime: 0.596 ms
[shadow] <default>: FPS: 10011 FrameTime: 0.100 ms
[refract] <default>: FPS: 7168 FrameTime: 0.140 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 12174 FrameTime: 0.082 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 12092 FrameTime: 0.083 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 12086 FrameTime: 0.083 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 12119 FrameTime: 0.083 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 12069 FrameTime: 0.083 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 12090 FrameTime: 0.083 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 12016 FrameTime: 0.083 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 11917 FrameTime: 0.084 ms
=======================================================
                                  glmark2 Score: 9878 
=======================================================

Super Tux Kart is giving 112 fps on max settings at 1440x900, and I'm going to test the Doom 3 install they have in their repo next.

geerlingguy commented 1 year ago

The Doom 3 demo gave a constant 60fps, and I couldn't seem to get it unlocked. I was able to install openarena as well, and that gave 1000 fps continuous when I set max framerate to 0 but it started getting jittery (I don't know if I could go beyond that), and if I set 900 or below, it would just lock that frame rate all day.

I'd say the 3080 Ti is complete overkill for 10-20+ year old games :D

geerlingguy commented 1 year ago

I was trying to get Stable Diffusion running, but ran into some issues with CUDA + NVIDIA driver versions. I might attempt again with a completely clean install, but NVIDIA's support makes it all quite 'fun': finding compatible driver / CUDA version pairings can be a challenge, and it doesn't help that most of the Stable Diffusion guides want you to install their own mix of packages, which may or may not break your working CUDA install, lol.

geerlingguy commented 1 year ago

Using the optimized BLIS library from Oracle, I was able to get 985 Gflops at 270W, for a new efficiency of 3.64 Gflops/W.
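That efficiency figure is just the ratio of the two numbers above (≈3.65 before truncation, in line with the quoted 3.64):

```python
gflops = 985   # measured HPL performance
watts = 270    # measured power draw

# Efficiency in Gflops per watt
print(f"{gflops / watts:.2f} Gflops/W")
```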

ThomasKaiser commented 1 year ago

This chip is the M96-28, which is a 96-core part running at 2.8 GHz

Have you tried to verify the '2.8 GHz' (e.g. with sbc-bench -j)?

The reason why I'm asking (link might not work with Safari for whatever reason) is Willy's cpufreq measurements here at the very bottom: https://www.cnx-software.com/2023/04/18/adlink-ampere-altra-dev-kit-features-atx-motherboard-with-32-to-80-core-arm-com-hpc-cpu-module/#comment-608818

geerlingguy commented 1 year ago

New video today: https://www.youtube.com/watch?v=ydGdHjIncbk

And @ThomasKaiser - I'll try to get a run on it later today.

AlexLandherr commented 1 year ago

Hi Jeff,

To run my benchmark prime_threads just follow the README. I've coded everything in C++ and its standard library. My dev system is an Ubuntu laptop using an i7-12700H. My precise OS version is Ubuntu 22.04.2 LTS 64-bit.

So far on ARM I've only tried on a RPi4B without issue. One thing that might be a problem is that one LD flag is -lpthread, though I could be wrong as I am only about 9 months into C++.

Repo (EDIT: Corrected the repo link): prime_threads

geerlingguy commented 1 year ago

@ThomasKaiser - Posting results here: https://github.com/ThomasKaiser/sbc-bench/issues/72

jgeerling@adlink-ampere:~/sbc-bench$ sudo ./sbc-bench.sh

sbc-bench v0.9.42

Installing needed tools: apt -f -qq -y install lm-sensors sysstat powercap-utils p7zip, tinymembench, ramlat, mhz. Done.
Checking cpufreq OPP. Done (results will be available in 7-10 minutes).
Executing tinymembench. Done.
Executing RAM latency tester. Done.
Executing OpenSSL benchmark. Done.
Executing 7-zip benchmark. Done.
Checking cpufreq OPP again. Done (8 minutes elapsed).

Results validation:

  * Measured clockspeed not lower than advertised max CPU clockspeed
  * No swapping
  * Background activity (%system) OK
  * Throttling occured

Memory performance
memcpy: 10132.6 MB/s
memset: 44753.7 MB/s

7-zip total scores (3 consecutive runs): 249521,248641,249970, single-threaded: 3858

OpenSSL results:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc     796165.06k  1576202.24k  2023527.51k  2171779.07k  2227093.50k  2232833.37k
aes-128-cbc     791871.84k  1579594.97k  2023827.97k  2169844.05k  2226918.74k  2232669.53k
aes-192-cbc     750426.13k  1387396.22k  1718240.43k  1807296.85k  1857686.19k  1861642.92k
aes-192-cbc     744861.90k  1387979.75k  1717576.70k  1807480.15k  1857890.99k  1861675.69k
aes-256-cbc     715428.22k  1227382.25k  1484381.87k  1563873.62k  1593264.81k  1596145.66k
aes-256-cbc     716891.95k  1230795.03k  1486725.72k  1563989.67k  1593251.16k  1596080.13k

Full results uploaded to http://ix.io/4zGI
geerlingguy commented 1 year ago

@AlexLandherr - I ran the benchmark and noticed only one core was getting any load:

jgeerling@adlink-ampere:~/prime_benchmark$ ./prime_benchmark
Start prime_benchmark between 1 and 100000000? [Y/n]: Y
Starting prime benchmark...
Iteration 1 of 10 Runtime (ns): 98088310584
Iteration 2 of 10 Runtime (ns): 98097435960
Iteration 3 of 10 Runtime (ns): 98094310171
Iteration 4 of 10 Runtime (ns): 98087703900
Iteration 5 of 10 Runtime (ns): 98089309445
...

I think it might be faster if it ran on all cores :D

AlexLandherr commented 1 year ago

@AlexLandherr - I ran the benchmark and noticed only one core was getting any load:

jgeerling@adlink-ampere:~/prime_benchmark$ ./prime_benchmark
Start prime_benchmark between 1 and 100000000? [Y/n]: Y
Starting prime benchmark...
Iteration 1 of 10 Runtime (ns): 98088310584
Iteration 2 of 10 Runtime (ns): 98097435960
Iteration 3 of 10 Runtime (ns): 98094310171
Iteration 4 of 10 Runtime (ns): 98087703900
Iteration 5 of 10 Runtime (ns): 98089309445
...

I think it might be faster if it ran on all cores :D

Check the corrected link, I messed up the URLs. Correct one: https://github.com/AlexLandherr/prime_threads

EDIT 1: I hope it does better than the result from that old test program, which I should probably make private again since it's legacy code at this point.

EDIT 2: Now the old repo is private. I've modified the prime_threads source code a bit so that all the iteration times from a run are also written to the log file in the logs directory.

If you run it successfully, would it be okay with you if I link to your results in my README for prime_threads?

ThomasKaiser commented 1 year ago

Talking about reinventing the wheel...

I ran the benchmark and noticed only one core was getting any load

Same with Bruce's primes benchmark he introduced over there a while ago: https://hoult.org/primes.txt

I think it might be faster if it ran on all cores

I wonder which use case / workload such a multi-threaded 'benchmark' could represent since the whole working set fits inside CPU caches but usually server machines like these Ampere boxes run tasks that depend (heavily) on RAM access...

AlexLandherr commented 1 year ago

> I wonder which use case / workload such a multi-threaded 'benchmark' could represent since the whole working set fits inside CPU caches but usually server machines like these Ampere boxes run tasks that depend (heavily) on RAM access...

I haven't thought about that much; I mostly wrote mine to learn more about the language itself, and time seemed like an easily understandable measure.

geerlingguy commented 1 year ago

@AlexLandherr - results:

jgeerling@adlink-ampere:~/prime_threads$ ./prime_threads
System supports 96 threads.
    ****    
Search limit for all options are (both are inclusive limits):
Lower search limit: 1
Upper search limit: 100000000

Select benchmark mode from list below:
0. Single-threaded benchmark.
1. Multi-threaded benchmark (96 threads).
2. Exit program.
Enter one of the listed values: 1

Numerator (also upper search limit): 100000000
Denominator (also how many threads): 96
Quotient: 1041666
Remainder: 64
  ****  
Starting multithreaded prime benchmark...

Started at: 2023-07-05 02:10:34 UTC
Iteration 1 of 10 Runtime (ns): 1650924933
Iteration 2 of 10 Runtime (ns): 1381674931
Iteration 3 of 10 Runtime (ns): 1381530659
Iteration 4 of 10 Runtime (ns): 1381874490
Iteration 5 of 10 Runtime (ns): 1382733573
Iteration 6 of 10 Runtime (ns): 1380218386
Iteration 7 of 10 Runtime (ns): 1392295779
Iteration 8 of 10 Runtime (ns): 1380862075
Iteration 9 of 10 Runtime (ns): 1381572084
Iteration 10 of 10 Runtime (ns): 1381265241
Prime multithreaded benchmark is done!

Stopped at: 2023-07-05 02:10:48 UTC

**** Results ****
Program mode is: Multithreaded
Thread count: 96
Search started at: 2023-07-05 02:10:34 UTC
Search ended at: 2023-07-05 02:10:48 UTC
Program ran for total of (DD:HH:MM:SS.SSSSSSSSS): 00:00:00:14.095112951

Program ran for: 14095112951 ns

Average time to find all primes between 1 and 100000000 was (DD:HH:MM:SS.SSSSSSSSS):
00:00:00:01.409495215

Average search time: 1409495215 ns
Number of primes found is: 5761455

geerlingguy commented 1 year ago

@ThomasKaiser - I compiled @brucehoult's primes.txt and got:

jgeerling@adlink-ampere:~$ gcc -O3 primes.c -o primes
jgeerling@adlink-ampere:~$ ./primes
Starting run
3713160 primes found in 5641 ms
-480 bytes of code in countPrimes()

That seems slow to me, but the Ampere doesn't necessarily have the fastest single-core performance, either...

(And with -O1:)

3713160 primes found in 5830 ms
256 bytes of code in countPrimes()

ThomasKaiser commented 1 year ago

> I compiled @brucehoult's primes.txt ... That seems slow to me

Why? The Ampere's N1 cores are A76 siblings (see also the core IDs) and for an A76 at 2.8 GHz with these cache sizes this looks reasonable.

And we shouldn't forget that the 2.8 GHz figure is still questionable; see Willy's investigations into actually measuring clock speeds, since unfortunately we can't trust the values the cpufreq driver reports in general :(

AlexLandherr commented 1 year ago

> @AlexLandherr - results:
>
> [...]
>
> Average search time: 1409495215 ns
> Number of primes found is: 5761455

Thanks very much Jeff! Here's my result for the max power settings on my i7-12700H:

Program mode is: Multithreaded
Thread count: 20
Search started at: 2023-07-05 08:33:13 UTC
Search ended at: 2023-07-05 08:34:11 UTC
Program ran for total of (DD:HH:MM:SS.SSSSSSSSS): 00:00:00:58.569630496

Program ran for: 58569630496 ns

Average time to find all primes between 1 and 100000000 was (DD:HH:MM:SS.SSSSSSSSS):
00:00:00:05.856952223

Average search time: 5856952223 ns
Number of primes found is: 5761455

**** Runtime for each iteration ****
Iteration 1 of 10 Runtime (ns): 5806562542
Iteration 2 of 10 Runtime (ns): 5679193186
Iteration 3 of 10 Runtime (ns): 5739044034
Iteration 4 of 10 Runtime (ns): 5868012868
Iteration 5 of 10 Runtime (ns): 5759130597
Iteration 6 of 10 Runtime (ns): 5850852134
Iteration 7 of 10 Runtime (ns): 6006729002
Iteration 8 of 10 Runtime (ns): 5994649762
Iteration 9 of 10 Runtime (ns): 5895252589
Iteration 10 of 10 Runtime (ns): 5970095523

Program started at: 2023-07-05 08:33:04 UTC
Program ended at: 2023-07-05 08:34:11 UTC

Is it okay with you for me to link to your results in my README?

geerlingguy commented 1 year ago

> Is it okay with you for me to link to your results in my README?

Sure!

AlexLandherr commented 1 year ago

> Is it okay with you for me to link to your results in my README?
>
> Sure!

It's been live since yesterday; I've been AFK from my dev laptop for about 48 hours.

Next up: learn more about make (I've got a book for it)!

Bohanke commented 1 year ago

Was a little bored on the weekend, and since you started this on Windows 11, I had to try it on Ubuntu with a graphics card installed.

Most of the time it will not start up; when it does, it's a miracle. I got it working with an Nvidia RTX 6000 card. The online instructions are terrible: I had to combine several sets of directions to even get box86/box64 correctly compiled and installed. I borrowed some instructions for the RPi 4, but the build flags are a little different. This is the steam.deb from the official homepage.

We have not run any games yet, because when it aborts it takes many starts to get it back up. We will spend some time documenting it.

A lot of complaints, though, asking for i386 libraries that are of course not available for arm (armhf).

geerlingguy commented 1 year ago

@Bohanke - I was having similar issues with the i386 packages.

Bohanke commented 1 year ago

Rock solid

Start with a full Ubuntu Server 20.04.5 install:

$ sudo apt update
$ sudo apt upgrade

$ sudo apt install tasksel
$ sudo tasksel install ubuntu-desktop
$ sudo systemctl set-default graphical.target
$ sudo reboot

$ sudo nano /etc/default/grub

Add "pcie_aspm=off" to the GRUB_CMDLINE_LINUX="" line and save the change. Then update grub and reboot:

$ sudo update-grub

$ sudo touch /etc/modprobe.d/blacklist-nouveau.conf
$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf

Add:

blacklist nouveau
options nouveau modeset=0

Save the file.

$ sudo update-initramfs -u

Ignore this warning:

W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast

$ sudo reboot

Install some tools for Nvidia graphics (and maybe CUDA later?):

$ sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa-dev libfreeimage-dev libglfw3-dev

Go to the Nvidia homepage and download the install package for your particular card. In this example the card is an RTX 6000.

Copy the file into your home directory and make sure it's executable.

Run the installer (do not do this in an SSH shell; it has to be done locally!):

$ sudo ./NVIDIA-Linux-aarch64-535.54.03.run

$ sudo reboot

You now have Ubuntu desktop with Nvidia drivers installed and booting up the Nvidia display

Box86 + Steam and Box64

$ sudo dpkg --add-architecture armhf

==== Do not compile box86 yourself right now, because which standard GCC and gcc-arm-linux-gnueabihf versions box86 needs is not clear to me!

Use @Itai-Nelken's apt repository to install precompiled box86 debs, updated weekly.

$ sudo wget https://itai-nelken.github.io/weekly-box86-debs/debian/box86.list -O /etc/apt/sources.list.d/box86.list
$ wget -qO- https://itai-nelken.github.io/weekly-box86-debs/debian/KEY.gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/box86-debs-archive-keyring.gpg
$ sudo apt update && sudo apt install box86:armhf -y

Still use the script file to install Steam, but we need to clone the box86 git sources for that:

$ git clone https://github.com/ptitSeb/box86

$ cd box86

$ ./install_steam.sh

$ sudo wget https://ryanfortner.github.io/box64-debs/box64.list -O /etc/apt/sources.list.d/box64.list
$ wget -qO- https://ryanfortner.github.io/box64-debs/KEY.gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/box64-debs-archive-keyring.gpg
$ sudo apt update && sudo apt install box64-generic-arm -y

Maybe needed?

$ sudo systemctl restart systemd-binfmt

$ sudo reboot

Bohanke commented 1 year ago

We (me and my son :) made a target list of games to try out first. They are listed as Platinum when it comes to compatibility.

Here is a lot of compatibility info: https://www.protondb.com/

Platinum: most compatible and most popular.

Castle Crashers 255MB !!
Human: Fall Flat 500MB !!
Untitled Goose Game 800MB
Amazing Cultivation Sim. 1GB xxx
MudRunner 1GB
Cuphead 2GB
Inside 3GB
Deep Rock Galactic 3GB xxx
Pit People 3GB
Farming Simulator 15 3GB xxx
Risk of Rain 2 4GB
Grim Dawn 5GB xxx
What Remains of Edith Finch 5GB
Alien Swarm: Reactive Drop 6GB
ICEY 6GB
Rocket League 7GB
Outer Wilds 8GB xxx
Mordheim: City of the Damned 8GB
Dishonored 9GB
Little Nightmares 10GB
Little Nightmares II 10GB
DRAGON BALL XENOVERSE 10GB xxx
Bloodstained: Ritual of the Night 10GB xxx
Ori and the Blind Forest: Def Edition 11GB
STEINS;GATE 13GB xxx
Satisfactory 15GB xxx
Atelier Ryza: Ever Darkness & Hideout 15GB xxx
AI*Shoujo/AI*少女 15GB xxx
A Plague Tale: Innocence 20GB xxx
鬼谷八荒 Tale of Immortal 25GB xxx
Sekiro: Shadows Die Twice - GOTY Edition 25GB
Ryse: Son of Rome 26GB
Control Ultimate Edition 26GB
Aliens: Fireteam Elite 30GB
Wolfenstein II: The New Colossus 42GB
Wolfenstein: The New Order 44GB
The Witcher 3: Wild Hunt 50GB
NieR:Automata 50GB
ACE COMBAT 7: SKIES UNKNOWN 50GB
SCARLET NEXUS 50GB
Doom 55GB

xxx means maybe small issues

geerlingguy commented 1 year ago

Still trying to replicate the 1.2+ Tflop results with the 96-core CPU on my Dev Workstation; so far I'm only getting just under 1 Tflop: https://github.com/AmpereComputing/HPL-on-Ampere-Altra/issues/10

ThomasKaiser commented 1 year ago

> still only getting just under 1 Tflop

Have you tried to monitor benchmark execution this time? That is, do you really know your CPU cores clock at a certain clockspeed (I already mentioned the underlying issue earlier in this issue) and that neither throttling nor swapping occurred? The Linpack FAQ clearly states:

> If the problem size you pick is too large, swapping will occur, and the performance will drop

sbc-bench -R is designed for just this: checking whether the execution of any other benchmark does what it should. Run it in another terminal window prior to benchmarking and stop it with [ctrl]-[c] afterwards. Real (measured) clockspeeds will be reported before and after the benchmark, along with whether throttling and/or swapping occurred.

geerlingguy commented 1 year ago

@ThomasKaiser - Yes, we've been monitoring that. At this point I went to the extent of buying new Samsung RAM to match the sticks in Ampere's test system, and we've seen that the Samsung x4 RAM sticks produce 20-30% faster RAM benchmark results than the Transcend x8 sticks (specs otherwise identical: DDR4-3200 ECC CL22).

I just re-tested in tinymembench and the Samsung RAM is getting:

 standard memcpy                                      :  12107.0 MB/s
 standard memset                                      :  44746.4 MB/s

Whereas the Transcend RAM was getting:

 standard memcpy                                      :   9894.0 MB/s
 standard memset                                      :  44745.2 MB/s

For more details, see https://github.com/AmpereComputing/HPL-on-Ampere-Altra/issues/10#issuecomment-1711717393

Re-running Top500 HPL benchmark now.