geerlingguy opened this issue 1 year ago
Also tested with Phoronix:

- pts/build-linux-kernel: https://openbenchmarking.org/result/2304170-NE-LINUXCOMP70 (116.59 seconds)
- pts/phpbench: https://openbenchmarking.org/result/2304174-NE-AMPEREALT14 (498149)
- pts/compress-7zip: https://openbenchmarking.org/result/2304174-NE-AMPEREALT36 (161132 / 337214)

Tried to install Windows 11 and it's been quite the saga. Haven't been able to get an ISO that boots yet:
I know at least Windows 11 Pro can be run on these chips:
I hate knowing something is possible but not being able to do it lol.
Edit: After a ton of searching I also found this article: Ampere ALTRA – Industry leading ARM64 Server — it says you can use qemu-img to convert a VHDX to a raw disk image file (qemu-img convert -O raw Windows11_InsiderPreview_Client_ARM64_en-us_25201.VHDX Windows11_InsiderPreview_Client_ARM64_en-us_25201.raw), so I may see if that's possible. If I can get that VHDX file... grr...
On my Mac, I downloaded a copy of the VHDX file kindly provided by DB Tech. Then I installed qemu with brew install qemu, and ran:
qemu-img convert -O raw Windows11_InsiderPreview_Client_ARM64_en-us_25324.VHDX Windows11_InsiderPreview_Client_ARM64_en-us_25324.raw
I formatted my Corsair USB stick (128 GB) as ExFAT and copied over the .raw file (it's like 70 GB!).
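Before committing to a 70 GB copy, qemu-img info is a cheap sanity check that the conversion produced a plain raw image. A sketch on a tiny throwaway image (uses the qemu installed above; demo.qcow2/demo.raw are scratch filenames, not the real image):

```shell
# Same convert workflow as above, exercised on a small scratch image.
qemu-img create -f qcow2 demo.qcow2 16M
qemu-img convert -O raw demo.qcow2 demo.raw

# Inspect the result; it should report: file format: raw
qemu-img info demo.raw
rm -f demo.qcow2 demo.raw
```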
I booted up Ubuntu on the Ampere Altra, and am copying the .raw file contents over to an unused NVMe SSD:
sudo dd if=/media/jgeerling/Untitled/Windows11_InsiderPreview_Client_ARM64_en-us_25324.raw of=/dev/nvme0n1
Note: on this system, the second drive has Ubuntu installed (/dev/nvme1n1). If you're coming from Google and trying to copy and paste... make sure you don't blow away an installation by copying over it on nvme0n1, lol.
The copy took about 24 minutes (yowza!) at 48.7 MB/sec. Rebooting now...
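In that spirit, it's worth double-checking which device is which before a raw dd. This sketch demonstrates the same copy-and-verify flow on throwaway files so it's safe to run; for the real thing, IMG would be the converted .raw image and DEV the *unused* target device (the bs value is illustrative):

```shell
# List block devices first -- dd happily overwrites the wrong disk.
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT 2>/dev/null || true

# Stand-in files for the demo; for real use: IMG=the .raw image,
# DEV=/dev/nvme0n1 (double-check against the lsblk output!).
IMG=$(mktemp) DEV=$(mktemp)
dd if=/dev/urandom of="$IMG" bs=1M count=4 status=none

# bs=4M speeds up big copies, status=progress shows throughput, and
# conv=fsync flushes to the target before dd exits.
dd if="$IMG" of="$DEV" bs=4M status=progress conv=fsync

# Verify the copy is bit-identical.
cmp "$IMG" "$DEV" && echo 'copy verified'
rm -f "$IMG" "$DEV"
```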
It sounds like, according to some, scaling beyond 32 or 64 cores might break with Geekbench (possibly due to some internal configuration), so another way to game that system is to lock an instance of Geekbench 5 to each core individually, e.g.:

for ((i=0; i<60; ++i)); do numactl -C $i ./geekbench5 --cpu >> "gkbch$i" 2>&1 & done

A higher or lower core count than the 60 above could be more efficient once overhead is accounted for. Also, install numactl first with sudo apt install -y numactl.
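Once those background runs finish, the per-core logs could be harvested with something like this sketch ('Single-Core Score' is an assumed match pattern; adjust it to whatever your Geekbench build actually prints):

```shell
# Wait for every backgrounded Geekbench instance to finish, then pull
# the first score line out of each per-core log written by the loop above.
wait
for ((i=0; i<60; ++i)); do
  printf 'core %2d: ' "$i"
  grep -m1 -o 'Single-Core Score.*' "gkbch$i" 2>/dev/null || echo '(no score found)'
done
```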
The 30,000 score does seem low even accounting for overhead losses, compared to an 800 or so single core.
Running it individually on 60 cores, I ended up locking up the machine a bit, but it did seem to be progressing nicely through all the single-core tests. I'll try again later.
Drat, the Windows boot results in a sad face blue screen of death with ACPI_BIOS_ERROR.
To get an Nvidia Quadro RTX 8000 working, I ran sudo apt install libglvnd-dev pkg-config, then installed Nvidia's latest aarch64 driver for the Quadro RTX 8000 (running the .run file with sudo).
I can get full resolution video output through DisplayPort on the card, and Ubuntu (and neofetch) sees the card, as does lshw. But:
$ nvidia-smi
No devices were found
$ sudo dmesg | tail
[ 276.229559] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[ 276.229665] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
[ 280.597471] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[ 280.597541] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
[ 466.107317] audit: type=1107 audit(1681788653.426:93): pid=1779 uid=102 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal" bus="system" path="/org/freedesktop/login1" interface="org.freedesktop.login1.Manager" member="SessionNew" name=":1.12" mask="receive" pid=7171 label="snap.firefox.firefox" peer_pid=1858 peer_label="unconfined"
exe="/usr/bin/dbus-daemon" sauid=102 hostname=? addr=? terminal=?'
[ 473.623874] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[ 473.623963] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
[ 477.995897] NVRM: GPU 000d:01:00.0: RmInitAdapter failed! (0x24:0x65:1423)
[ 477.995979] NVRM: GPU 000d:01:00.0: rm_init_adapter failed, device minor number 0
In the Nvidia dev forums, I found this post which states:
The nvidia driver doesn’t like your board. ... This is a known and recurring bug in the nvidia driver, the known working versions list is based on reports from other users.
So... looks like it's off to the races to find a working driver!
I also opened an issue on the Ampere Developers site: GPU Support for Ampere Altra?.
Er... apparently I still had the VGA plugged into my monitor, so the DisplayPort output wasn't actually working. D'oh! Well, we'll see what others say. I might have better luck compiling Linux 6.2 with AMD's latest driver support and my RX 6700 XT. We'll see!
I have been enabling aarch64 support for NVIDIA on Flathub and some apps. When you get the NVIDIA GPU running, there are a couple of apps you can run.
https://github.com/flathub/org.freedesktop.Platform.GL.nvidia/pull/148
https://github.com/flathub/org.freedesktop.Sdk.Extension.llvm15/pull/10
https://github.com/flathub/org.kde.kdenlive/pull/207
https://github.com/flathub/org.jellyfin.JellyfinServer
Here someone tests an 80 core Ampere Altra machine, obviously with more reliable power monitoring and generating higher GB5 scores than your machine (882/444285): https://youtu.be/m6-juFXR9c0?t=560
All those 80 core benchmarks show a higher single-threaded score than your machine but multi differs drastically (33000 - 46500): https://browser.geekbench.com/search?page=1&q=MP32-AR1-SW-HZ-001
Why do these scores differ? Obviously nobody knows and nobody cares (at least all scores exceeding 40000 have 256 GB RAM in common)
BTW: could you please share output from this command:
cat /sys/devices/system/cpu/cpu*/topology/physical_package_id | sort | uniq -c
I'm curious since a dual socket Ampere Altra Max system showed different package IDs for cpu0 - cpu8, but then an identical package ID for all the remaining 247 cores. Wondering whether it's the same here (this affects sbc-bench -j mode, which is designed to help explain why Geekbench scores differ).
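For anyone wanting to cross-check that quirk, the same sysfs data can be summarized per package ID (Linux sysfs paths as in the one-liner above):

```shell
# Count how many cores report each physical package ID -- the same data
# as the cat | sort | uniq -c one-liner, just grouped explicitly.
awk '{count[$1]++} END {for (id in count) print "package " id ": " count[id] " cores"}' \
  /sys/devices/system/cpu/cpu*/topology/physical_package_id
```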
Note that I was told by the folks at Ampere I should fill all six RAM slots in the machine for better scores, as the NUMA layout on the Ampere requires all channels to be full for best performance. Whether that affects Geekbench scores in particular remains to be seen, but judging by STH's testing on Epyc and Ampere, it seems that's a reasonable outcome.
I'll soon be ordering two more 16GB sticks of RAM to match what's already installed. Right now I have the Ampere workstation back in a box because I'm trying to wrap up another project that's taking up the majority of the space in my little workshop :( — I hope to get it back out in mid May.
as the NUMA layout on the Ampere requires all channels to be full for best performance
Oh, now with you mentioning NUMA I realize that there's no Ampere Altra with 96 cores, so your box combines an AADP-64 in one CPU socket with an AADP-32 in the other? Unfortunately the 'info' collected by neofetch lacks the most interesting bits.
Anyway: once you unpack your box again I would really love to see sbc-bench -G (executing Geekbench in a monitored environment), or sbc-bench -j and then, after the initial measurements, Geekbench executed in another terminal. The tool has been designed to also provide hints as to why scores differ (it could be as simple as 'excessive swapping', since Geekbench needs an insane amount of memory per thread).
This chip is the M96-28, which is a 96-core part running at 2.8 GHz. I will try to get the machine up and running this week if I get a chance, and I'll give you more info. But I might not be able to until May.
It looks like their EDK2 fork just got a fix for the Windows BSOD I was running into (apparently it should've already worked on the 32/64/80-core Altra parts, but it needed fixing on the Altra Max 96/128-core parts). Waiting for a new version of the firmware to be available and I'll test it out again!
that's what I get for thinking myself clever and having a 96-core swapped into your PC 🙄
Just sent you the new release of edk2 that will enable you to run Windows 11 on the new Ampere Altra Max generation.
I have Windows 11 Pro for Arm Insider Preview running now. A few stats:
Yes, it can play Crysis. Depending on your definition of the word 'play' ;)
I missed in the article an attempt to install ESXi-Arm on this platform and, for example, deploy k8s. Seems like a must-have.
A few tips and tricks to boost HPL performance (visit the Ampere GitHub for a detailed explanation):

1. Change the HPL.dat input values to P = 8 and Q = 12 (so P × Q = 96, one process per core).
2. Change the HPL.dat Ns value to maximize RAM utilization.
3. Use the Ampere-Oracle BLIS math libraries. Instructions on how to install/use them are as follows.
git clone https://github.com/flame/blis.git MyBlisDir
cd MyBlisDir
# Switch to the new ampere branch
git checkout ampere
./QuickStart.sh altramax
# Ensure that the test bench contains Ampere-Oracle BLIS exported to PATH and LD_LIBRARY_PATH appropriately.
source ./blis_build_altramax.sh
source blis_setenv.sh
export LD_LIBRARY_PATH=<Install_Path>/MyBlisDir/lib/altramax
Use the attached Makefile to build the benchmark: Make.zip (I had to zip the makefile since I could not attach it directly)
Edit lines 70, 95, 96 & 97 in the makefile to match your HPL and Ampere-Oracle BLIS installation paths.
Build the binary: make arch=Altramax_oracleblis -j
Run benchmark as usual.
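For tip 2 above (sizing Ns to RAM), a common rule of thumb is N ≈ √(f × RAM_bytes / 8), since HPL factors an N×N matrix of 8-byte doubles. A sketch, where the 0.80 memory fraction is my assumption (tune it, and leave headroom for the OS):

```shell
# Estimate HPL's Ns from installed RAM: N*N*8 bytes should fit inside a
# fraction (here 0.80) of total memory, so N = sqrt(0.80 * RAM_bytes / 8).
mem_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
awk -v kb="$mem_kb" 'BEGIN {printf "Suggested Ns: %d\n", sqrt(0.80 * kb * 1024 / 8)}'
```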
I'm re-testing my system a bit using the above guidelines from @rbapat-ampere — thanks! See https://github.com/geerlingguy/top500-benchmark/issues/10
Thanks @geerlingguy. Also, if you'd like a more detailed guide including performance expectations (i.e. scores that we are observing on our end), refer to: https://github.com/AmpereComputing/HPL-on-Ampere-Altra
Interesting findings after upgrading to 96 GB of RAM (6 x 16 GB ECC DDR4 sticks)
That new Geekbench 5 multicore score bumps the system up to page 216 of the results
The Geekbench 6.1.0 for Windows aarch64 preview seems to work and the result is submitted, but then the console window it's running in disappears without giving a claim key. Here is the result of one of my test runs: https://browser.geekbench.com/v6/cpu/1567571 (1008 single-core / 10481 multi-core).
Watching Windows' built-in CPU core monitor, it seems less than half of the available 96 cores are actually in use, so the multi core score is wildly inaccurate. See: http://support.primatelabs.com/discussions/geekbench/82502-geekbench-6-doesnt-install-correctly-under-windows-on-arm-on-ampere
As an additional data point, with 96 GB of RAM, the measly 5647 Cinebench R23 score under Windows 11 for Arm bumps to a slightly-less-measly 6858 - that's a 20% speedup! Just imagine how well Cinebench would run if it weren't only optimized for X86 :)
Following the recommended installation for NVIDIA drivers for Linux, I am now seeing output through my 3080 Ti, and can have some more fun.
nvidia-smi output:
jgeerling@ampere-altra:~$ nvidia-smi
Mon Jun 12 09:55:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 0000000D:01:00.0 On | N/A |
| 48% 53C P0 111W / 400W | 315MiB / 12288MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3631 G /usr/lib/xorg/Xorg 96MiB |
| 0 N/A N/A 3805 G /usr/bin/gnome-shell 128MiB |
| 0 N/A N/A 5079 G ...7/usr/lib/firefox/firefox 87MiB |
+-----------------------------------------------------------------------------+
glmark2 score:
jgeerling@ampere-altra:~$ glmark2
=======================================================
glmark2 2021.02
=======================================================
OpenGL Information
GL_VENDOR: NVIDIA Corporation
GL_RENDERER: NVIDIA GeForce RTX 3080 Ti/PCIe
GL_VERSION: 4.6.0 NVIDIA 525.116.04
=======================================================
[build] use-vbo=false: FPS: 6367 FrameTime: 0.157 ms
[build] use-vbo=true: FPS: 12194 FrameTime: 0.082 ms
[texture] texture-filter=nearest: FPS: 12209 FrameTime: 0.082 ms
[texture] texture-filter=linear: FPS: 12207 FrameTime: 0.082 ms
[texture] texture-filter=mipmap: FPS: 12224 FrameTime: 0.082 ms
[shading] shading=gouraud: FPS: 12028 FrameTime: 0.083 ms
[shading] shading=blinn-phong-inf: FPS: 11940 FrameTime: 0.084 ms
[shading] shading=phong: FPS: 11976 FrameTime: 0.084 ms
[shading] shading=cel: FPS: 11990 FrameTime: 0.083 ms
[bump] bump-render=high-poly: FPS: 11369 FrameTime: 0.088 ms
[bump] bump-render=normals: FPS: 12373 FrameTime: 0.081 ms
[bump] bump-render=height: FPS: 12112 FrameTime: 0.083 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 12015 FrameTime: 0.083 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 11416 FrameTime: 0.088 ms
[pulsar] light=false:quads=5:texture=false: FPS: 12023 FrameTime: 0.083 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 5273 FrameTime: 0.190 ms
[desktop] effect=shadow:windows=4: FPS: 6034 FrameTime: 0.166 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1432 FrameTime: 0.698 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 1760 FrameTime: 0.568 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1996 FrameTime: 0.501 ms
[ideas] speed=duration: FPS: 8453 FrameTime: 0.118 ms
[jellyfish] <default>: FPS: 11187 FrameTime: 0.089 ms
[terrain] <default>: FPS: 1677 FrameTime: 0.596 ms
[shadow] <default>: FPS: 10011 FrameTime: 0.100 ms
[refract] <default>: FPS: 7168 FrameTime: 0.140 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 12174 FrameTime: 0.082 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 12092 FrameTime: 0.083 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 12086 FrameTime: 0.083 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 12119 FrameTime: 0.083 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 12069 FrameTime: 0.083 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 12090 FrameTime: 0.083 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 12016 FrameTime: 0.083 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 11917 FrameTime: 0.084 ms
=======================================================
glmark2 Score: 9878
=======================================================
Super Tux Kart is giving 112 fps on max settings at 1440x900, and I'm going to test the Doom 3 install they have in their repo next.
The Doom 3 demo gave a constant 60 fps, and I couldn't seem to get it unlocked. I was able to install openarena as well; that gave a continuous 1000 fps when I set the max framerate to 0, but it started getting jittery (I don't know if I could go beyond that), and if I set 900 or below, it would just lock that frame rate all day.
I'd say the 3080 Ti is complete overkill for 10-20+ year old games :D
I was trying to get Stable Diffusion running, but ran into some issues with CUDA + NVIDIA driver versions. I might attempt again with a completely clean install, but NVIDIA's support makes it all quite 'fun', as finding compatible driver / CUDA version pairings can be a challenge, and it doesn't help that most of the Stable Diffusion guides want you to install their own mix of packages which may or may not break your working CUDA install, lol.
Using the optimized blis library from Oracle, I was able to get 985 Gflops at 270W, for a new efficiency of 3.64 Gflops/W
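Sanity-checking that efficiency number: 985 Gflops / 270 W comes out to about 3.65, so the 3.64 above looks truncated rather than rounded. As a one-liner:

```shell
# 985 Gflops at 270 W -> Gflops per watt.
awk 'BEGIN {printf "%.2f Gflops/W\n", 985/270}'
```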
This chip is the M96-28, which is a 96-core part running at 2.8 GHz
Have you tried to verify the '2.8 GHz' (e.g. with sbc-bench -j)?
The reason why I'm asking (link might not work with Safari for whatever reason) is Willy's cpufreq measurements here at the very bottom: https://www.cnx-software.com/2023/04/18/adlink-ampere-altra-dev-kit-features-atx-motherboard-with-32-to-80-core-arm-com-hpc-cpu-module/#comment-608818
New video today: https://www.youtube.com/watch?v=ydGdHjIncbk
And @ThomasKaiser - I'll try to get a run on it later today.
Hi Jeff,
To run my benchmark prime_threads, just follow the README. I've coded everything in C++ and its standard library.
My dev system is an Ubuntu laptop using an i7-12700H. My precise OS version is Ubuntu 22.04.2 LTS 64-bit.
So far on ARM I've only tried it on a RPi4B without issue. One thing that might be a problem is that one LD flag is -lpthread, though I could be wrong, as I am only about 9 months into C++.
Repo (EDIT: Corrected the repo link): prime_threads
@ThomasKaiser - Posting results here: https://github.com/ThomasKaiser/sbc-bench/issues/72
jgeerling@adlink-ampere:~/sbc-bench$ sudo ./sbc-bench.sh
sbc-bench v0.9.42
Installing needed tools: apt -f -qq -y install lm-sensors sysstat powercap-utils p7zip, tinymembench, ramlat, mhz. Done.
Checking cpufreq OPP. Done (results will be available in 7-10 minutes).
Executing tinymembench. Done.
Executing RAM latency tester. Done.
Executing OpenSSL benchmark. Done.
Executing 7-zip benchmark. Done.
Checking cpufreq OPP again. Done (8 minutes elapsed).
Results validation:
* Measured clockspeed not lower than advertised max CPU clockspeed
* No swapping
* Background activity (%system) OK
* Throttling occured
Memory performance
memcpy: 10132.6 MB/s
memset: 44753.7 MB/s
7-zip total scores (3 consecutive runs): 249521,248641,249970, single-threaded: 3858
OpenSSL results:
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 796165.06k 1576202.24k 2023527.51k 2171779.07k 2227093.50k 2232833.37k
aes-128-cbc 791871.84k 1579594.97k 2023827.97k 2169844.05k 2226918.74k 2232669.53k
aes-192-cbc 750426.13k 1387396.22k 1718240.43k 1807296.85k 1857686.19k 1861642.92k
aes-192-cbc 744861.90k 1387979.75k 1717576.70k 1807480.15k 1857890.99k 1861675.69k
aes-256-cbc 715428.22k 1227382.25k 1484381.87k 1563873.62k 1593264.81k 1596145.66k
aes-256-cbc 716891.95k 1230795.03k 1486725.72k 1563989.67k 1593251.16k 1596080.13k
Full results uploaded to http://ix.io/4zGI
@AlexLandherr - I ran the benchmark and noticed only one core was getting any load:
jgeerling@adlink-ampere:~/prime_benchmark$ ./prime_benchmark
Start prime_benchmark between 1 and 100000000? [Y/n]: Y
Starting prime benchmark...
Iteration 1 of 10 Runtime (ns): 98088310584
Iteration 2 of 10 Runtime (ns): 98097435960
Iteration 3 of 10 Runtime (ns): 98094310171
Iteration 4 of 10 Runtime (ns): 98087703900
Iteration 5 of 10 Runtime (ns): 98089309445
...
I think it might be faster if it ran on all cores :D
Check the corrected link, I messed up the URLs. Correct one: https://github.com/AlexLandherr/prime_threads
EDIT 1: I hope it does better than the result from that old test program which I probably should make private again since it's legacy code to myself.
EDIT 2: Now the old repo is private. I've modified the prime_threads source code a bit so that all the iteration times from a run are also written to the log file in the logs directory.
If you successfully run it, would it be okay with you if I link to your results in my README for prime_threads?
Talking about reinventing the wheel...
I ran the benchmark and noticed only one core was getting any load
Same with Bruce's primes benchmark he introduced over there a while ago: https://hoult.org/primes.txt
I think it might be faster if it ran on all cores
I wonder which use case / workload such a multi-threaded 'benchmark' could represent, since the whole working set fits inside CPU caches, but server machines like these Ampere boxes usually run tasks that depend (heavily) on RAM access...
I haven't thought about that so much; I mostly wrote mine to learn more about the language itself, and time seemed like an easily understandable measure.
@AlexLandherr - results:
jgeerling@adlink-ampere:~/prime_threads$ ./prime_threads
System supports 96 threads.
****
Search limit for all options are (both are inclusive limits):
Lower search limit: 1
Upper search limit: 100000000
Select benchmark mode from list below:
0. Single-threaded benchmark.
1. Multi-threaded benchmark (96 threads).
2. Exit program.
Enter one of the listed values: 1
Numerator (also upper search limit): 100000000
Denominator (also how many threads): 96
Quotient: 1041666
Remainder: 64
****
Starting multithreaded prime benchmark...
Started at: 2023-07-05 02:10:34 UTC
Iteration 1 of 10 Runtime (ns): 1650924933
Iteration 2 of 10 Runtime (ns): 1381674931
Iteration 3 of 10 Runtime (ns): 1381530659
Iteration 4 of 10 Runtime (ns): 1381874490
Iteration 5 of 10 Runtime (ns): 1382733573
Iteration 6 of 10 Runtime (ns): 1380218386
Iteration 7 of 10 Runtime (ns): 1392295779
Iteration 8 of 10 Runtime (ns): 1380862075
Iteration 9 of 10 Runtime (ns): 1381572084
Iteration 10 of 10 Runtime (ns): 1381265241
Prime multithreaded benchmark is done!
Stopped at: 2023-07-05 02:10:48 UTC
**** Results ****
Program mode is: Multithreaded
Thread count: 96
Search started at: 2023-07-05 02:10:34 UTC
Search ended at: 2023-07-05 02:10:48 UTC
Program ran for total of (DD:HH:MM:SS.SSSSSSSSS): 00:00:00:14.095112951
Program ran for: 14095112951 ns
Average time to find all primes between 1 and 100000000 was (DD:HH:MM:SS.SSSSSSSSS):
00:00:00:01.409495215
Average search time: 1409495215 ns
Number of primes found is: 5761455
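The reported average checks out against the ten per-iteration runtimes above; summing them and dividing by ten reproduces the "Average search time" line:

```shell
# Average the ten iteration runtimes (ns) printed by prime_threads above.
printf '%s\n' 1650924933 1381674931 1381530659 1381874490 1382733573 \
              1380218386 1392295779 1380862075 1381572084 1381265241 |
  awk '{sum += $1} END {printf "average: %d ns\n", sum / NR}'
```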
@ThomasKaiser - I compiled @brucehoult's primes.txt and got:
jgeerling@adlink-ampere:~$ gcc -O3 primes.c -o primes
jgeerling@adlink-ampere:~$ ./primes
Starting run
3713160 primes found in 5641 ms
-480 bytes of code in countPrimes()
That seems slow to me, though the Ampere doesn't necessarily have the fastest single core performance, either...
(And with -O1:)
3713160 primes found in 5830 ms
256 bytes of code in countPrimes()
I compiled @brucehoult's primes.txt ... That seems slow to me
Why? The Ampere's N1 cores are A76 siblings (see also the core IDs) and for an A76 at 2.8 GHz with these cache sizes this looks reasonable.
And we shouldn't forget that the 2.8 GHz are still questionable, see Willy's investigations really measuring clockspeeds since unfortunately we can't trust the values the cpufreq driver spits out in general :(
Thanks very much Jeff! Here's my result for the max power settings on my i7-12700H:
Program mode is: Multithreaded
Thread count: 20
Search started at: 2023-07-05 08:33:13 UTC
Search ended at: 2023-07-05 08:34:11 UTC
Program ran for total of (DD:HH:MM:SS.SSSSSSSSS): 00:00:00:58.569630496
Program ran for: 58569630496 ns
Average time to find all primes between 1 and 100000000 was (DD:HH:MM:SS.SSSSSSSSS):
00:00:00:05.856952223
Average search time: 5856952223 ns
Number of primes found is: 5761455
**** Runtime for each iteration ****
Iteration 1 of 10 Runtime (ns): 5806562542
Iteration 2 of 10 Runtime (ns): 5679193186
Iteration 3 of 10 Runtime (ns): 5739044034
Iteration 4 of 10 Runtime (ns): 5868012868
Iteration 5 of 10 Runtime (ns): 5759130597
Iteration 6 of 10 Runtime (ns): 5850852134
Iteration 7 of 10 Runtime (ns): 6006729002
Iteration 8 of 10 Runtime (ns): 5994649762
Iteration 9 of 10 Runtime (ns): 5895252589
Iteration 10 of 10 Runtime (ns): 5970095523
Program started at: 2023-07-05 08:33:04 UTC
Program ended at: 2023-07-05 08:34:11 UTC
Is it okay with you for me to link to your results in my README?
Sure!
It's live since yesterday. I've been AFK from my dev laptop for about 48-ish hours.
Next up: learn more about make (I've got a book for it)!
Was a little bored over the weekend, and since you started this on Windows 11, I had to try it on Ubuntu with a graphics card installed.
Most of the time it will not start up; when it does it's a miracle. Got it working with an Nvidia RTX 6000 card. Online instructions are terrible: I had to combine several directions to even get box86/box64 correctly compiled and installed. Borrowed some instructions from the RPi 4, but the build flags are a little different. This is the steam.deb from the official homepage.
We did not run any games yet, because when it aborts it takes many starts to get it back up. We will spend some time documenting it.
A lot of complaints, though, asking for i386 libraries that are of course not available for arm (armhf).
@Bohanke - I was having similar issues with the i386 packages.
Rock solid:

Start with a full Ubuntu Server 20.04.5 install.

$ sudo apt update
$ sudo apt upgrade
$ sudo apt install tasksel
$ sudo tasksel install ubuntu-desktop
$ sudo systemctl set-default graphical.target
$ sudo reboot

$ sudo nano /etc/default/grub

Add "pcie_aspm=off" into the GRUB_CMDLINE_LINUX="" line and save the change. Execute the command to update grub and then reboot:

$ sudo update-grub
$ sudo touch /etc/modprobe.d/blacklist-nouveau.conf
$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf

Add:

blacklist nouveau
options nouveau modeset=0

Save the file, then:

$ sudo update-initramfs -u

You may see "W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast". Ignore it!

$ sudo reboot
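The two config edits above can also be scripted instead of using nano. A hedged sketch, which assumes GRUB_CMDLINE_LINUX is still the empty Ubuntu default before sed touches it (verify first):

```shell
# Append pcie_aspm=off to the (assumed empty) GRUB_CMDLINE_LINUX line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX=""$/GRUB_CMDLINE_LINUX="pcie_aspm=off"/' /etc/default/grub
sudo update-grub

# Blacklist nouveau so the proprietary NVIDIA installer can take over.
printf 'blacklist nouveau\noptions nouveau modeset=0\n' |
  sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
```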
Install some tools for Nvidia graphics, and maybe CUDA later??

$ sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa-dev libfreeimage-dev libglfw3-dev

Go to the Nvidia homepage to download the install package for your particular card!!! In this example the card is an RTX 6000.
Copy the file into the home directory and make sure it's executable, then run the installer:

$ sudo ./NVIDIA-Linux-aarch64-535.54.03.run

(Do not do this in an SSH shell, it has to be done locally!!!!)

$ sudo reboot
BOX86 + Steam and BOX64

sudo dpkg --add-architecture armhf

==== Do not compile right now, because the standard GCC and gcc-arm-linux-gnueabihf needed for box86 are not clear to me !!!

Use @Itai-Nelken's apt repository to install precompiled box86 debs, updated weekly:

sudo wget https://itai-nelken.github.io/weekly-box86-debs/debian/box86.list -O /etc/apt/sources.list.d/box86.list
wget -qO- https://itai-nelken.github.io/weekly-box86-debs/debian/KEY.gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/box86-debs-archive-keyring.gpg
sudo apt update && sudo apt install box86:armhf -y

Still use the script file to install Steam, but we need to clone the box86 git sources for that:

$ git clone https://github.com/ptitSeb/box86
$ cd box86
$ ./install_steam.sh

$ sudo wget https://ryanfortner.github.io/box64-debs/box64.list -O /etc/apt/sources.list.d/box64.list
$ wget -qO- https://ryanfortner.github.io/box64-debs/KEY.gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/box64-debs-archive-keyring.gpg
$ sudo apt update && sudo apt install box64-generic-arm -y

Maybe needed??

$ sudo systemctl restart systemd-binfmt

$ sudo reboot
We (me and my son :)) made a target list of games to try out first. They are listed as Platinum when it comes to compatibility.
Here is a lot of compatibility info : https://www.protondb.com/
Castle Crashers 255MB !!
Human: Fall Flat 500MB !!
Untitled Goose Game 800MB
Amazing Cultivation Sim. 1GB xxx
MudRunner 1GB
Cuphead 2GB
Inside 3GB
Deep Rock Galactic 3GB xxx
Pit People 3GB
Farming Simulator 15 3GB xxx
Risk of Rain 2 4GB
Grim Dawn 5GB xxx
What Remains of Edith Finch 5GB
Alien Swarm: Reactive Drop 6GB
ICEY 6GB
Rocket League 7GB
Outer Wilds 8GB xxx
Mordheim: City of the Damned 8GB
Dishonored 9GB
Little Nightmares 10GB
Little Nightmares II 10GB
DRAGON BALL XENOVERSE 10GB xxx
Bloodstained: Ritual of the Night 10GB xxx
Ori and the Blind Forest: Def Edition 11GB
STEINS;GATE 13GB xxx
Satisfactory 15GB xxx
Atelier Ryza: Ever Darkness & Hideout 15GB xxx
AI*Shoujo/AI*少女 15GB xxx
A Plague Tale: Innocence 20GB xxx
鬼谷八荒 Tale of Immortal 25GB xxx
Sekiro: Shadows Die Twice - GOTY Edition 25GB
Ryse: Son of Rome 26GB
Control Ultimate Edition 26GB
Aliens: Fireteam Elite 30GB
Wolfenstein II: The New Colossus 42GB
Wolfenstein: The New Order 44GB
The Witcher 3: Wild Hunt 50GB
NieR:Automata 50GB
ACE COMBAT 7: SKIES UNKNOWN 50GB
SCARLET NEXUS 50GB
Doom 55GB
xxx means maybe small issues
Still trying to replicate the 1.2+ Tflop results with the 96-core CPU on my Dev Workstation, but still only getting just under 1 Tflop: https://github.com/AmpereComputing/HPL-on-Ampere-Altra/issues/10
still only getting just under 1 Tflop
Have you tried monitoring the benchmark execution this time? Do you actually know that your CPU cores clocked at the expected clockspeed (I already mentioned the underlying issue earlier in this issue), and that neither throttling nor swapping occurred? The Linpack FAQ clearly states:
> If the problem size you pick is too large, swapping will occur, and the performance will drop
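For reference, a common rule of thumb for avoiding exactly that (my sketch, not from the FAQ) is to size the HPL problem so the N x N double-precision matrix fills roughly 80% of RAM, then round N down to a multiple of the block size NB:

```shell
# Rough HPL problem-size calculator: N ~= sqrt(0.80 * RAM_bytes / 8),
# rounded down to a multiple of NB (8 bytes per double-precision element).
hpl_n() {
  awk -v gb="$1" -v nb="$2" 'BEGIN {
    n = sqrt(0.80 * gb * 1024^3 / 8)   # matrix fills ~80% of RAM
    print int(n / nb) * nb             # round down to a multiple of NB
  }'
}

# For the 96 GB configuration discussed in this thread, with NB=256:
hpl_n 96 256    # -> 101376
```

Going much beyond that value is what invites the swapping the FAQ warns about; going much below it underutilizes the machine.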
`sbc-bench -R` is designed for just this: checking whether the execution of any other benchmark does what it should. Run it in another terminal window prior to benchmarking and stop it with [ctrl]-[c] afterwards. Real (measured) clockspeeds will be reported before and after the benchmark, along with whether throttling and/or swapping occurred.
@ThomasKaiser - Yes, we've been monitoring that. At this point I went to the extent of buying new Samsung RAM to match the sticks in Ampere's test system, and we've seen that the Samsung x4 RAM sticks produce 20-30% faster RAM benchmark results than the Transcend x8 sticks (specs otherwise identical: DDR4-3200 ECC CL22).
I just re-tested in tinymembench and the Samsung RAM is getting:
standard memcpy : 12107.0 MB/s
standard memset : 44746.4 MB/s
Whereas the Transcend RAM was getting:
standard memcpy : 9894.0 MB/s
standard memset : 44745.2 MB/s
For more details, see https://github.com/AmpereComputing/HPL-on-Ampere-Altra/issues/10#issuecomment-1711717393
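For what it's worth, the memcpy delta above works out to about 22% (a one-liner using the two standard memcpy figures quoted in this comment):

```shell
# Percentage speedup of the Samsung x4 sticks over the Transcend x8 sticks,
# using the standard memcpy numbers above (12107.0 vs 9894.0 MB/s):
awk 'BEGIN { printf "memcpy: +%.1f%%\n", (12107.0 / 9894.0 - 1) * 100 }'
# -> memcpy: +22.4%
```

That lands right at the low end of the 20-30% spread mentioned above; memset is essentially unchanged, consistent with it being limited by something other than the DIMMs.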
Re-running Top500 HPL benchmark now.
Basic information
(This thing is... kinda more than a 'board' — but I still want data somewhere, and this is as good a place as any!)
Linux/system information
Benchmark results
CPU
Configured with 96 GB RAM (6 x 16GB DDR4 ECC Registered DIMMs):
Power
- `stress-ng --matrix 0`: 220 W (242 W with 96 GB RAM)
- `top500` HPL benchmark: 296 W (4.01 Gflops/W)

Disk
Transcend 128GB PCIe Gen 3 x4 NVMe SSD (TS128GMTE652T)
```
curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash
```

Run the benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading. Download the script with `curl -o disk-benchmark.sh [URL_HERE]` and run `sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh` (assuming the device is `sda`).

Also consider running the PiBenchmarks.com script.
PiBenchmarks.com result: TODO - should be on https://pibenchmarks.com/latest/ soon
Network
(Everything runs as expected... this thing's a bona fide server!)
GPU
Memory
`tinymembench` results:
```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                             :   9424.0 MB/s
 C copy backwards (32 byte blocks)            :   9387.8 MB/s
 C copy backwards (64 byte blocks)            :   9390.8 MB/s
 C copy                                       :   9366.1 MB/s
 C copy prefetched (32 bytes step)            :   9984.4 MB/s
 C copy prefetched (64 bytes step)            :   9984.1 MB/s
 C 2-pass copy                                :   6391.4 MB/s
 C 2-pass copy prefetched (32 bytes step)     :   7237.8 MB/s
 C 2-pass copy prefetched (64 bytes step)     :   7489.6 MB/s
 C fill                                       :  43884.4 MB/s
 C fill (shuffle within 16 byte blocks)       :  43885.4 MB/s
 C fill (shuffle within 32 byte blocks)       :  43884.2 MB/s
 C fill (shuffle within 64 byte blocks)       :  43877.5 MB/s
 NEON 64x2 COPY                               :   9961.9 MB/s
 NEON 64x2x4 COPY                             :  10091.6 MB/s
 NEON 64x1x4_x2 COPY                          :   8171.5 MB/s
 NEON 64x2 COPY prefetch x2                   :  11822.9 MB/s
 NEON 64x2x4 COPY prefetch x1                 :  12123.8 MB/s
 NEON 64x2 COPY prefetch x1                   :  11836.5 MB/s
 NEON 64x2x4 COPY prefetch x1                 :  12122.3 MB/s
 ---
 standard memcpy                              :   9894.0 MB/s
 standard memset                              :  44745.2 MB/s
 ---
 NEON LDP/STP copy                            :   9958.0 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)  :  11415.6 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)  :  11420.5 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)  :  11475.2 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)  :  11452.9 MB/s
 NEON LD1/ST1 copy                            :  10094.8 MB/s
 NEON STP fill                                :  44744.7 MB/s
 NEON STNP fill                               :  44745.2 MB/s
 ARM LDP/STP copy                             :  10136.4 MB/s
 ARM STP fill                                 :  44731.7 MB/s
 ARM STNP fill                                :  44730.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers  ==
== of different sizes. The larger is the buffer, the more significant  ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM     ==
== accesses. For extremely large buffer sizes we are expecting to see  ==
== page table walk with several requests to SDRAM for almost every     ==
== memory access (though 64MiB is not nearly large enough to           ==
== experience this effect to its fullest).                             ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if   ==
==         the memory subsystem can't handle multiple outstanding      ==
==         requests, dual random read has the same timings as two      ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    1.3 ns          /     1.8 ns
    262144 :    2.3 ns          /     2.9 ns
    524288 :    3.2 ns          /     3.9 ns
   1048576 :    3.6 ns          /     4.2 ns
   2097152 :   22.9 ns          /    33.0 ns
   4194304 :   32.6 ns          /    40.9 ns
   8388608 :   38.1 ns          /    43.5 ns
  16777216 :   43.2 ns          /    48.6 ns
  33554432 :   86.2 ns          /   112.2 ns
  67108864 :  109.3 ns          /   135.2 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    1.3 ns          /     1.8 ns
    262144 :    1.9 ns          /     2.3 ns
    524288 :    2.2 ns          /     2.5 ns
   1048576 :    2.6 ns          /     2.8 ns
   2097152 :   21.6 ns          /    31.6 ns
   4194304 :   31.1 ns          /    39.4 ns
   8388608 :   35.8 ns          /    41.7 ns
  16777216 :   38.5 ns          /    43.0 ns
  33554432 :   79.9 ns          /   104.9 ns
  67108864 :  101.1 ns          /   125.4 ns
```
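As a quick read on the latency table (my arithmetic, using the 64 MiB single-random-read rows), MADV_HUGEPAGE shaves roughly 7-8% off DRAM-bound access latency:

```shell
# 64 MiB single random read: 109.3 ns [MADV_NOHUGEPAGE] vs 101.1 ns [MADV_HUGEPAGE]
awk 'BEGIN { printf "hugepage latency improvement: %.1f%%\n", (1 - 101.1 / 109.3) * 100 }'
# -> hugepage latency improvement: 7.5%
```

The improvement comes from fewer TLB misses and shorter page-table walks at large buffer sizes; the sub-2 MiB rows are unchanged because those fit in cache.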