geerlingguy / top500-benchmark

Automated Top500 benchmark for clusters or single nodes.
MIT License

Benchmark 128-core System76 Thelio Astra #44

Open geerlingguy opened 1 week ago

geerlingguy commented 1 week ago

The Thelio Astra has an M128-30 Ampere Altra Max CPU, and the configuration I was sent includes 512 GB of ECC DDR4-3200 RAM. See: https://github.com/geerlingguy/sbc-reviews/issues/53

geerlingguy commented 1 week ago

On my first run, it compiled and started the benchmark, but a few minutes in, after the system had been drawing 430W or so continuously (my UPS beeped a bit as I passed its 600W threshold), I saw power draw drop to 38W, and the system seemed to be locked up. Even a reboot from OpenBMC didn't seem to restore it; it stayed stuck in the powered-off state even when I tried powering it on via the BMC.

I had to manually power cycle the machine using the power button.

I'm also wondering... I never heard the fans spin up at all; they just stayed at their idle RPM AFAICT. Maybe the fan curve or the fan control on the little breakout adapter isn't working correctly? I'll ask System76 if that could be the case.

geerlingguy commented 1 week ago

btop, hilariously, is displaying the CPU temp in thousands of degrees C:

[Screenshot: btop CPU temperature readout, 2024-10-23 11:38:31 AM]

I'm going to monitor temps with sensors on a 2nd benchmark run, maybe the fan curve needs fixing.
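
For anyone who wants to log temperatures during a run without watching sensors by hand, here is a minimal Python sketch. It assumes the standard Linux hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius), which is the same data sensors reports. The function names `read_temps` and `track_peaks` are made up for illustration, and the sensors present will differ per board.

```python
import time
from pathlib import Path


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every readable temp*_input file."""
    temps = {}
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for input_file in sorted(chip.glob("temp*_input")):
            channel = input_file.name[: -len("_input")]  # e.g. "temp1"
            label_file = chip / f"{channel}_label"
            label = label_file.read_text().strip() if label_file.exists() else channel
            try:
                # hwmon reports temperatures in millidegrees Celsius
                temps[f"{chip_name}/{label}"] = int(input_file.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # skip sensors that fail to read
    return temps


def track_peaks(samples, interval_s=5.0, hwmon_root="/sys/class/hwmon"):
    """Poll `samples` times and return the highest reading seen per sensor."""
    peaks = {}
    for i in range(samples):
        for sensor, temp_c in read_temps(hwmon_root).items():
            peaks[sensor] = max(temp_c, peaks.get(sensor, temp_c))
        if i < samples - 1:
            time.sleep(interval_s)
    return peaks
```

Calling `track_peaks(samples=120, interval_s=5)` in a second terminal would cover roughly ten minutes of a benchmark run and return the per-sensor maximums at the end.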

geerlingguy commented 1 week ago

It looks like cooling is the issue — I'll contact System76 and ask about it.

SoC temps reached 95°C and then hovered between 95 and 98°C, with power pegged around 250W. I also encountered a lockup during the 'Background Blur' benchmark on Geekbench 6, and I'm guessing that was thermal too.

bexcran commented 1 week ago

> btop, hilariously, is displaying the CPU temp in thousands of degrees C:

Press 'o' for options, press '2' for the cpu tab and scroll down to 'Cpu sensor'. Press the left/right arrows to select 'apm_xgene/SoC Temperature' instead of 'apm_xgene/IO power'.
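
To see which sensor IDs a machine actually exposes before picking one in btop, a rough Python sketch like the one below can help. It assumes the kernel's hwmon sysfs conventions (temp*_input in millidegrees Celsius, power*_input in microwatts); `list_channels` is a hypothetical helper name, and the chip and label names on any given system may not match the ones mentioned above. A power channel read as if it were a temperature would be one plausible way to end up with a wildly inflated number.

```python
from pathlib import Path

# Divisors and units per the kernel hwmon sysfs conventions:
# temp*_input is millidegrees Celsius, power*_input is microwatts.
UNITS = {"temp": (1_000, "C"), "power": (1_000_000, "W")}


def list_channels(hwmon_root="/sys/class/hwmon"):
    """Return (sensor_id, kind, value, unit) tuples for temp and power channels."""
    rows = []
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for input_file in sorted(chip.glob("*_input")):
            channel = input_file.name[: -len("_input")]  # e.g. "power2"
            kind = channel.rstrip("0123456789")          # e.g. "power"
            if kind not in UNITS:
                continue  # ignore fans, voltages, and so on
            label_file = chip / f"{channel}_label"
            label = label_file.read_text().strip() if label_file.exists() else channel
            try:
                raw = int(input_file.read_text())
            except (OSError, ValueError):
                continue  # skip channels that fail to read
            divisor, unit = UNITS[kind]
            rows.append((f"{chip_name}/{label}", kind, raw / divisor, unit))
    return rows


def print_channels(hwmon_root="/sys/class/hwmon"):
    """Print one line per channel so temperature and power sensors are easy to tell apart."""
    for sensor_id, kind, value, unit in list_channels(hwmon_root):
        print(f"{sensor_id:<40} {kind:<6} {value:10.1f} {unit}")
```

Running `print_channels()` prints each channel with its kind and unit, which makes it easier to spot a power sensor that a tool has picked up as the CPU temperature source.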

geerlingguy commented 1 week ago

Ha! Didn't even think of that. btop's been pretty reliable in picking the right metric on other platforms; this was the first time I tried it on Ampere. Will tuck that away for when I get the Astra back. System76 is going to exchange systems, as the one I was sent likely had shipping damage (one fan sounded horrible, and one of the CPU fans had dislodged from the CPU cooler and was rattling around inside the cooling duct, likely banging into the motherboard).