Open suve opened 3 months ago
Thank you for reporting this.
I haven't noticed it in any of my ARM machines because at least the first one will only occur if the SoC cannot be detected with the "main" detection methods. Strangely, it was not noted even for users that would run that function (e.g., #262). It's a pity v1.06 comes with this, but at least it's fixed now on master. Can you confirm if this works for you?
Also, would you mind opening another issue with the machine where it was crashing without this last patch? I assume it will give you unkown SoC. Thanks!
first one will only occur if the SoC cannot be detected with the "main" detection methods
That seems to be precisely the case here.
SoC: Unknown
Technology: Unknown
Microarchitecture: Neoverse N1
Max Frequency: 3.000 GHz
Cores: 80 cores
Features: NEON,SHA1,SHA2,AES,CRC32
Peak Performance: 3.84 TFLOP/s
It's a pity v1.06 comes with this, but at least it's fixed now on master. Can you confirm if this works for you?
I patched the program myself, but looking at commit aa94389bbe1209faa0f98b031d5cc6b39ec22e74, our fixes are identical. I can confirm that the program works fine with those changes.
Also, would you mind opening another issue with the machine where it was crashing without this last patch?
The second issue was actually something I've stumbled across by accident while reading the code; it didn't actually trigger a crash on this machine. Sorry, should have made that clear.
Yeah well, I meant opening a new issue to track the unknown SoC (the machine you have showed). But we can try to solve that here if you want.
If that's the case, what machine are we talking about in this case? Model? Also, what is the output of cpufetch --verbose
? cat /proc/device-tree/compatible
? Anything that could be indicating the model under lspci -nn
?
So the story is, I maintain the cpufetch package in Fedora Linux. I observed these crashes when trying to build the package using the official distro builders. Since I have rather limited access to these machines, I asked for help on the devel mailing list. Here's the answer from Kevin Fenzi:
So, there's 4 kinds of aarch64 builders:
1. Our oldest ones, buildhw-a64-01/02,07/08/09/10/11/12/13/21/22:
These are lenovo emags.
lscpu says:
Vendor ID: APM
BIOS Vendor ID: Ampere(TM)
Model name: -
BIOS Model name: eMAG CPU @ 3.0GHz
cpufetch says:
[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list
SoC: Unknown
##### ## # ##### ## #### ###### Technology: Unknown
### #### ### #### ### ### Microarchitecture: Xgene
### ## ### ### ## ### Max Frequency: 3.000 GHz
### #### ### ### ## ### Cores: 32 cores
###### ## ### ### ## ### Features: NEON,SHA1,SHA2,AES,CRC32
Peak Performance: 768.00 GFLOP/s
No /proc/device-tree/compatible
2. We have two thunderx2's, buildhw-a64-19/20:
Vendor ID: Cavium
BIOS Vendor ID: Cavium Inc.
Model name: ThunderX2-99xx
BIOS Model name: Cavium ThunderX2(R) CPU CN9975 v2.1 @ 2.20GHz CN9975-2200BG4077-PR21
-Y-G CPU @ 2.2GHz
SoC: Unknown
##### ## # ##### ## #### ###### Technology: Unknown
### #### ### #### ### ### Microarchitecture: ThunderX2 99XX
### ## ### ### ## ### Max Frequency: ~2.140 GHz
### #### ### ### ## ### Cores: 224 cores
###### ## ### ### ## ### Features: NEON,SHA1,SHA2,AES,CRC32
Peak Performance: 3.83 TFLOP/s
There's a ton of lines for each cpu like:
...
[WARNING]: Could not open '/sys/devices/system/cpu/cpu223/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to fetch max frequency for core 223. This is probably because the core is offline
[WARNING]: Could not open '/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to find max frequency from udev, measuring CPU frequency
[WARNING]: Running frequency measurement with 5250000 iterations on core 0...
cpufetch is measuring the max frequency...[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list
3. We have 4 newer mt snow boxes: buildhw-a64-03/04/05/06
Vendor ID: ARM
BIOS Vendor ID: Ampere(R)
Model name: Neoverse-N1
BIOS Model name: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
BIOS CPU family: 257
Model: 1
[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list
SoC: Unknown
##### ## # ##### ## #### ###### Technology: Unknown
### #### ### #### ### ### Microarchitecture: Neoverse N1
### ## ### ### ## ### Max Frequency: 3.000 GHz
### #### ### ### ## ### Cores: 80 cores
###### ## ### ### ## ### Features: NEON,SHA1,SHA2,AES,CRC32
Peak Performance: 3.84 TFLOP/s
4. There's a bunch of buildvm-a64's on newer mt snow boxes:
Vendor ID: ARM
BIOS Vendor ID: QEMU
Model name: Neoverse-N1
BIOS Model name: virt-rhel9.2.0 CPU @ 2.0GHz
BIOS CPU family: 1
...
[WARNING]: Could not open '/sys/devices/system/cpu/cpu11/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to fetch max frequency for core 11. This is probably because the core is offline
[WARNING]: Could not open '/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to find max frequency from udev, measuring CPU frequency
[WARNING]: Running frequency measurement with 7250000 iterations on core 0...
cpufetch is measuring the max frequency...[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list
SoC: Unknown
##### ## # ##### ## #### ###### Technology: Unknown
### #### ### #### ### ### Microarchitecture: Neoverse N1
### ## ### ### ## ### Max Frequency: ~2.990 GHz
### #### ### ### ## ### Cores: 12 cores
###### ## ### ### ## ### Features: NEON,SHA1,SHA2,AES,CRC32
Peak Performance: 574.08 GFLOP/s
Looking at my build history, the crashing builds occurred on type 1, 3 and 4.
Sorry for the late answer. Feel free to ask questions in the mailing list thread to avoid using me as an intermediary.
I think participating in the mailing list is the wrong thing to do for me. The issues should all be collected in one place (here). Spreading them across different places does not seem the most convinient approach for a developer. The issues page is always open for those who want to report any issue. In particular, I'm interested in knowing more about the sentence:
"I suspect upstream needs to generalise the way they deal with some of the pieces, happy to assist with comments upstream."
I'm afraid that if we want to keep the customized logo for each vendor and the manufacturing process this is going to be hard to accomplish.
From the machines you have shown, the first and the second one could be potentially fixed, for which I'd need the output of hexdump -C /proc/device-tree/compatible
(I'm actively working on adding support for ARM servers before they are reported in the issues page but it's not my priority right now. I'd be happy to accept PR to improve the support for this). The third machine is already supported (#262, you would need to build from source) and the last one seems to be a VM, which I don't support.
I know none of those machines use device tree, they all use ACPI to define the hardware.
In such case, what do you think would be a non-root approach to uniquely identify the SoC?
Using the latest cpufetch v1.06, the program builds fine, but then crashes when ran on aarch64. Debugging with Valgrind led me to find two off-by-one errors:
Commit 1b746bc67d58a6f317233440be3e0336ecfcf8ac introduced this code to
src/common.pci.c
:When calculating
path_size
, the code correctly adds 2 bytes for the extra slashes introduced by thesprintf()
call. However, bothemalloc()
calls fail to add another extra byte for the NUL terminator.Commit 90624b9aaa778db324f79095ae2e10ae50b29011 introduced the following change in
src/arm/midr.c
:If you look carefully, you'll notice the old string was terminated with a comma, whereas the new one is not.