Dr-Noob / cpufetch

Simple yet fancy CPU architecture fetching tool
GNU General Public License v2.0
1.91k stars 103 forks source link

Crashes on aarch64 due to off-by-one errors #264

Open suve opened 3 months ago

suve commented 3 months ago

Using the latest cpufetch v1.06, the program builds fine, but then crashes when ran on aarch64. Debugging with Valgrind led me to find two off-by-one errors:


Commit 1b746bc67d58a6f317233440be3e0336ecfcf8ac introduced this code to src/common.pci.c:

    struct pci_device* dev = pci->devices[i];
    int path_size = strlen(PCI_PATH) + strlen(dev->path) + 2;

    // Read vendor_id
    char *vendor_id_path = emalloc(sizeof(char) * (path_size + strlen("vendor")));
    sprintf(vendor_id_path, "%s/%s/%s", PCI_PATH, dev->path, "vendor");

/* [...] some code omitted for brevity [...] */

    // Read device_id
    char *device_id_path = emalloc(sizeof(char) * (path_size + strlen("device")));
    sprintf(device_id_path, "%s/%s/%s", PCI_PATH, dev->path, "device");

When calculating path_size, the code correctly adds 2 bytes for the extra slashes introduced by the sprintf() call. However, both emalloc() calls fail to add another extra byte for the NUL terminator.


Commit 90624b9aaa778db324f79095ae2e10ae50b29011 introduced the following change in src/arm/midr.c:

 char* get_str_features(struct cpuInfo* cpu) {
   struct features* feat = cpu->feat;
-  uint32_t max_len = strlen("NEON,SHA1,SHA2,AES,CRC32,") + 1;
+  uint32_t max_len = strlen("NEON,SHA1,SHA2,AES,CRC32,SVE,SVE2") + 1;
   uint32_t len = 0;
   char* string = ecalloc(max_len, sizeof(char));

If you look carefully, you'll notice the old string was terminated with a comma, whereas the new one is not.

Dr-Noob commented 3 months ago

Thank you for reporting this.

I haven't noticed it in any of my ARM machines because at least the first one will only occur if the SoC cannot be detected with the "main" detection methods. Strangely, it was not noted even for users that would run that function (e.g., #262). It's a pity v1.06 comes with this, but at least it's fixed now on master. Can you confirm if this works for you?

Also, would you mind opening another issue with the machine where it was crashing without this last patch? I assume it will give you unkown SoC. Thanks!

suve commented 3 months ago

first one will only occur if the SoC cannot be detected with the "main" detection methods

That seems to be precisely the case here.

SoC:                 Unknown
Technology:          Unknown
Microarchitecture:   Neoverse N1
Max Frequency:       3.000 GHz
Cores:               80 cores
Features:            NEON,SHA1,SHA2,AES,CRC32
Peak Performance:    3.84 TFLOP/s

It's a pity v1.06 comes with this, but at least it's fixed now on master. Can you confirm if this works for you?

I patched the program myself, but looking at commit aa94389bbe1209faa0f98b031d5cc6b39ec22e74, our fixes are identical. I can confirm that the program works fine with those changes.

Also, would you mind opening another issue with the machine where it was crashing without this last patch?

The second issue was actually something I've stumbled across by accident while reading the code; it didn't actually trigger a crash on this machine. Sorry, should have made that clear.

Dr-Noob commented 3 months ago

Yeah well, I meant opening a new issue to track the unknown SoC (the machine you have showed). But we can try to solve that here if you want.

If that's the case, what machine are we talking about in this case? Model? Also, what is the output of cpufetch --verbose? cat /proc/device-tree/compatible? Anything that could be indicating the model under lspci -nn?

suve commented 2 months ago

So the story is, I maintain the cpufetch package in Fedora Linux. I observed these crashes when trying to build the package using the official distro builders. Since I have rather limited access to these machines, I asked for help on the devel mailing list. Here's the answer from Kevin Fenzi:

So, there's 4 kinds of aarch64 builders:

1. Our oldest ones, buildhw-a64-01/02,07/08/09/10/11/12/13/21/22:

These are lenovo emags.
lscpu says:

Vendor ID:                APM
  BIOS Vendor ID:         Ampere(TM)
  Model name:             -
    BIOS Model name:      eMAG   CPU @ 3.0GHz

cpufetch says:

[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   Xgene
###       ##   ###      ###    ##    ###  Max Frequency:       3.000 GHz
 ###    ####   ###      ###    ##    ###  Cores:               32 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    768.00 GFLOP/s
No /proc/device-tree/compatible

2. We have two thunderx2's, buildhw-a64-19/20:

Vendor ID:                Cavium
  BIOS Vendor ID:         Cavium Inc.
  Model name:             ThunderX2-99xx
    BIOS Model name:      Cavium ThunderX2(R) CPU CN9975 v2.1 @ 2.20GHz CN9975-2200BG4077-PR21
                          -Y-G CPU @ 2.2GHz

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   ThunderX2 99XX
###       ##   ###      ###    ##    ###  Max Frequency:       ~2.140 GHz
 ###    ####   ###      ###    ##    ###  Cores:               224 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    3.83 TFLOP/s
There's a ton of lines for each cpu like: 
...
[WARNING]: Could not open '/sys/devices/system/cpu/cpu223/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to fetch max frequency for core 223. This is probably because the core is offline
[WARNING]: Could not open '/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to find max frequency from udev, measuring CPU frequency
[WARNING]: Running frequency measurement with 5250000 iterations on core 0...
cpufetch is measuring the max frequency...[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

3. We have 4 newer mt snow boxes: buildhw-a64-03/04/05/06

Vendor ID:                ARM
  BIOS Vendor ID:         Ampere(R)
  Model name:             Neoverse-N1
    BIOS Model name:      Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
    BIOS CPU family:      257
    Model:                1

[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   Neoverse N1
###       ##   ###      ###    ##    ###  Max Frequency:       3.000 GHz
 ###    ####   ###      ###    ##    ###  Cores:               80 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    3.84 TFLOP/s

4. There's a bunch of buildvm-a64's on newer mt snow boxes:

Vendor ID:                ARM
  BIOS Vendor ID:         QEMU
  Model name:             Neoverse-N1
    BIOS Model name:      virt-rhel9.2.0  CPU @ 2.0GHz
    BIOS CPU family:      1

...
[WARNING]: Could not open '/sys/devices/system/cpu/cpu11/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to fetch max frequency for core 11. This is probably because the core is offline
[WARNING]: Could not open '/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to find max frequency from udev, measuring CPU frequency
[WARNING]: Running frequency measurement with 7250000 iterations on core 0...
cpufetch is measuring the max frequency...[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   Neoverse N1
###       ##   ###      ###    ##    ###  Max Frequency:       ~2.990 GHz
 ###    ####   ###      ###    ##    ###  Cores:               12 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    574.08 GFLOP/s

Looking at my build history, the crashing builds occurred on type 1, 3 and 4.

Sorry for the late answer. Feel free to ask questions in the mailing list thread to avoid using me as an intermediary.

Dr-Noob commented 2 months ago

I think participating in the mailing list is the wrong thing to do for me. The issues should all be collected in one place (here). Spreading them across different places does not seem the most convinient approach for a developer. The issues page is always open for those who want to report any issue. In particular, I'm interested in knowing more about the sentence:

"I suspect upstream needs to generalise the way they deal with some of the pieces, happy to assist with comments upstream."

I'm afraid that if we want to keep the customized logo for each vendor and the manufacturing process this is going to be hard to accomplish.

From the machines you have shown, the first and the second one could be potentially fixed, for which I'd need the output of hexdump -C /proc/device-tree/compatible (I'm actively working on adding support for ARM servers before they are reported in the issues page but it's not my priority right now. I'd be happy to accept PR to improve the support for this). The third machine is already supported (#262, you would need to build from source) and the last one seems to be a VM, which I don't support.

ausil commented 2 months ago

I know none of those machines use device tree, they all use ACPI to define the hardware.

Dr-Noob commented 2 months ago

In such case, what do you think would be a non-root approach to uniquely identify the SoC?