flightlessmango / MangoHud

A Vulkan and OpenGL overlay for monitoring FPS, temperatures, CPU/GPU load and more. Discord: https://discordapp.com/invite/Gj5YmBb
MIT License
6.11k stars 263 forks source link

unit test failing on s390x (big endian) #1330

Closed maribu closed 1 month ago

maribu commented 1 month ago

Describe the bug

The unit test test_amdgpu_get_instant_metrics fails with

stderr:
[  ERROR   ] --- 0x4000 != 0x40
[   LINE   ] --- ../tests/test_amdgpu.cpp:35: error: Failure!

on s390x (which is big endian), but success on quite a bunch of other architectures (which are all little endian). Also 0x4000 would be the result when storing 0x40 as uint16_t on a little endian and loading it directly into memory on a big endian system.

List relevant hardware/software information

To Reproduce Steps to reproduce the behavior:

Expected behavior All unit tests succeed

Screenshots Not applicable

Additional context None

maribu commented 1 month ago

Note: It is pretty unlikely that an s390x is commonly found in a gaming setup :)

flightlessmango commented 1 month ago

This should be resolved here faa3b1c22f09bb1d6161dfc1d656046347b2420b

maribu commented 1 month ago

https://github.com/flightlessmango/MangoHud/commit/faa3b1c22f09bb1d6161dfc1d656046347b2420b is also applying le16toh() on float and bool variables, but only uint16_t variables can be converted using le16oh(). If sizeof(bool) == 1, boolwould not need to be converted, as there aren't two ways how to store one byte in memory.

C will implicitly convert an float to an uint16_t (e.g. in case of 6.161000 it will be rounded to 6) when passing it as argument to le16toh(), this output will be 6 on little endian systems, which finally is implicitly converted back to float as 6.000000. This results than in:

[  ERROR   ] --- 6.000000 != 6.161000
[   LINE   ] --- ../tests/test_amdgpu.cpp:80: error: Failure!

If the float indeed is big endian (which may not be the case on big endian systems, as the FPU could still work in little endian), you could do something along the lines of:

static inline float floattoh(float input)
{
    uint32_t tmp = le32toh(*((uint32_t *)&input));
    return *((float *)&tmp);
}

Btw: https://github.com/flightlessmango/MangoHud/commit/faa3b1c22f09bb1d6161dfc1d656046347b2420b is based on the assumption that only the test input needs to be converted to big endian, but on a live system the metrics would be in big endian. I'm unfamiliar with both MangoHud's code and how AMD GPUs work, but this does look odd to me. I'd assume that the metrics file would be written by the GPU, and so the GPU's endianess (which will be little endian) would matter. So the conversion should rather be in amdgpu_get_instant_metrics(), if my assumption is correct.

flightlessmango commented 1 month ago

Here's another attempt! dc7ec945497df202e2b0ebfae709b1d4bcd2a8ba hopefully I understood this better now

maribu commented 1 month ago

With that the test sadly still fails on s390x: https://gitlab.alpinelinux.org/maribu/aports/-/jobs/1394508. I think the

Btw: https://github.com/flightlessmango/MangoHud/commit/faa3b1c22f09bb1d6161dfc1d656046347b2420b is based on the assumption that only the test input needs to be converted to big endian, but on a live system the metrics would be in big endian.

I did dig a bit deeper to be sure. I understand the doc at https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html#gpu-metrics that the data is not passed through from the GPU directly, but rather the driver is getting the individual readings from the GPU. In this case it would be reasonable to assume that the data would indeed be in host endian, but the endianess seems not documented anywhere.

In the implementation of the kernel fetching the data boils down to a call to smu_cmn_send_smc_msg_with_param(), which does seem to fetch the metrics from the GPU in the byte order of the GPU.

I've asked in the Alpine IRC if anyone has access to an s390x machine with an AMD GPU and is willing to share an actual gpu_metrics sysfs file from that. With that available, it should be easy to confirm if the metrics are in host endian from the CPU point-of-view or in host ednian from the GPU point-of-view (so little endian).

maribu commented 1 month ago

The guys at the Alpine IRC suggest to revert the fix and close this issue as won't fix, as s390x mainframes used as gaming setup is so unlikely the effort is not worth it.

It seems that even PowerPC is using little endian these days because of GPU driver compatibility :) (And in fact, Alpine's 64 bit PowerPC port is little-endian.)

flightlessmango commented 1 month ago

Alright! The changes have been reverted