Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

RX7900 (gfx11) Cards Fan Control is Not Functional #140

Open PorcelainMouse opened 1 year ago

PorcelainMouse commented 1 year ago

I don't actually think this is a problem with the gpu-utils code, but I'm not sure. On my 7900 card, gpu-ls and gpu-mon seem to work fine, but gpu-pac doesn't detect any writable cards, even with the feature mask set correctly. Also, I cannot manually change the /sys/.../pwm1_enable value by writing to it; the write succeeds, but the value is unchanged.

I see lots of complaints about this going all the way back to launch date, and I think it's weird that it is still broken, considering how long it's been. Fan control seems like a core hardware capability, not just some nice-to-have software feature. The /sys interface has been really stable for a long time, and all those bits seem to be there for this card, so it seems very odd that it isn't functional.
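The "write succeeds, but the value is unchanged" symptom can be checked with a read-back after the write. Below is a minimal Python sketch of that check; `write_and_verify` is a hypothetical helper name, not part of gpu-utils, and the path is left elided exactly as in the report above.

```python
from pathlib import Path


def write_and_verify(path: Path, value: str) -> bool:
    """Write value to a sysfs-style attribute and report whether it stuck."""
    path.write_text(value + "\n")
    # A sysfs write can return success while the driver ignores the request,
    # so the only reliable check is to read the attribute back.
    return path.read_text().strip() == value


# Usage (as root, with the real hwmon path of the card filled in):
#   write_and_verify(Path("/sys/.../pwm1_enable"), "1")
# In the hwmon convention, 1 requests manual fan control.
```

A `False` result with no exception raised would match the behavior described here: the kernel accepted the write, but the attribute kept its old value.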

Ricks-Lab commented 1 year ago

Can you provide the output of gpu-ls --raw?

PorcelainMouse commented 1 year ago

Thank you! Yes, but later. So sorry. Card was horribly unstable under load; I'm very worried. I'm hoping it's just too hot due to lack of fan, so I had to remove the card. Will get another chance to test next weekend.

Ricks-Lab commented 1 year ago

No worries. I found plenty to work on from the results for my RX6600.

PorcelainMouse commented 1 year ago

Do you want that with or without feature mask set as described by gpu-pac?
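For anyone checking whether the mask is actually in effect, here is a small Python sketch. It assumes "feature mask" means the amdgpu `ppfeaturemask` module parameter, normally readable at `/sys/module/amdgpu/parameters/ppfeaturemask`, and that the bit of interest is the overdrive bit (0x4000). The function names are illustrative only.

```python
PP_OVERDRIVE_MASK = 0x4000  # overdrive bit in amdgpu's ppfeaturemask


def parse_mask(text: str) -> int:
    """Parse a mask printed either as hex (0x...) or as plain decimal."""
    return int(text.strip(), 0)


def overdrive_enabled(mask: int) -> bool:
    """True if the overdrive bit is set in the given feature mask."""
    return bool(mask & PP_OVERDRIVE_MASK)


# Usage, assuming the parameter file exists on the running kernel:
#   from pathlib import Path
#   text = Path("/sys/module/amdgpu/parameters/ppfeaturemask").read_text()
#   print(overdrive_enabled(parse_mask(text)))
```

This only reports what the loaded module was given; it does not say whether the driver exposes overclocking controls for a particular card.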

Ricks-Lab commented 1 year ago

I think it needs to be with the feature mask set. I need the information after this header in the raw output:

### File: pp_od_clk_voltage, SensorKey: pp_od_clk_voltage, Label: read/write driver file

Ricks-Lab commented 1 year ago

I have been making changes to better support GPUs with a Voltage Offset setting. I know this is available in RX66xx. The requested output above should confirm whether RX77xx has this capability also. Incompletely tested code can be found in the new branch: pp_feature_refine

PorcelainMouse commented 1 year ago

OMG, finally got my card back from RMA...no change, symptoms exactly as before. I don't know what I'm going to do.

But, I got the data you requested. I really hope this helps. gpu-ls-raw-mask.txt

Since last we spoke, there's been loads of new kernels, and even a new OpenCL lib for my distro that was supposed to make things better. But, everything looks the same, AFAICT.

Ricks-Lab commented 1 year ago

The pp_od_clk_voltage file appears empty. This could indicate that the driver doesn't support overclocking. Are you running the latest version from the repository? I suggest cloning the latest from GitHub and explicitly running that version. I did make some changes to support the latest GPUs, but I still don't have an RX7900 to try. What other issues did you have that caused you to RMA it?
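To make the "appears empty" check repeatable on a saved raw dump, here is a rough Python sketch. It assumes each section of the raw output starts with a `### File:` header like the one quoted earlier and runs until the next such header; only the header format comes from this thread, and `split_sections` is a hypothetical helper, not a gpu-utils function.

```python
from typing import Dict, List, Optional


def split_sections(raw: str) -> Dict[str, List[str]]:
    """Group non-blank lines under the file name from each '### File:' header."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in raw.splitlines():
        if line.startswith("### File:"):
            # Header looks like: ### File: NAME, SensorKey: ..., Label: ...
            current = line[len("### File:"):].split(",", 1)[0].strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    return sections


# Usage on a saved dump (file name taken from the attachment above):
#   text = open("gpu-ls-raw-mask.txt").read()
#   print(split_sections(text).get("pp_od_clk_voltage"))
```

An empty list for `pp_od_clk_voltage` would correspond to the empty file described here.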

PorcelainMouse commented 1 year ago

You mean the latest AMDGPU? I'm not quite sure what you mean. I'm using the latest available for my distribution, which uses a very recent kernel, more recent than most distros. The only way I know to get a different one is to build it myself from Torvalds' upstream. I compiled many kernels back in the day, but it's been about 20 years since then; I don't relish the idea. I assume I can build it, but then I have to hack my distro's kernel management, which is yet another hurdle. I'm sure I could figure it out, but I'm avoiding it.

I see some version information, but it's hard to interpret, because the internal versions differ from the version numbers AMD reports, and I don't know how to match them up. When I first reported this issue, the internal version was 58.86.0; now it's 78.75.0, FWIW. Sure, there is clearly a newer version of MESA, although I don't think that is related. But AFAIK, the newest AMDGPU is in Torvalds' copy, and I'm relatively close to that. I cannot install AMDGPU-PRO; not that I want to, but it's not even an option.

My system is very unstable. It has frequent crashes while crunching with OpenCL. However, this is not unfamiliar, as I had similar behavior with the 5700XT before I got fan control working, and with the 6800XT when I forget to start fan control. If edge temps exceed 52 deg, I see increased computation errors; any higher and I get crashes. So this is consistent with prior experience, and I currently don't have any evidence that this is a different issue related to the RDNA3 arch, 7900 series hardware specifically, or software. Perhaps the situation with the 7900xtx is worse; it seems to happen in 20-60 minutes, whereas my 5700xt only crashed that frequently after several months of running hot without fan control.
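A simple way to log when the edge temperature crosses that threshold is sketched below in Python. It assumes the reading comes from a hwmon `temp1_input` file, which reports millidegrees Celsius, and that "52 deg" means 52 °C. The threshold is the one reported in this comment, not a vendor limit, and the names are illustrative.

```python
EDGE_LIMIT_C = 52.0  # threshold reported in this thread, not a vendor spec


def millidegrees_to_celsius(raw: str) -> float:
    """Convert a hwmon temp*_input reading (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def too_hot(raw: str, limit_c: float = EDGE_LIMIT_C) -> bool:
    """True if the reading is strictly above the limit."""
    return millidegrees_to_celsius(raw) > limit_c


# Usage, with the card's own hwmon path filled in:
#   from pathlib import Path
#   print(too_hot(Path("/sys/.../temp1_input").read_text()))
```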

PorcelainMouse commented 1 year ago

Actually, I take that back. There is a newer version of x11-amdgpu, but I'm using wayland. MESA is current (23.1).