Delaunay closed this issue 4 years ago
The issue actually lies in the VBIOS. For cards that are passively cooled or use server coolers, some VBIOSes report that there is an air-cooling fan, and the fan numbers you see are garbage: just whatever those registers happen to contain. Fan speed reporting and fan control are only valid on cards that have an air-cooling fan attached to the card itself, with the fan controller on the card as well. Otherwise, there is just junk data in there (and sysfs can't control coolers that are not connected to the card itself). I've found that our internal VBIOSes tend to say that there is air-cooling when there is actually none, so some of our partners may have fallen into that same situation.
To clarify, the cards are all passively cooled, right? If so, then the fan control and fan speed reporting won't be useful, unfortunately.
No, only the MI cards are passively cooled. The other values are from the Radeon VII cards, which are the standard retail cards with fans.
Oh ok, sorry that I misunderstood there. So we've got 2 parts. At least #1 I explained reasonably well.
In the other config, both cards are Radeon VIIs, and the 2nd one is just reporting the wrong fan speed. Is it actually cycling up to 100% (e.g. the fan gets super loud), or does it stay down? Is it limited to that card, or does it occur on the other card as well (and if so, is it intermittent, or only when the card is in a certain PCIe slot)?
What if you try to specify device #2 with "rocm-smi -d 2 --setfan 100%"? Does it still end up doing the same thing, where it accepts the value and does nothing? If the VBIOS has its own fan tuning (Sapphire does this), then it can sometimes override the kernel's fan settings, but I want to debug it a bit first to make sure we cover all of the standard avenues before just assuming that it's another case like that.
Is it actually cycling up to 100% (e.g. the fan gets super loud), or does it stay down?
Yes, I checked both are running at 100%.
Is it limited to that card, or does it occur on the other card as well
We only have one machine with Radeon VII cards, so I cannot tell, sorry.
What if you try to specify device #2 with "rocm-smi -d 2 --setfan 100%"
Still shows below 100% fan speed.
> rocm-smi -d 2 --setfan 100%
=====================ROCm System Management Interface========================
GPU[2] : Successfully set fan speed to Level 255
==============================End of ROCm SMI Log ==============================
> rocm-smi
========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf  PwrCap  SCLK OD  MCLK OD  GPU%
1    72.0c  219.0W  1802Mhz  1001Mhz  100.0%  auto  250.0W  0%       0%       100%
2    32.0c  20.0W   701Mhz   351Mhz   18.82%  auto  250.0W  0%       0%       0%
================================================================================
==============================End of ROCm SMI Log ==============================
Peculiar. The percentage reported there is just the result of 2 sysfs files: fan/fanmax (where fan is pwm1 and fanmax is pwm1_max). If you are setting the fan to 100% (255) and pwm1 is still reporting as 73 or some other low number, then that's a kernel bug with the hwmon reporting the wrong value.
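To make the calculation concrete, here is a minimal Python sketch of the fan/fanmax division described above. It is not the rocm-smi source. The function names are made up for illustration, and the `/sys/class/drm/card*/device/hwmon/hwmon*` location is an assumption about where the amdgpu hwmon nodes sit on a typical system, so adjust it if yours differs.

```python
from pathlib import Path


def fan_percent(pwm: int, pwm_max: int) -> float:
    """Return the fan speed as a percentage of the maximum PWM value."""
    if pwm_max <= 0:
        raise ValueError("pwm_max must be positive")
    return round(100.0 * pwm / pwm_max, 2)


def read_fan_percent(hwmon_dir: str) -> float:
    """Read pwm1 and pwm1_max from one hwmon directory and compute the percentage."""
    base = Path(hwmon_dir)
    pwm = int((base / "pwm1").read_text().strip())
    pwm_max = int((base / "pwm1_max").read_text().strip())
    return fan_percent(pwm, pwm_max)


if __name__ == "__main__":
    # Assumed location of the amdgpu hwmon nodes; card and hwmon indices vary.
    for hwmon in sorted(Path("/sys/class/drm").glob("card*/device/hwmon/hwmon*")):
        try:
            print(hwmon, f"{read_fan_percent(str(hwmon))}%")
        except (OSError, ValueError) as err:
            # Cards without a fan controller may not expose pwm1 at all.
            print(hwmon, "fan data unavailable:", err)
```

If pwm1_max is 255, a reading of 18.82% would correspond to a raw pwm1 of 48, and 10.98% to a raw pwm1 of 28. Comparing the raw pwm1 value right after the set command against those numbers shows whether the low percentage comes from the kernel's reported value or from somewhere later in the chain.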
Is it limited to just that card? If you remove the other RadeonVII card, does it show the same behaviour? If the card still runs at 100%, even though it's only reporting that it's running at 10%, then I don't think there's a hardware issue. After bumping it up to 100%, are there any powerplay-related messages in dmesg to note?
This should be addressed with an updated VBIOS. Please try that out and re-open this if the issue isn't resolved (I had to get the VBIOS guys to update some invalid fields).
On a system with 3 MI cards (i.e. fanless/passively cooled), I get the following output. I expect the 6.67% fan to be 0.00%.
The issue is also present with Radeon VII cards. When I manually set the fans to 100%, rocm-smi says the fan is only running at 10.98%.
Additionally, you can find the raw values from sysfs below. MI machine:
Radeon VII machine: