Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0
137 stars 23 forks

Fan Setting/Reading Issues #12

Closed Ricks-Lab closed 5 years ago

Ricks-Lab commented 5 years ago

UPDATE: possible bug. I've discovered that each time PAC is saved for a card, the fan PMW decreases 3%. If the fan PMW field is left blank or entered with a non-valid character, then the 3% decrease is from the last saved setting. There are some PMW % values that become set as entered: 0 (!), 20, 40, 60, 80, 100. This explains all the previous "odd" behavior I've seen with PAC, so it's been there all along; it just took me a while to figure it out (dang, sorry). So, yeah, some warning needs to be inserted that an entry of zero means the fan will be shut off, with possible damage to the card, or make zero a non-valid entry (though I suppose some folks may want to shut off their fan??). And that 3% decrement is a bit quirky and confusing, either for amdgpu or for amdgpu-pac.
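One way to address the zero-entry hazard described here is to validate the field before anything is written to the card. A minimal sketch, assuming a plain 0-100 integer entry; the function name and warning wording are hypothetical, not amdgpu-pac's actual code:

```python
from typing import Optional

# Hypothetical validation sketch for a fan-speed entry field; the function
# name and warning text are illustrative, not amdgpu-pac's actual code.
def validate_fan_percent(entry: str) -> Optional[int]:
    """Return a fan percentage in 0-100, or None if the entry is invalid."""
    try:
        percent = int(entry)
    except ValueError:
        return None  # blank or non-numeric entry: keep the current setting
    if not 0 <= percent <= 100:
        return None  # out of range
    if percent == 0:
        # 0% shuts the fan off entirely, risking damage to the card.
        print("WARNING: 0% stops the fan; the GPU may overheat.")
    return percent
```

Rejecting invalid input outright, rather than silently reusing the last saved value, would also remove one source of the "odd" decrement behavior described above.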

Originally posted by @csecht in https://github.com/Ricks-Lab/amdgpu-utils/issues/10#issuecomment-471977633

Ricks-Lab commented 5 years ago

@csecht I just checked the math and it looks right. I did make the assumption that all GPUs would use the PWM range from 0 to 255. Can you verify this for your card? Determine the HWMON directory using amdgpu-ls, then cat the files pwm1_min and pwm1_max.

csecht commented 5 years ago

Correct, for both cards pwm1_min is 0 and pwm1_max is 255. So why does PAC decrement PMW 3%? From amdgpu-ls, the RPM range for card1 (rx460) is 0, 6000 and for card0 (rx570) is 0, 3800, if that’s of any relevance.
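For reference, the percent-to-raw conversion under discussion is a simple linear map over the sysfs range. A sketch assuming pwm1_min = 0 and pwm1_max = 255, not amdgpu-pac's actual code:

```python
# Sketch of the percent <-> raw PWM conversion under discussion, assuming
# the usual sysfs range: pwm1_min = 0, pwm1_max = 255.
PWM_MIN, PWM_MAX = 0, 255

def percent_to_pwm(percent: float) -> int:
    """Map a 0-100% fan setting onto the raw 0-255 pwm1 scale."""
    return round(PWM_MIN + (percent / 100.0) * (PWM_MAX - PWM_MIN))

def pwm_to_percent(raw: int) -> float:
    """Map a raw pwm1 value back to a percentage."""
    return 100.0 * (raw - PWM_MIN) / (PWM_MAX - PWM_MIN)

# 40% and 60% map exactly to the raw values 102 and 153 reported
# later in this thread.
```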


Ricks-Lab commented 5 years ago

Part of the issue could be that pac shows the actual fan speed instead of the setting. I will check if I can get the setting instead. Since there was no delay between the pac display update and writing the settings, the fans were still slowing down, so if you hit refresh a bunch of times you could see the setting change. I added a 500 ms wait after writing settings to minimize this effect. But even now, the actual fan speed can be very different from what you specify. I will continue to investigate this. The reset command just changes it to manual mode. The latest on master has the 500 ms delay and basic function for the Radeon VII.
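The write-then-wait behavior described here can be sketched as below. The hwmon path is a placeholder (the real directory comes from amdgpu-ls), and this is an illustration of the approach, not PAC's actual implementation:

```python
# Sketch of "write the setting, wait, then read the fan back", mirroring
# the 500 ms settle delay described above. HWMON is a placeholder path;
# the real directory is found with amdgpu-ls.
import time
from pathlib import Path

HWMON = Path("/sys/class/hwmon/hwmon0")  # placeholder, varies per card

def set_fan_pwm(raw: int, settle: float = 0.5) -> int:
    """Write a raw pwm1 value, wait for the fan to settle, read fan1_input."""
    (HWMON / "pwm1_enable").write_text("1")   # 1 = manual fan control
    (HWMON / "pwm1").write_text(str(raw))     # raw value on the 0-255 scale
    time.sleep(settle)                        # let the fan spin up/down
    return int((HWMON / "fan1_input").read_text())  # actual RPM
```

Even with the delay, the readback reports what the fan is actually doing, not the stored setting, which is why the displayed speed can still differ from what was entered.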

csecht commented 5 years ago

I downloaded the latest master. I'm not sure there's any difference in how fans are set. I've done some more exploring into actual pmw values from amdgpu-ls, instead of just looking at % speed in the monitor. What I see is that with each PAC Save the pmw decrements by a max value of 8. As the value gets closer to a stable setting (equivalent to 0, 20, 40, 60, 80, 100%), the decrement of the pmw setting becomes smaller until it hits a stable value. I only tested two stable points: pmw=102=40%, pmw=153=60% (same for both cards). PAC 'Save' no longer decrements pmw once one of these stable settings is reached. The observed 3% decrements (sometimes 2%) are a rounding or binning issue; for example, enter 48% in PAC and Save, then the set value is pmw=122=45%, Save -> pmw=114=42%, Save -> pmw=107=40%, Save -> pmw=102=40%, Save -> pmw=102=40%, etc...

csecht commented 5 years ago

CORRECTION: First, I got all dyslexic and wrote pmw instead of pwm, sorry. The other error in my previous post is that I didn't get pwm values from amdgpu-ls, but from the amdgpu-pac --execute_pac terminal window as it printed the shell script on execution.

Ricks-Lab commented 5 years ago

I have checked it out here and it looks like the pwm values written to the card are correctly converted from the percentage value entered into the interface, but the resultant fan speed is different from what is specified. I will continue to investigate and research other implementations like rocm-smi.

csecht commented 5 years ago

UPDATE: I tested fan settings and readings with rocm-smi and it too does what amdgpu-pac does: it reports the fan speed (level) lower than what is set. A slight difference is that rocm-smi shows a 5-unit decrement in the fan level (on the 0 to 255 scale), where amdgpu-pac & -monitor show an 8-unit decrement. Both programs have the same stable points where the reported and set speeds agree (40% in the example below). I was running amdgpu-monitor during this run through rocm-smi commands, and it showed the same integer values as rocm-smi for % fan speed (e.g. 47% for rocm-smi's 47.84%).

Trimmed terminal output from a sequence of rocm-smi commands:

$ ./rocm-smi -d 1 --setfan 50%
GPU[1] : Successfully set fan speed to Level 127

$ ./rocm-smi -d 1 --showfan
GPU[1] : Fan Level: 122 (47.84)%

$ ./rocm-smi -d 1 --setfan 122
GPU[1] : Successfully set fan speed to Level 122

$ ./rocm-smi -d 1 --showfan
GPU[1] : Fan Level: 117 (45.88)%

$ ./rocm-smi -d 1 --setfan 40%
GPU[1] : Successfully set fan speed to Level 102

$ ./rocm-smi -d 1 --showfan
GPU[1] : Fan Level: 102 (40.0)%
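The percentages in this transcript are consistent with the fan "Level" being on a 0-255 scale. A quick check, assuming that scale:

```python
# Sanity check of the transcript above, assuming rocm-smi's fan "Level"
# is on a 0-255 scale (pwm1_min = 0, pwm1_max = 255).
def level_to_percent(level: int) -> float:
    """Convert a raw fan level to the percentage rocm-smi prints."""
    return round(level / 255 * 100, 2)

# 122 -> 47.84, 117 -> 45.88, 102 -> 40.0, matching the output above.
```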

Ricks-Lab commented 5 years ago

Seems to be a driver limitation. We have covered the anomaly in the user guide.

Will revisit if behavior changes in newer driver or kernel releases.