ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md
179 stars 55 forks source link

fixed a bug causing KeyError in setpoweroverdrive #63

Closed clamcy closed 5 years ago

clamcy commented 5 years ago

[Error Log] sudo rocm-smi --setpoweroverdrive 125 Traceback (most recent call last): File "/usr/bin/rocm-smi", line 1910, in setPowerOverDrive(deviceList, args.setpoweroverdrive, args.autorespond) File "/usr/bin/rocm-smi", line 1396, in setPowerOverDrive power_cap_path = getFilePath(device, 'power1_cap') File "/usr/bin/rocm-smi", line 130, in getFilePath pathDict = valuePaths[key] KeyError: 'power1_cap' [FIX] Line 1396,change power1_cap to power_cap fixed the error.

jlgreathouse commented 5 years ago

What kernel and/or driver version are you using that you are seeing this problem?

All of the following versions of the amdgpu drivers surface this mechanism as the sysfs entry power1_cap:

clamcy commented 5 years ago

I am using driver "amdgpu-pro-19.10-785425-ubuntu-18.04.tar.xz". Maybe ROCm driver and amdgpu-pro are two different things? My mistake.

jlgreathouse commented 5 years ago

Hi @devclam

It appears that our amdgpu-pro 19.10 drivers also still user power1_cap, so I'm still a bit confused where this problem is coming from.

Could you show me the output of the following command on your system?

for i in `ls -1 /sys/class/hwmon/`; do echo ${i}; cat /sys/class/hwmon/${i}/device/vendor; cat /sys/class/hwmon/${i}/device/device; ls /sys/class/hwmon/${i}; done
clamcy commented 5 years ago

Sorry I am now using this GPU with Windows, so cannot give you detailed output now. But I can explain why I change power1_cap to power_cap. From the following code:

# rocm-smi.py line 1396, in setPowerOverDrive
power_cap_path = getFilePath(device, 'power1_cap')

and

# line 123 - 130
def getFilePath(device, key):
    pathDict = valuePaths[key]

we can see that the program is looking up key in the valuePaths dictionary, the key is 'power1_cap'.

# line 81 - 106
valuePaths = {
    #...
    'power' : {'prefix' : hwmonprefix, 'filepath' : 'power1_average', 'needsparse' : True},
    'power_cap' : {'prefix' : hwmonprefix, 'filepath' : 'power1_cap', 'needsparse' : False},
    'power_cap_max' : {'prefix' : hwmonprefix, 'filepath' : 'power1_cap_max', 'needsparse' : False},
    'power_cap_min' : {'prefix' : hwmonprefix, 'filepath' : 'power1_cap_min', 'needsparse' : False}
    #...
}

The valuePaths dictionary has only power_cap key, not power1_cap key. Then I changed the code in line 1396 from power1_cap to power_cap, and the program seems to be working.

jlgreathouse commented 5 years ago

Ah! Thank you for pointing that out. I had not noticed that the mechanism for getting the location of the power1_cap file had changed between ROCm 2.2 and ROCm 2.3 -- I am still running 2.2 on my systems for various internal testing purposes.

You're absolutely right, we used to just hard-code the power1_cap location, but the code changed to use a new getFilePath mechanism that needs the properly key from the valuePaths table. I didn't notice this new mechanism, so I thought your actual file system entry had changed to power_cap. Thank you for pointing this out.

@kentrussell I think we should merge this patch internally for release at AMD's earliest convenience.

kentrussell commented 5 years ago

It's internal now, it'll make it for 2.5. Thanks, and sorry guys... sigh