ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

invoking `rocm-smi --setprofile X` on RX580 glitches rocm-smi #67

Closed hairetikos closed 3 years ago

hairetikos commented 5 years ago

After setting the profile to compute mode with --setprofile X, rocm-smi applies those settings very slowly. Then every other invocation of rocm-smi becomes slow and there are some errors.

(i.e. invoking rocm-smi alone takes about 1 minute after showing "ROCm System Management Interface ..." before displaying values, with 4 rows of them wrong)

========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr   SCLK     MCLK    Fan     Perf    PwrCap  SCLK OD  MCLK OD  GPU%  
0    36.0c  N/A      300Mhz   300Mhz  71.76%  manual  90.0W   0%       0%       0%    
2    37.0c  N/A      300Mhz   300Mhz  71.76%  manual  90.0W   0%       0%       0%    
3    39.0c  N/A      300Mhz   300Mhz  71.76%  manual  90.0W   0%       0%       0%    
4    39.0c  N/A      300Mhz   300Mhz  71.76%  manual  90.0W   0%       0%       0%    
5    35.0c  43.12W   1266Mhz  300Mhz  71.76%  manual  90.0W   0%       0%       0%    
6    30.0c  36.148W  1145Mhz  300Mhz  71.76%  manual  90.0W   0%       0%       0%    
7    36.0c  38.252W  1162Mhz  300Mhz  71.76%  manual  90.0W   0%       0%       0%    
8    33.0c  38.248W  1154Mhz  300Mhz  71.76%  manual  90.0W   0%       0%       0%    
================================================================================
==============================End of ROCm SMI Log ==============================

Ubuntu 18.04, latest amdgpu-pro, 8x RX580

rigred commented 5 years ago

Mine does not even work. There is a really basic type conversion error instead (screenshot omitted).

Line 368 in rocm_smi.py version 2.3.0 should read:

367:            return False
368:        if int(profile) > int(maxProfileLevel):
369:            printLog(device, 'Unable to set profile to level' + str(profile))
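A minimal sketch of the mismatch rigred points out (the function name is illustrative, not the actual rocm_smi.py code): CLI arguments arrive as strings, and in Python 3 a str cannot be ordered against an int, so both operands need explicit casts.

```python
# Sketch of the type mismatch described above. Without the int() casts,
# a comparison like "2" > 7 raises:
#   TypeError: '>' not supported between instances of 'str' and 'int'
def profile_level_ok(profile, max_profile_level):
    if int(profile) > int(max_profile_level):
        return False  # requested level exceeds the maximum profile level
    return True
```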
hairetikos commented 5 years ago

Just to clarify -- everything still seems to apply correctly: compute mode is activated and the cards work fine in that mode. It's just that every rocm-smi invocation after this setting is half-bodged and very slow (regardless of whether the cards are running tasks or not).

hairetikos commented 5 years ago

This machine is using a 1x to 4x PCIe adapter switch to fit in extra GPUs (PCIE-EUX1-04 VER.002).

Not sure if it could be contributing to the issue.

Ubuntu 18.04.2 bare metal (ASUS/PowerColor/XFX RX580 GPUs).

hairetikos commented 5 years ago

Just checked -- dmesg shows messages like this after rocm-smi freezes and goes slow before proceeding:

[ 7473.794325] amdgpu: [powerplay] 
                last message was failed ret is 0
[ 7474.229309] amdgpu: [powerplay] 
                failed to send message 171 ret is 0 
[ 7474.665543] amdgpu: [powerplay] 
                last message was failed ret is 0
[ 7475.100603] amdgpu: [powerplay] 
                failed to send message 171 ret is 0 
[ 7475.536666] amdgpu: [powerplay] 
                last message was failed ret is 0
[ 7475.971775] amdgpu: [powerplay] 
hairetikos commented 5 years ago

Got some additional info:

rocm-smi works fine with GPUs 4 5 6 7 after the --setprofile invocation; the problem is limited to GPUs 1 2 3 (only 7 GPUs are in this system now).

Reading the power from GPUs 1 2 3 directly via sysfs (cat power1_average) causes the temporary freeze; reading the temperatures seems fine.

I have a hunch it could be the PCIE-EUX1-04 VER.002 card causing a strange setup.
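The hanging sysfs read can be probed outside of rocm-smi. A rough sketch (the hwmon paths assume the standard amdgpu sysfs layout, and the timeout value is arbitrary) that reads each power1_average file in a worker thread, so a stuck read is reported instead of blocking the whole tool:

```python
import concurrent.futures

def read_power(path):
    # power1_average reports microwatts; convert to watts.
    with open(path) as f:
        return int(f.read()) / 1_000_000

def probe_all(paths, timeout=5.0):
    # Run each read in a worker thread so one hung sysfs file
    # (as seen on GPUs 1-3 above) does not stall the others.
    pool = concurrent.futures.ThreadPoolExecutor()
    results = {}
    for path in paths:
        fut = pool.submit(read_power, path)
        try:
            results[path] = fut.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            results[path] = None  # read hung -- matches the freeze reported
        except OSError:
            results[path] = None  # file missing or unreadable
    pool.shutdown(wait=False)  # don't block on any still-hung reads
    return results
```

On this system the list of paths would come from something like /sys/class/hwmon/hwmon*/power1_average.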

kentrussell commented 5 years ago

@rigred , getMaxLevel should return an int, which is why that conversion issue arose. I've got a fix for this coming in 2.5. Your workaround is a good compromise until that fix comes.

@hairetikos , I am wondering if it's the bridge as well, since it's only on GPUs 1/2/3. For 4/5/6/7 you can read the power correctly, and it's only the 1/2/3 bridge that fails? I'd want to do some HW testing to tinker a bit. Things to test:

1. Swap 4/5/6/7 and 0/1/2/3 to see if the issue stays with the bridge or with the GPUs. If it's with the bridge, try just 1 GPU in the bridge, then 2 GPUs, etc. If it only happens when 4 GPUs are in one bridge, that's something. It could also be a combination of the bus + bridge, so if 1/2 works and 3/4 doesn't, that's also useful to know.
2. Swap PCIe buses for the bridges, as it could be something with the PCIe bus that the 0-4 bridge is in. If the issue stays with the bus, try a single GPU in that PCIe bus instead of using the bridge. The bus could be faulty, or it might just have issues handling the bridge.
3. Try removing 4/5/6/7 and leaving the 1/2/3 bridge in there. If it works, then it could be something like the PSU having issues with 7 GPUs, or the PCIe bus not handling bridges on 2 buses at a time.

It's a lot of swapping work, but it will definitely help to isolate the issue. If we can determine if it's a hardware thing, then we're golden. If we can't find anything conclusive from the HW swapping, that still gives us information with which we can keep investigating. Good luck!
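Before starting the swaps, it can help to confirm which cards actually sit behind the splitter. A small sketch (assuming the usual /sys/class/drm layout on Linux and fewer than 10 cards; this is not part of rocm-smi) that prints each card's chain of PCI hops, so GPUs sharing a bridge can be grouped:

```python
import glob
import os

def pci_hops(resolved_path):
    # A resolved sysfs device path looks like
    # /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 -- each
    # bus:device.function component is one hop (root port, bridges, GPU).
    return [part for part in resolved_path.split('/')
            if ':' in part and '.' in part]

def card_topology():
    topo = {}
    # card[0-9] matches card0..card9 but not connector nodes like card0-DP-1.
    for card in sorted(glob.glob('/sys/class/drm/card[0-9]')):
        real = os.path.realpath(os.path.join(card, 'device'))
        topo[os.path.basename(card)] = pci_hops(real)
    return topo

if __name__ == '__main__':
    for card, hops in card_topology().items():
        print(card, '->', ' / '.join(hops))
```

Cards whose hop lists share a common bridge address are behind the same switch, which makes it easy to verify which slot the PCIE-EUX1-04 occupies before and after each swap.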

kentrussell commented 5 years ago

2.5 has the type fix, the bridge issue is still something we need to look at. Any update?

hairetikos commented 5 years ago

I don't have the rig with the RX580s to test this on, unfortunately, but I still have the PCIe splitter/switch and will be testing it with 4 Radeon VIIs soon.

If I cannot reproduce the issue with the Radeon VIIs, then I have 2 RX580s I can try with it to reproduce the issue.

I don't think the PSU is the issue -- it's a 1.8kW GameMax mining PSU, and the RX580s are power limited to 90W each.

kentrussell commented 5 years ago

@hairetikos Any luck with 2.6?

hairetikos commented 5 years ago

@kentrussell unfortunately I've not got the RX580s to test anymore, and I'm not using the PCIe splitter/bridge with the Radeon VIIs as I think it was causing other issues.

I'm happy to test anything else with the Radeon VIIs, just not with the bridge/splitter.

kentrussell commented 3 years ago

Sorry for the delay, this was resolved in ROCm 3.7 in the kernel. If you have any issues, please open a new issue at https://github.com/RadeonOpenCompute/rocm_smi_lib, as this repo will be deprecated and all SMI CLI functionality has moved over there. Thank you!