Closed hairetikos closed 3 years ago
Mine does not even work. There is a really basic type conversion error instead
Line 368 in rocm_smi.py version 2.3.0 should read:
367: return False
368: if int(profile) > int(maxProfileLevel):
369: printLog(device, 'Unable to set profile to level' + str(profile))
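As a rough illustration of why the `int()` casts on line 368 matter, here is a minimal standalone sketch. It assumes Python 3 and that either value may arrive as a string. The helper name `profile_exceeds_max` is hypothetical and is not part of rocm_smi.py.

```python
def profile_exceeds_max(profile, max_profile_level):
    """Return True if the requested profile level is above the maximum.

    Hypothetical helper: both arguments may be strings (for example from
    the command line or a sysfs file), so both are cast to int first.
    """
    return int(profile) > int(max_profile_level)


# In Python 3, comparing a str with an int raises TypeError.
try:
    "5" > 4
except TypeError as exc:
    print("without int():", exc)

# Comparing two strings is lexicographic, which is silently wrong for numbers.
print("10" > "9")                       # False
print(profile_exceeds_max("10", "9"))   # True
print(profile_exceeds_max("3", 4))      # False
```

Casting both sides keeps the comparison numeric regardless of which side was read as text.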
just to clarify -- everything still seems to apply correctly, the compute mode is activated and the cards work fine in that mode ... it's just that every rocm-smi invocation after this setting is half-bodged and very slow. (regardless of whether the cards are running tasks or not)
this machine is using a 1x to 4x PCIe adapter switch to fit in extra GPUs (PCIE-EUX1-04 VER.002)
not sure if it could be contributing to the issue
ubuntu 18.04.2 bare metal (ASUS/PowerColor/XFX RX580 GPU):
amdgpu.vm_fragment_size=9 amdgpu.ppfeaturemask=0xffffffff
--opencl=legacy,pal --headless
just checked, dmesg is showing stuff like this after the rocm-smi freezes and goes slow before proceeding
[ 7473.794325] amdgpu: [powerplay]
last message was failed ret is 0
[ 7474.229309] amdgpu: [powerplay]
failed to send message 171 ret is 0
[ 7474.665543] amdgpu: [powerplay]
last message was failed ret is 0
[ 7475.100603] amdgpu: [powerplay]
failed to send message 171 ret is 0
[ 7475.536666] amdgpu: [powerplay]
last message was failed ret is 0
[ 7475.971775] amdgpu: [powerplay]
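To see which powerplay message IDs fail and how often, saved dmesg output like the excerpt above could be summarised with a small script. This is only a sketch: it assumes the log has been captured as text, and the function name `count_failed_messages` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "failed to send message 171 ret is 0".
FAILED_SEND = re.compile(r"failed to send message (\d+) ret is (\d+)")


def count_failed_messages(dmesg_text):
    """Count 'failed to send message' entries per message ID (as strings)."""
    return Counter(m.group(1) for m in FAILED_SEND.finditer(dmesg_text))


if __name__ == "__main__":
    # Sample taken from the excerpt in this thread.
    sample = """\
[ 7474.229309] amdgpu: [powerplay]
failed to send message 171 ret is 0
[ 7474.665543] amdgpu: [powerplay]
last message was failed ret is 0
[ 7475.100603] amdgpu: [powerplay]
failed to send message 171 ret is 0
"""
    for msg_id, count in count_failed_messages(sample).items():
        print(f"message {msg_id}: {count} failed sends")
```

On a live system the text could come from something like `dmesg > dmesg.txt` and then be read from that file.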
got some additional info:
rocm-smi works fine with GPUs 4 5 6 7 after the --setprofile invocation
the problem is limited to GPU 1 2 3 (only 7 GPUs are in this system now)
reading the power from GPU 1 2 3 directly via sysfs (cat power1_average) causes the temporary freeze; reading the temperatures seems to be fine
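One way to narrow down which cards stall is to time each sysfs read separately. The sketch below assumes the amdgpu hwmon files live under `/sys/class/drm/card*/device/hwmon/hwmon*/`. That layout is an assumption and may differ on your system, and `time_reads` is a hypothetical helper.

```python
import glob
import time


def time_reads(pattern):
    """Read each file matching pattern once.

    Returns {path: (seconds_taken, value_or_error_string)}.
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        start = time.monotonic()
        try:
            with open(path) as f:
                value = f.read().strip()
        except OSError as exc:
            value = f"error: {exc}"
        results[path] = (time.monotonic() - start, value)
    return results


if __name__ == "__main__":
    # Assumed path layout for amdgpu hwmon files; adjust as needed.
    pattern = "/sys/class/drm/card*/device/hwmon/hwmon*/power1_average"
    for path, (seconds, value) in time_reads(pattern).items():
        print(f"{seconds:8.3f}s  {path}  {value}")
```

Running it once with the `power1_average` pattern and once with a temperature file such as `temp1_input` would show whether only the power reads are slow on the affected cards.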
i have a hunch it could be the PCIE-EUX1-04 VER.002 card causing a strange setup
@rigred , getMaxLevel should return an int, which is why that conversion issue arose. I've got a fix for this coming in 2.5. Your workaround is a good compromise until that fix comes.
@hairetikos , I am wondering if it's the bridge as well, since it's only on GPUs 1/2/3. For 4/5/6/7, can you read the power correctly, so that it only fails on the 1/2/3 bridge? I'd want to do some HW testing to tinker a bit. Things to test:
1- Swap 4/5/6/7 and 0/1/2/3 to see if the issue stays with the bridge or with the GPUs. If it's with the bridge, try just 1 GPU in the bridge, then 2 GPUs, etc. If it's only when 4 GPUs are in one bridge, that's something. Also, it could be a combination of the bus+bridge, so if 1/2 works and 3/4 doesn't, that's also useful to know.
2- Swap PCIe buses for the bridges, as it could be something with the PCIe bus that the 0-4 bridge is in. If it stays with the bus, try a single GPU in that PCIe bus instead of using the bridge and see. It could be faulty, or it might just have issues handling the bridge.
3- Try removing 4/5/6/7 and leaving the 1/2/3 bridge in there. If it works, then it could be either the PSU having issues with 7 GPUs, or the PCIe bus not handling bridges running on 2 buses at a time.
It's a lot of swapping work, but it will definitely help to isolate the issue. If we can determine if it's a hardware thing, then we're golden. If we can't find anything conclusive from the HW swapping, that still gives us information with which we can keep investigating. Good luck!
2.5 has the type fix; the bridge issue is still something we need to look at. Any update?
i don't have the rig with the RX580s to test this on unfortunately but i still have the PCIe splitter/switch and will be testing it with 4 Radeon VIIs soon
if i cannot reproduce the issue with the R7s then i have 2 RX580s i can try with it to reproduce the issue
i don't think the PSU is the issue, 1.8kW GameMax mining PSU and the RX580s power limited to 90W each
@hairetikos Any luck with 2.6?
@kentrussell unfortunately i've not got the RX580s to test anymore and the PCIe splitter/bridge im not using with the Radeon VIIs as i think it was causing other issues
i'm happy to test anything else with the radeon VIIs just not with the bridge/splitter
Sorry for the delay, this was resolved in ROCm 3.7 in the kernel. If you have any issues, please open a new issue at https://github.com/RadeonOpenCompute/rocm_smi_lib, as this repo will be deprecated and all SMI CLI functionality has moved over there. Thank you!
after setting the profile to compute mode with --setprofile X, rocm-smi applies those settings very slowly; then every other invocation of rocm-smi becomes slow and there are some errors
(i.e. invoking rocm-smi alone takes about 1 minute after showing "ROCm System Management Interface ..." before displaying values, 4 rows of them wrong)
Ubuntu 18.04, latest amdgpu-pro, 8x RX580