ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Fix string corruption while reading the device folder #100

Closed: xw285cornell closed this 2 years ago

xw285cornell commented 2 years ago

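If we do not reset the string to empty before reading each device folder, leftover data from the previous read leaks into the next one. For example, with these symlinks: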

ls -sail /sys/class/drm/
card2 -> ../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:01.0/0000:13:00.0/0000:14:01.0/0000:18:00.0/0000:19:00.0/0000:1a:00.0/drm/card2
card3 -> ../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:02.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/drm/card3

In the code, the values read for card3 come out as:

path: /sys/class/drm/card3, tpath: ../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:02.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/drm/card30.0/0000:1a:00.0/drm/card2

So we're not resetting the string correctly: the card3 target is shorter than the card2 target, and the tail of the old card2 string is left over after it. As a result, both GPUs report the same bus:

rocm_smi.py --showbus
GPU[2]      : PCI Bus: 0000:1A:00.0
GPU[3]      : PCI Bus: 0000:1A:00.0

After the fix:

================================== PCI Bus ID ==================================
GPU[0]      : PCI Bus: 0000:12:00.0
GPU[1]      : PCI Bus: 0000:17:00.0
GPU[2]      : PCI Bus: 0000:1A:00.0
GPU[3]      : PCI Bus: 0000:1D:00.0
GPU[4]      : PCI Bus: 0000:89:00.0
GPU[5]      : PCI Bus: 0000:8E:00.0
GPU[6]      : PCI Bus: 0000:91:00.0
GPU[7]      : PCI Bus: 0000:94:00.0
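The corrupted tpath is what you would expect if the link target is copied into a buffer that is reused across iterations without a terminating NUL, which is how readlink(2) behaves. The library code itself is not quoted in this thread, so the following is only a rough Python simulation of that mechanism, not the actual rocm_smi_lib implementation. All names in it (`fake_readlink`, `c_string`, `read_targets`, `bus_id`, `BUF_SIZE`) are hypothetical. It also assumes the bus ID is taken from the path component just before the last `/drm/`.

```python
# Simulation only: models a reused C char buffer with a Python bytearray.
# All names here are hypothetical and do not come from rocm_smi_lib.
BUF_SIZE = 256

CARD2 = ("../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:01.0/"
         "0000:13:00.0/0000:14:01.0/0000:18:00.0/0000:19:00.0/0000:1a:00.0/drm/card2")
CARD3 = ("../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:02.0/"
         "0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/drm/card3")


def fake_readlink(target, buf):
    """Mimic readlink(2): copy the target bytes, append no NUL, return the length."""
    data = target.encode()[:len(buf)]
    buf[:len(data)] = data
    return len(data)


def c_string(buf):
    """Read the buffer the way C string functions do: up to the first NUL byte."""
    return bytes(buf).split(b"\0", 1)[0].decode()


def read_targets(targets, reset_each_time):
    """Read each target into one shared buffer, optionally clearing it first."""
    buf = bytearray(BUF_SIZE)  # starts zero-filled, like an initialized array
    results = []
    for target in targets:
        if reset_each_time:
            buf[:] = bytes(BUF_SIZE)  # the "reset the string to empty" step
        fake_readlink(target, buf)
        results.append(c_string(buf))
    return results


def bus_id(tpath):
    """Assumed parsing rule: the path component just before the last '/drm/'."""
    return tpath.rsplit("/drm/", 1)[0].rsplit("/", 1)[1].upper()


if __name__ == "__main__":
    for reset in (False, True):
        card2, card3 = read_targets([CARD2, CARD3], reset_each_time=reset)
        print("reset" if reset else "no reset", bus_id(card2), bus_id(card3))
```

Without the reset, the second read in this simulation leaves the last 26 bytes of the card2 target in the buffer, which reproduces the `card30.0/0000:1a:00.0/drm/card2` tail quoted above. Clearing the buffer before each read, or terminating the string at the length the read returns, avoids it.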

kentrussell commented 2 years ago

Thanks! We've pulled this patch internally, and it will be in the ROCm 5.2 release (and hopefully 5.1 as well).