Closed xw285cornell closed 2 years ago
If we do not reset the string to empty, this is going to happen:
    ls -sail /sys/class/drm/
    card2 -> ../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:01.0/0000:13:00.0/0000:14:01.0/0000:18:00.0/0000:19:00.0/0000:1a:00.0/drm/card2
    card3 -> ../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:02.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/drm/card3
In the code:
    path: /sys/class/drm/card3
    tpath: ../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:02.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/drm/card30.0/0000:1a:00.0/drm/card2
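So we're not resetting the string between devices. The card3 target is shorter than the card2 target, so the tail of the old card2 string is left over at the end of tpath. Then it's going to show the same bus for both GPUs: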
    rocm_smi.py --showbus
    GPU[2] : PCI Bus: 0000:1A:00.0
    GPU[3] : PCI Bus: 0000:1A:00.0
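To make the failure mode concrete, here is a minimal Python simulation of it. It is a sketch under assumptions, not the library code: it assumes the link target is copied into a reused fixed-size buffer without a terminator (as a POSIX `readlink()` call would do) and that the bus ID is taken from the path component just before `/drm`. The names `read_target_into`, `as_c_string`, and `bus_id` are hypothetical helpers invented for this illustration.

```python
# Simulation only: models a reused C-style buffer with a Python bytearray.
CARD2 = (b"../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:01.0/"
         b"0000:13:00.0/0000:14:01.0/0000:18:00.0/0000:19:00.0/0000:1a:00.0/drm/card2")
CARD3 = (b"../../devices/pci0000:0d/0000:0d:00.0/0000:0e:00.0/0000:0f:02.0/"
         b"0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/drm/card3")


def read_target_into(buf: bytearray, target: bytes) -> None:
    """Overwrite the start of buf with target; no terminator is written."""
    buf[:len(target)] = target


def as_c_string(buf: bytearray) -> str:
    """Read buf up to the first NUL byte, the way C string code would."""
    end = buf.find(b"\x00")
    return bytes(buf if end < 0 else buf[:end]).decode()


def bus_id(tpath: str) -> str:
    """Assumed parsing: the component right before 'drm/cardN'."""
    return tpath.split("/")[-3]


# Buggy loop: the same buffer is reused without being cleared.
buf = bytearray(256)
read_target_into(buf, CARD2)
read_target_into(buf, CARD3)
stale = as_c_string(buf)      # card3 target followed by the tail of card2

# Fixed loop: reset the buffer before every read.
buf = bytearray(256)
read_target_into(buf, CARD2)
buf[:] = bytes(len(buf))      # the reset that was missing
read_target_into(buf, CARD3)
fixed = as_c_string(buf)

print(bus_id(stale), bus_id(fixed))
```

In this simulation the stale string ends in `/drm/card2`, so the parsed bus ID is the one belonging to card2, which matches the duplicated `0000:1A:00.0` shown above. Clearing the buffer before each read yields the card3 target unchanged.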
After the fix:
    ================================== PCI Bus ID ==================================
    GPU[0] : PCI Bus: 0000:12:00.0
    GPU[1] : PCI Bus: 0000:17:00.0
    GPU[2] : PCI Bus: 0000:1A:00.0
    GPU[3] : PCI Bus: 0000:1D:00.0
    GPU[4] : PCI Bus: 0000:89:00.0
    GPU[5] : PCI Bus: 0000:8E:00.0
    GPU[6] : PCI Bus: 0000:91:00.0
    GPU[7] : PCI Bus: 0000:94:00.0
Thanks! We've pulled this patch internally, and it will be in the ROCm 5.2 release (and hopefully 5.1 as well).