ROCm / ROCm

AMD ROCm™ Software - GitHub Home
https://rocm.docs.amd.com

Radeon Instinct MI25 MxGPU not detected by ROCm #1638

Closed LeonSpark closed 9 months ago

LeonSpark commented 2 years ago

Problem:

Neither /opt/rocm-4.5.0/bin/rocminfo nor /opt/rocm-4.5.0/opencl/bin/clinfo detects my MI25 GPU. Attached: clinfo.txt, rocminfo.txt

Environment

ROCmSupport commented 2 years ago

Thanks @LeonSpark for reaching out. I understand the problem. I tried an MI25 on Ubuntu 20.04.3 and was NOT able to reproduce the issue. Could you please check once more on the same machine after uninstalling and reinstalling ROCm? Please also share the dmesg output.

LeonSpark commented 2 years ago

Thanks @ROCmSupport for the reply! I have retried on Ubuntu 20.04.3. The installation script is below:

sudo apt-get update
sudo apt-get install wget gnupg2
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/21.40.2/ubuntu/focal/amdgpu-install_21.40.2.40502-1_all.deb
sudo apt-get install ./amdgpu-install_21.40.2.40502-1_all.deb
sudo apt-get update
sudo amdgpu-install --usecase=rocm
sudo reboot

This time, neither rocminfo nor clinfo outputs the expected results:

/opt/rocm-4.5.2/bin/rocminfo
ROCk module is NOT loaded, possibly no GPU devices

/opt/rocm-4.5.2/opencl/bin/clinfo
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.2 AMD-APP (3361.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback
  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               0

The dmesg output is attached for your reference: dmesg.txt. Many thanks.
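
The rocminfo message above points at the kernel driver rather than at the user-space ROCm packages. A minimal sketch for confirming whether the amdgpu module is actually loaded, assuming a Linux system that exposes /proc/modules; the helper name module_loaded and its optional second argument are hypothetical and exist only for illustration:

```shell
# Return 0 if the named kernel module appears in a /proc/modules-style listing.
# The optional second argument (a file path) is an assumption added so the
# check can be exercised against a saved listing; it defaults to /proc/modules.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if module_loaded amdgpu; then
  echo "amdgpu is loaded"
else
  echo "amdgpu is NOT loaded; try: sudo modprobe amdgpu, then inspect dmesg"
fi
```

If the module is missing, loading it by hand with modprobe and reading the newest dmesg lines usually shows why the driver refused to bind.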

LeonSpark commented 2 years ago

One more thing: after restarting, the amdgpu driver is not loaded, although I'm sure the DKMS package is installed.

sudo amdgpu-install --usecase=dkms

Reading package lists... Done
Building dependency tree
Reading state information... Done
linux-headers-5.11.0-1022-azure is already the newest version (5.11.0-1022.23~20.04.1).
linux-modules-extra-5.11.0-1022-azure is already the newest version (5.11.0-1022.23~20.04.1).
amdgpu-dkms is already the newest version (1:5.11.32.40502-1350682).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.

sudo lshw -class display

 *-display
       description: VGA compatible controller
       product: Hyper-V virtual VGA
       vendor: Microsoft Corporation
       physical id: 8
       bus info: pci@0000:00:08.0
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: vga_controller bus_master rom
       configuration: driver=hyperv_fb latency=0
       resources: irq:11 memory:f8000000-fbffffff memory:c0000-dffff
  *-display UNCLAIMED
       description: VGA compatible controller
       product: Vega 10 [Radeon Instinct MI25 MxGPU]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 1
       bus info: pci@af8c:00:00.0
       version: 00
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi msix vga_controller cap_list
       configuration: latency=0
       resources: iomemory:f0-ef iomemory:f0-ef memory:fe0000000-fefffffff memory:ff0000000-ff01fffff memory:40080000-400fffff

The AMD video device is UNCLAIMED, i.e. no kernel driver is bound to it.
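
To spot this condition without reading the whole listing, the lshw text can be filtered for UNCLAIMED display nodes. This is only a sketch: the function name unclaimed_displays is hypothetical, and it assumes the plain-text layout shown above (a "*-display" header line followed by a "product:" line).

```shell
# Print the product name of every display node that lshw marks UNCLAIMED.
# Reads `lshw -class display` text on stdin.
unclaimed_displays() {
  awk '
    # A line starting with "*-" opens a new node; remember whether it is UNCLAIMED.
    /^[ \t]*\*-/ { unclaimed = ($0 ~ /UNCLAIMED/) }
    # Inside an UNCLAIMED node, strip the "product:" label and print the rest.
    unclaimed && /^[ \t]*product:/ {
      sub(/^[ \t]*product:[ \t]*/, "")
      print
    }
  '
}

# Usage on a live system:
#   sudo lshw -class display | unclaimed_displays
```

On a live machine, lspci -k gives a similar view: a device with a bound driver has a "Kernel driver in use:" line, and an unbound one does not.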

suijth commented 2 years ago

Facing the exact same issue.
Azure machine: NV4as_v4
OS: Ubuntu 18.04

LeonSpark commented 2 years ago

@ROCmSupport Could someone help take a look at the dmesg output? Thanks!

ROCmSupport commented 2 years ago

Hi @LeonSpark, I have gone through the dmesg output and found a CPU soft lockup:

watchdog: BUG: soft lockup - CPU#2 stuck for 40s! [swapper/0:1]

In my experience this is not a common issue. It appears to be specific to your configuration; I have not seen it on my configurations or anywhere else, and I do not think it is caused by ROCm. I saw this kind of problem once or twice on one specific machine a long time ago (about a year ago), and it went away on its own after a few days.
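
For anyone triaging a similar log, a small sketch that counts soft-lockup reports per CPU in saved dmesg text. The function name soft_lockup_summary is hypothetical, and the pattern assumes the message format quoted above.

```shell
# Summarise soft-lockup reports in dmesg-style text read from stdin.
# Prints one line per CPU in the form "<count> CPU#<n>".
soft_lockup_summary() {
  grep -o 'soft lockup - CPU#[0-9]*' |   # keep only the matching fragment
    sed 's/.*- //' |                     # reduce it to "CPU#<n>"
    sort | uniq -c |                     # count occurrences per CPU
    awk '{ print $1, $2 }'               # normalise uniq's leading padding
}

# Usage:
#   soft_lockup_summary < dmesg.txt
```

A lockup that repeats on the same CPU during early boot (the swapper task) is a hint to look at the VM or kernel configuration before the GPU stack.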

ROCmSupport commented 2 years ago

Hi @LeonSpark, I hope this issue is fixed now. I recommend checking with the latest ROCm 5.1 and posting an update. Thank you.

nartmada commented 9 months ago

Closing the ticket as it has become stale. @LeonSpark, please open another ticket for any new issue. Thanks.