Closed LeonSpark closed 9 months ago
Thanks @LeonSpark for reaching out. I understand the problem. I tried an MI25 on Ubuntu 20.04.3 and was NOT able to reproduce the issue. Could you please check again on the same machine after uninstalling and reinstalling ROCm? Please also share the dmesg output.
Thanks @ROCmSupport for the reply! I have retried on Ubuntu 20.04.3. The installation script is below:
sudo apt-get update
sudo apt-get install wget gnupg2
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/21.40.2/ubuntu/focal/amdgpu-install_21.40.2.40502-1_all.deb
sudo apt-get install ./amdgpu-install_21.40.2.40502-1_all.deb
sudo apt-get update
sudo amdgpu-install --usecase=rocm
sudo reboot
This time, neither rocminfo nor clinfo outputs the expected results:
/opt/rocm-4.5.2/bin/rocminfo
ROCk module is NOT loaded, possibly no GPU devices
/opt/rocm-4.5.2/opencl/bin/clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.2 AMD-APP (3361.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 0
dmesg output is attached for your reference, many thanks. dmesg.txt
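The "ROCk module is NOT loaded" message from rocminfo appears to indicate that the amdgpu kernel module is not loaded. A minimal sketch for confirming that from lsmod output follows; module_loaded is a hypothetical helper name used only for illustration and is not part of ROCm.

```shell
# module_loaded NAME
# Reads `lsmod` output on stdin and succeeds (exit 0) if NAME appears
# in the first column. The header line is skipped.
# NOTE: hypothetical helper for illustration; not a ROCm tool.
module_loaded() {
  awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# Usage (on the affected machine):
#   lsmod | module_loaded amdgpu && echo "amdgpu loaded" || echo "amdgpu NOT loaded"
```

If the module is reported as not loaded, running `sudo modprobe amdgpu` and then re-reading dmesg is one way to see why the load fails.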
One more thing: after rebooting, the amdgpu driver is not loaded, although I'm sure the DKMS package is installed:
sudo amdgpu-install --usecase=dkms
Reading package lists... Done
Building dependency tree
Reading state information... Done
linux-headers-5.11.0-1022-azure is already the newest version (5.11.0-1022.23~20.04.1).
linux-modules-extra-5.11.0-1022-azure is already the newest version (5.11.0-1022.23~20.04.1).
amdgpu-dkms is already the newest version (1:5.11.32.40502-1350682).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
sudo lshw -class display
*-display
description: VGA compatible controller
product: Hyper-V virtual VGA
vendor: Microsoft Corporation
physical id: 8
bus info: pci@0000:00:08.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master rom
configuration: driver=hyperv_fb latency=0
resources: irq:11 memory:f8000000-fbffffff memory:c0000-dffff
*-display UNCLAIMED
description: VGA compatible controller
product: Vega 10 [Radeon Instinct MI25 MxGPU]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 1
bus info: pci@af8c:00:00.0
version: 00
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi msix vga_controller cap_list
configuration: latency=0
resources: iomemory:f0-ef iomemory:f0-ef memory:fe0000000-fefffffff memory:ff0000000-ff01fffff memory:40080000-400fffff
The MI25 display device is UNCLAIMED, i.e. no kernel driver is bound to it.
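The UNCLAIMED check can be automated. The following is a sketch that assumes the `lshw -class display` text layout shown in this thread; list_unclaimed_displays is a hypothetical helper name.

```shell
# list_unclaimed_displays
# Reads `lshw -class display` output on stdin and prints the product
# string of every display device whose header line is marked UNCLAIMED.
# NOTE: hypothetical helper; assumes the lshw text layout shown above.
list_unclaimed_displays() {
  awk '
    # A new "*-display" header starts a device; remember whether it is unclaimed.
    /^[[:space:]]*\*-display/ { unclaimed = ($0 ~ /UNCLAIMED/) }
    # Print the product line only for unclaimed devices.
    unclaimed && /^[[:space:]]*product:/ {
      sub(/^[[:space:]]*product:[[:space:]]*/, "")
      print
    }
  '
}

# Usage (on the affected machine):
#   sudo lshw -class display | list_unclaimed_displays
```

For the claimed device, lshw reports the bound driver in the configuration line (driver=hyperv_fb above); `lspci -k` shows the same information as "Kernel driver in use".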
I'm facing the exact same issue. Azure machine: NV4as_v4. OS: Ubuntu 18.04.
@ROCmSupport Could someone help take a look at the dmesg output? Thanks!
Hi @LeonSpark, I went through the dmesg output and found a CPU soft lockup: watchdog: BUG: soft lockup - CPU#2 stuck for 40s! [swapper/0:1] In my experience this is not a common issue. It appears to be specific to your configuration; I have not seen it on my configurations or anywhere else, and I do not think it is caused by ROCm. I saw this kind of problem once or twice on one specific machine about a year ago, and it went away on its own after a few days.
Hi @LeonSpark, I hope this issue is fixed now. I recommend checking with the latest ROCm 5.1 and posting an update. Thank you.
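To pull the relevant lines out of a long kernel log, a filter along these lines may help. It is only a sketch: kernel_log_hints is a hypothetical helper name, and the two search patterns are assumptions based on the messages discussed in this thread.

```shell
# kernel_log_hints
# Reads kernel log text on stdin and prints lines that mention a soft
# lockup or the amdgpu driver (case-insensitive). Prints nothing, and
# still succeeds, when there are no matches.
# NOTE: hypothetical helper; the patterns are illustrative assumptions.
kernel_log_hints() {
  grep -i -E 'soft lockup|amdgpu' || true
}

# Usage:
#   sudo dmesg | kernel_log_hints
#   kernel_log_hints < dmesg.txt
```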
Closing the ticket as it has become stale. @LeonSpark, please open another ticket for any new issue. Thanks.
Problem:
Neither /opt/rocm-4.5.0/bin/rocminfo nor /opt/rocm-4.5.0/opencl/bin/clinfo detects my MI25 GPU. clinfo.txt rocminfo.txt
Environment
Azure VM SKU: Standard NV16as v4
Linux Distribution Information:
Kernel Information
GPU
AMD kernel driver is loaded
@ROCmSupport I followed the ROCm installation guide for 4.5, but the MI25 GPU is not detected by ROCm. I'm experimenting on an Azure VM with an AMD MI25 GPU. Unlike in other threads, this VM also has a Hyper-V compatible VGA device, and Microsoft provides no official driver for it on Linux platforms. Could you please help point out where I went wrong? Thanks a lot!