No OpenCL Devices are detected

alfredopalhares commented 3 years ago

Hello, I have 4 AMD RX 480 cards:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)

This is on Ubuntu 20.04 with all updates applied. I followed the install instructions, but clinfo does not detect any cards:

/opt/rocm/opencl/bin/clinfo

Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (3212.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback

  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               0

No OpenCL applications work, and no devices are detected. Can you help me diagnose this problem?

johnbridgman commented 3 years ago

Can you please check which kernel version you are running (e.g. the output of "uname -a")? Canonical pushed a kernel update out to some 20.04.1 users bumping it to 5.8, and the current DKMS kernel will not install on 5.8.
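For anyone who wants to script that check, here is a minimal sketch. The helper name `kernel_at_least` is made up for illustration, and it assumes GNU `sort` with the `-V` (version sort) option is available. The 5.8 threshold is simply the one mentioned above.

```shell
# Return success (0) if kernel release $1 is at least version $2.
# Assumption: GNU sort with -V (version sort) is installed.
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Compare the running kernel against the 5.8 threshold.
if kernel_at_least "$(uname -r)" "5.8"; then
    echo "kernel is 5.8 or newer: check that the DKMS driver actually installed"
else
    echo "kernel is older than 5.8"
fi
```

The comparison only looks at the release string, so it says nothing about whether the DKMS module was built successfully.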

alfredopalhares commented 3 years ago

Thank you for your reply.

$ uname -a
Linux myhost 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

So it's a 5.4 kernel. Also, here are my AMD modules:

$ lsmod | grep amd
amdgpu               5824512  0
amd_iommu_v2           20480  1 amdgpu
amd_sched              32768  1 amdgpu
amdttm                102400  1 amdgpu
amdkcl                 24576  2 amdttm,amdgpu
i2c_algo_bit           16384  2 amdgpu,i915
drm_kms_helper        184320  2 amdgpu,i915
drm                   491520  7 drm_kms_helper,amd_sched,amdttm,amdgpu,i915,amdkcl

Thank you for your help so far.

alfredopalhares commented 3 years ago

Hello? Any ideas how I can debug this further?

fxkamd commented 3 years ago

There are some environment variables that enable more debug output that can help diagnose the problem. I can never remember the one in OpenCL. This one is more low-level: HSAKMT_DEBUG_LEVEL=7
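A small sketch of how the variable could be applied to any command while keeping the debug messages separate from normal output. The wrapper name `run_with_thunk_debug` and the log file name are hypothetical, and the assumption that the debug messages go to stderr should be verified on your system.

```shell
# Run a command with HSAKMT_DEBUG_LEVEL=7 and write its stderr to a log file.
# Usage: run_with_thunk_debug <stderr-log-file> <command> [args...]
# Assumption: the low-level debug messages are printed on stderr.
run_with_thunk_debug() {
    log_file=$1
    shift
    env HSAKMT_DEBUG_LEVEL=7 "$@" 2>"$log_file"
}

# Example (paths as used in this thread):
#   run_with_thunk_debug thunk.log /opt/rocm/opencl/bin/clinfo
#   wc -l < thunk.log    # 0 lines means nothing was written to stderr
```

Keeping stderr in its own file makes it easy to tell an empty debug log apart from messages that merely scrolled past.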

alfredopalhares commented 3 years ago

@fxkamd thank you for your help. What command are you referring to?

clinfo has exactly the same output:

HSAKMT_DEBUG_LEVEL=7 /opt/rocm/opencl/bin/clinfo
Profiling of privileged blocks is not available.
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (3212.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback

  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               0

alfredopalhares commented 3 years ago

Here is the rocminfo output:

sudo HSAKMT_DEBUG_LEVEL=7 /opt/rocm/bin/rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    Intel(R) Celeron(R) CPU G3900 @ 2.80GHz
  Uuid:                    CPU-XX
  Marketing Name:          Intel(R) Celeron(R) CPU G3900 @ 2.80GHz
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2800
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            2
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    7828164(0x7772c4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    7828164(0x7772c4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
    N/A
*** Done ***

fxkamd commented 3 years ago

That's very strange. This should have produced some output on stderr. If not, it means it's not even getting to the low-level Thunk functions and is failing somewhere higher up the stack.

Or you could check one level lower to see what KFD is reporting: ls -lR /sys/class/kfd/kfd/topology

Also check your kernel log with dmesg.
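As a rough sketch, the topology check can be reduced to counting nodes that report a GPU. The function name `count_kfd_gpu_nodes` is hypothetical, and the code assumes that CPU-only nodes report `0` in their `gpu_id` file while GPU nodes report a non-zero id.

```shell
# Count KFD topology nodes whose gpu_id is non-zero.
# Assumption: CPU-only nodes report gpu_id 0, GPU nodes a non-zero id.
count_kfd_gpu_nodes() {
    topo=${1:-/sys/class/kfd/kfd/topology}
    count=0
    for f in "$topo"/nodes/*/gpu_id; do
        [ -r "$f" ] || continue
        id=$(cat "$f")
        # Skip empty reads and the CPU node.
        if [ -n "$id" ] && [ "$id" != "0" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Example: count_kfd_gpu_nodes    # prints the number of GPU nodes KFD exposes
```

A result of 0 would mean KFD did not register any GPU, which points at the kernel driver rather than at the OpenCL runtime.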

alfredopalhares commented 3 years ago

@fxkamd Thank you, here are the outputs

ls:

ls -lR /sys/class/kfd/kfd/topology
/sys/class/kfd/kfd/topology:
total 0
-r--r--r-- 1 root root 4096 Feb  8 16:29 generation_id
drwxr-xr-x 3 root root    0 Feb  8 16:29 nodes
-r--r--r-- 1 root root 4096 Feb  8 16:29 system_properties

/sys/class/kfd/kfd/topology/nodes:
total 0
drwxr-xr-x 6 root root 0 Feb  8 16:29 0

/sys/class/kfd/kfd/topology/nodes/0:
total 0
drwxr-xr-x 2 root root    0 Feb  9 12:01 caches
-r--r--r-- 1 root root 4096 Feb  8 16:29 gpu_id
drwxr-xr-x 2 root root    0 Feb  9 12:01 io_links
drwxr-xr-x 3 root root    0 Feb  8 16:29 mem_banks
-r--r--r-- 1 root root 4096 Feb  9 12:01 name
drwxr-xr-x 2 root root    0 Feb  9 12:01 perf
-r--r--r-- 1 root root 4096 Feb  8 16:29 properties

/sys/class/kfd/kfd/topology/nodes/0/caches:
total 0

/sys/class/kfd/kfd/topology/nodes/0/io_links:
total 0

/sys/class/kfd/kfd/topology/nodes/0/mem_banks:
total 0
drwxr-xr-x 2 root root 0 Feb  8 16:29 0

/sys/class/kfd/kfd/topology/nodes/0/mem_banks/0:
total 0
-r--r--r-- 1 root root 4096 Feb  8 16:29 properties

/sys/class/kfd/kfd/topology/nodes/0/perf:
total 0

dmesg:

dmesg | grep -i amdgpu
[    2.734194] [drm] amdgpu kernel modesetting enabled.
[    2.734198] [drm] amdgpu version: 5.6.19
[    2.734261] amdgpu: CRAT table not found
[    2.734263] amdgpu: Virtual CRAT table created for CPU
[    2.734278] amdgpu: Topology: Add CPU node
[    2.736844] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x2fe0000000 -> 0x2fefffffff
[    2.736850] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x2ff0000000 -> 0x2ff01fffff
[    2.736852] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf7300000 -> 0xf733ffff
[    2.736873] amdgpu 0000:01:00.0: enabling device (0000 -> 0003)
[    2.737008] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    2.737010] amdgpu 0000:01:00.0: amdgpu: set kernel compute queue number to 8 due to invalid parameter provided by user
[    2.737052] kfd kfd: amdgpu: skipped device 1002:67df, PCI rejects atomics
[    2.978360] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    2.978365] amdgpu: ATOM BIOS: 113-BE366EU-Z46
[    3.108250] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x2ff0000000-0x2ff01fffff 64bit pref]
[    3.108254] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x2fe0000000-0x2fefffffff 64bit pref]
[    3.108289] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x2000000000-0x21ffffffff 64bit pref]
[    3.108300] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x2200000000-0x22001fffff 64bit pref]
[    3.108337] amdgpu 0000:01:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[    3.108340] amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[    3.108512] [drm] amdgpu: 8192M of VRAM memory ready
[    3.108517] [drm] amdgpu: 8192M of GTT memory ready.
[    3.110309] amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
[    3.313169] amdgpu 0000:01:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 9, active_cu_number 36
[    3.317160] [drm] Initialized amdgpu 3.40.0 20150101 for 0000:01:00.0 on minor 1
[    3.317240] amdgpu 0000:05:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x2fc0000000 -> 0x2fcfffffff
[    3.317243] amdgpu 0000:05:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x2fd0000000 -> 0x2fd01fffff
[    3.317245] amdgpu 0000:05:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf7200000 -> 0xf723ffff
[    3.317266] amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
[    3.317371] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    3.317421] kfd kfd: amdgpu: skipped device 1002:67df, PCI rejects atomics
[    3.573530] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    3.573534] amdgpu: ATOM BIOS: 113-BE366EU-Z46
[    3.708367] amdgpu 0000:05:00.0: BAR 2: releasing [mem 0x2fd0000000-0x2fd01fffff 64bit pref]
[    3.708370] amdgpu 0000:05:00.0: BAR 0: releasing [mem 0x2fc0000000-0x2fcfffffff 64bit pref]
[    3.708421] amdgpu 0000:05:00.0: BAR 0: no space for [mem size 0x200000000 64bit pref]
[    3.708424] amdgpu 0000:05:00.0: BAR 0: failed to assign [mem size 0x200000000 64bit pref]
[    3.708427] amdgpu 0000:05:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[    3.708429] amdgpu 0000:05:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[    3.708486] amdgpu 0000:05:00.0: BAR 0: assigned [mem 0x2fc0000000-0x2fcfffffff 64bit pref]
[    3.708502] amdgpu 0000:05:00.0: BAR 2: assigned [mem 0x2fd0000000-0x2fd01fffff 64bit pref]
[    3.708529] amdgpu 0000:05:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[    3.708531] amdgpu 0000:05:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[    3.708563] [drm] amdgpu: 8192M of VRAM memory ready
[    3.708566] [drm] amdgpu: 8192M of GTT memory ready.
[    3.710458] amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
[    3.908544] amdgpu 0000:05:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 9, active_cu_number 36
[    3.912299] [drm] Initialized amdgpu 3.40.0 20150101 for 0000:05:00.0 on minor 2
[    3.912378] amdgpu 0000:08:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x2fa0000000 -> 0x2fafffffff
[    3.912381] amdgpu 0000:08:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x2fb0000000 -> 0x2fb01fffff
[    3.912384] amdgpu 0000:08:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf7100000 -> 0xf713ffff
[    3.912406] amdgpu 0000:08:00.0: enabling device (0000 -> 0003)
[    3.912513] amdgpu 0000:08:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    3.912562] kfd kfd: amdgpu: skipped device 1002:67df, PCI rejects atomics
[    4.174650] amdgpu 0000:08:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    4.174654] amdgpu: ATOM BIOS: 113-BE366EU-Z46
[    4.304406] amdgpu 0000:08:00.0: BAR 2: releasing [mem 0x2fb0000000-0x2fb01fffff 64bit pref]
[    4.304409] amdgpu 0000:08:00.0: BAR 0: releasing [mem 0x2fa0000000-0x2fafffffff 64bit pref]
[    4.304464] amdgpu 0000:08:00.0: BAR 0: no space for [mem size 0x200000000 64bit pref]
[    4.304466] amdgpu 0000:08:00.0: BAR 0: failed to assign [mem size 0x200000000 64bit pref]
[    4.304469] amdgpu 0000:08:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[    4.304471] amdgpu 0000:08:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[    4.304529] amdgpu 0000:08:00.0: BAR 0: assigned [mem 0x2fa0000000-0x2fafffffff 64bit pref]
[    4.304547] amdgpu 0000:08:00.0: BAR 2: assigned [mem 0x2fb0000000-0x2fb01fffff 64bit pref]
[    4.304572] amdgpu 0000:08:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[    4.304574] amdgpu 0000:08:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[    4.304601] [drm] amdgpu: 8192M of VRAM memory ready
[    4.304605] [drm] amdgpu: 8192M of GTT memory ready.
[    4.306839] amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
[    4.507827] amdgpu 0000:08:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 9, active_cu_number 36
[    4.511415] [drm] Initialized amdgpu 3.40.0 20150101 for 0000:08:00.0 on minor 3
[    4.511491] amdgpu 0000:09:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x2f80000000 -> 0x2f8fffffff
[    4.511494] amdgpu 0000:09:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x2f90000000 -> 0x2f901fffff
[    4.511497] amdgpu 0000:09:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf7000000 -> 0xf703ffff
[    4.511520] amdgpu 0000:09:00.0: enabling device (0000 -> 0003)
[    4.511626] amdgpu 0000:09:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    4.511673] kfd kfd: amdgpu: skipped device 1002:67df, PCI rejects atomics
[    4.767793] amdgpu 0000:09:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    4.767796] amdgpu: ATOM BIOS: 113-BE366EU-Z46
[    4.900318] amdgpu 0000:09:00.0: BAR 2: releasing [mem 0x2f90000000-0x2f901fffff 64bit pref]
[    4.900321] amdgpu 0000:09:00.0: BAR 0: releasing [mem 0x2f80000000-0x2f8fffffff 64bit pref]
[    4.900373] amdgpu 0000:09:00.0: BAR 0: no space for [mem size 0x200000000 64bit pref]
[    4.900375] amdgpu 0000:09:00.0: BAR 0: failed to assign [mem size 0x200000000 64bit pref]
[    4.900378] amdgpu 0000:09:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[    4.900380] amdgpu 0000:09:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[    4.900438] amdgpu 0000:09:00.0: BAR 0: assigned [mem 0x2f80000000-0x2f8fffffff 64bit pref]
[    4.900454] amdgpu 0000:09:00.0: BAR 2: assigned [mem 0x2f90000000-0x2f901fffff 64bit pref]
[    4.900477] amdgpu 0000:09:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[    4.900480] amdgpu 0000:09:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[    4.900507] [drm] amdgpu: 8192M of VRAM memory ready
[    4.900510] [drm] amdgpu: 8192M of GTT memory ready.
[    4.902337] amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
[    5.100284] amdgpu 0000:09:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 9, active_cu_number 36
[    5.104236] [drm] Initialized amdgpu 3.40.0 20150101 for 0000:09:00.0 on minor 4
[    8.234236] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[    8.235903] snd_hda_intel 0000:05:00.1: bound 0000:05:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[    8.238932] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[    8.241749] snd_hda_intel 0000:09:00.1: bound 0000:09:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

fxkamd commented 3 years ago

Here is your problem: [ 3.317421] kfd kfd: amdgpu: skipped device 1002:67df, PCI rejects atomics

KFD (more specifically the AQL firmware) on your Ellesmere GPUs requires PCIe v3 with support for atomic operations. Your mainboard doesn't seem to support that. On some mainboards, some slots support PCIe v3, while others do not. You could try experimenting with that. But given the age of your CPU, your chipset may just not have PCIe v3 support at all.
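To spot this quickly in a long kernel log, a filter along these lines may help. It is only a sketch: `skipped_kfd_devices` is a made-up name, and the pattern is based solely on the message format quoted above.

```shell
# Print the PCI vendor:device id of every device that KFD skipped
# with the "PCI rejects atomics" message. Reads kernel log text on stdin.
skipped_kfd_devices() {
    sed -n 's/.*skipped device \([0-9a-fA-F]*:[0-9a-fA-F]*\), PCI rejects atomics.*/\1/p'
}

# Example: dmesg | skipped_kfd_devices | sort | uniq -c
```

Piping through `sort | uniq -c` gives one line per device id with the number of times it was skipped.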

alfredopalhares commented 3 years ago

I did try every slot. My board is an ASRock H110 Pro BTC+.

Even the amdpro drivers do not work.

fxkamd commented 3 years ago

I looked up the spec of your mainboard: https://www.asrock.com/mb/Intel/H110%20Pro%20BTC+/index.asp lists "1 PCIe 3.0 x16, 12 PCIe 2.0 x1". Only the single x16 slot supports PCIe 3. The remaining 12 x1 slots only support PCIe 2. If you plug one of the Ellesmere cards into the x16 slot, it should work with ROCm. Unfortunately the x1 slots on your board are not suitable for ROCm.
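The negotiated link speed of each card can also be read from sysfs, which gives a rough way to see which slot a card ended up in. The sketch below uses hypothetical helper names and assumes the `current_link_speed` attribute is present under `/sys/bus/pci/devices/`. Note that link speed is only a proxy: atomics support is a separate capability, and a GPU may lower its link speed while idle.

```shell
# Map a PCIe link speed string (e.g. "8.0 GT/s PCIe" or "5 GT/s") to a generation.
pcie_gen_from_speed() {
    case "$1" in
        2.5*) echo 1 ;;
        5*)   echo 2 ;;
        8*)   echo 3 ;;
        16*)  echo 4 ;;
        32*)  echo 5 ;;
        *)    echo unknown ;;
    esac
}

# Print "<PCI address> gen<N>" for each PCI address given as an argument.
# Assumption: the sysfs attribute current_link_speed exists for the device.
report_link_gen() {
    for bdf in "$@"; do
        f=/sys/bus/pci/devices/$bdf/current_link_speed
        [ -r "$f" ] || { echo "$bdf unreadable"; continue; }
        echo "$bdf gen$(pcie_gen_from_speed "$(cat "$f")")"
    done
}

# Example: report_link_gen 0000:01:00.0 0000:05:00.0 0000:08:00.0 0000:09:00.0
```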

fxkamd commented 3 years ago

The AMD Linux Pro driver should work though. As far as I know, it uses a legacy OpenCL implementation that does not depend on ROCm. In the 20.45 release it moved to a ROCm-based OpenCL implementation, but only for Vega and later GPUs.

avimanyu786 commented 3 years ago

Hi @alfredopalhares,

I was having the same issue and the mesa-opencl-icd package fixed it for me. I'm using ROCm 4.2 on Ubuntu 18.04.5 LTS. The package seems to be available for 20.04 as well.

sudo apt install mesa-opencl-icd

Hope it works out for you. All the best!

avimanyu786 commented 3 years ago

I investigated this further. The command is referring to the Mesa version (mesa.icd) located at /etc/OpenCL/vendors/. That isn't supposed to happen. I've done a clean install of ROCm 4.2. https://github.com/RadeonOpenCompute/ROCm/issues/511 seems to be a good reference to narrow down a solution.
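To see which OpenCL implementations are registered, the ICD files in that directory can be listed together with the library each one names. This is a sketch with a hypothetical function name, and it assumes each `.icd` file holds the library name or path on its first line.

```shell
# Print each ICD registration file and the library named on its first line.
# Assumption: every .icd file contains a library name or path on line 1.
list_opencl_icds() {
    dir=${1:-/etc/OpenCL/vendors}
    for icd in "$dir"/*.icd; do
        [ -r "$icd" ] || continue
        printf '%s: %s\n' "$(basename "$icd")" "$(head -n 1 "$icd")"
    done
}

# Example: list_opencl_icds
```

If mesa.icd shows up there, devices reported by clinfo may come from the Mesa implementation rather than from ROCm.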

terU3760 commented 3 years ago

@all, I have a similar problem. When I ran /opt/rocm/opencl/bin/clinfo, I got this output:

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3275.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               0

When I ran sudo /opt/rocm/bin/rocminfo, I got this output:

ROCk module is loaded
HSA Error:  Incompatible kernel and userspace, Vega 20 disabled. Upgrade amdgpu.
HSA Error:  Incompatible kernel and userspace, Vega 20 disabled. Upgrade amdgpu.
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32871168(0x1f59300) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32871168(0x1f59300) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*** Done ***

But there are in fact two AMD GPUs in my computer. When I ran lsmod | grep amd, I got this output:

edac_mce_amd           32768  0
amdgpu               4579328  41
amd_iommu_v2           20480  1 amdgpu
gpu_sched              32768  1 amdgpu
ttm                   106496  1 amdgpu
drm_kms_helper        180224  1 amdgpu
drm                   487424  22 gpu_sched,drm_kms_helper,amdgpu,ttm
i2c_algo_bit           16384  2 igb,amdgpu
gpio_amdpt             20480  0
gpio_generic           20480  1 gpio_amdpt

When I ran dmesg | grep -i amdgpu, I got this output:

[    1.196753] [drm] amdgpu kernel modesetting enabled.
[    1.196907] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[    1.196908] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[    1.196909] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfcd00000 -> 0xfcd7ffff
[    1.196911] fb0: switching to amdgpudrmfb from VESA VGA
[    1.196987] amdgpu 0000:0a:00.0: vgaarb: deactivate vga console
[    1.197186] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[    1.197265] amdgpu 0000:0a:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    1.197266] amdgpu 0000:0a:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    1.197267] amdgpu 0000:0a:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    1.197342] [drm] amdgpu: 16368M of VRAM memory ready
[    1.197344] [drm] amdgpu: 16368M of GTT memory ready.
[    1.197652] amdgpu 0000:0a:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    1.197653] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    1.199147] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[    2.105430] fbcon: amdgpudrmfb (fb0) is primary device
[    2.127315] amdgpu 0000:0a:00.0: fb0: amdgpudrmfb frame buffer device
[    2.133059] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[    2.133060] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    2.133061] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    2.133061] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    2.133062] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    2.133063] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    2.133063] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    2.133064] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    2.133065] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    2.133065] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    2.133066] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    2.133067] amdgpu 0000:0a:00.0: ring page0 uses VM inv eng 1 on hub 1
[    2.133067] amdgpu 0000:0a:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    2.133068] amdgpu 0000:0a:00.0: ring page1 uses VM inv eng 5 on hub 1
[    2.133068] amdgpu 0000:0a:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    2.133069] amdgpu 0000:0a:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    2.133069] amdgpu 0000:0a:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    2.133070] amdgpu 0000:0a:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    2.133071] amdgpu 0000:0a:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    2.133071] amdgpu 0000:0a:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    2.133072] amdgpu 0000:0a:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    2.133072] amdgpu 0000:0a:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    2.133073] amdgpu 0000:0a:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    2.133423] Detected AMDGPU DF Counters. # of Counters = 4.
[    2.133440] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:0a:00.0 on minor 0
[    2.133457] amdgpu 0000:0d:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xc0000000 -> 0xcfffffff
[    2.133457] amdgpu 0000:0d:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xd0000000 -> 0xd01fffff
[    2.133458] amdgpu 0000:0d:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfcb00000 -> 0xfcb7ffff
[    2.133467] amdgpu 0000:0d:00.0: enabling device (0000 -> 0003)
[    2.232072] amdgpu 0000:0d:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    2.232073] amdgpu 0000:0d:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    2.232074] amdgpu 0000:0d:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    2.232089] [drm] amdgpu: 16368M of VRAM memory ready
[    2.232090] [drm] amdgpu: 16368M of GTT memory ready.
[    2.232323] amdgpu 0000:0d:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    2.232324] amdgpu 0000:0d:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    2.233892] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[    3.034037] amdgpu 0000:0d:00.0: ring gfx uses VM inv eng 0 on hub 0
[    3.034038] amdgpu 0000:0d:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    3.034038] amdgpu 0000:0d:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    3.034039] amdgpu 0000:0d:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    3.034040] amdgpu 0000:0d:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    3.034040] amdgpu 0000:0d:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    3.034041] amdgpu 0000:0d:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    3.034041] amdgpu 0000:0d:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    3.034042] amdgpu 0000:0d:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    3.034042] amdgpu 0000:0d:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    3.034043] amdgpu 0000:0d:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    3.034044] amdgpu 0000:0d:00.0: ring page0 uses VM inv eng 1 on hub 1
[    3.034045] amdgpu 0000:0d:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    3.034045] amdgpu 0000:0d:00.0: ring page1 uses VM inv eng 5 on hub 1
[    3.034046] amdgpu 0000:0d:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    3.034046] amdgpu 0000:0d:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    3.034047] amdgpu 0000:0d:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    3.034048] amdgpu 0000:0d:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    3.034048] amdgpu 0000:0d:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    3.034049] amdgpu 0000:0d:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    3.034050] amdgpu 0000:0d:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    3.034051] amdgpu 0000:0d:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    3.034051] amdgpu 0000:0d:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    3.034550] Detected AMDGPU DF Counters. # of Counters = 4.
[    3.034567] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:0d:00.0 on minor 1
[   11.286234] snd_hda_intel 0000:0d:00.1: bound 0000:0d:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[   11.287645] snd_hda_intel 0000:0a:00.1: bound 0000:0a:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

What could be the cause, and how can I solve it?

ca5ua1 commented 3 years ago

I did try every slot. My board is an ASRock H110 Pro BTC+.

I have exactly the same situation with all driver versions.

Dantali0n commented 3 years ago

I have the same problem: since 4.5 my GPU is no longer detected with OpenCL.

PatrickMSM commented 2 years ago

Same issue here. I am using an RX 580 on Ubuntu 20.04. This is my clinfo output:

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.2 AMD-APP (3361.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               0

And this is my rocminfo output:

ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

I installed the drivers using this: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#installing-a-rocm-package-from-a-debian-repository
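
One thing worth ruling out when rocminfo fails like this is whether the current user can open the ROCm device nodes at all. This is an assumption about a possible cause, not something confirmed in this thread. The sketch below only reports whether `/dev/kfd` and the `/dev/dri/renderD*` nodes exist and are readable and writable by the user running it; `check_device_access` is a hypothetical helper name.

```python
import glob
import os


def check_device_access(paths):
    """Return {path: status}, where status is 'missing', 'no access' or 'ok'."""
    result = {}
    for path in paths:
        if not os.path.exists(path):
            result[path] = "missing"
        elif os.access(path, os.R_OK | os.W_OK):
            # The current user can both read and write the node.
            result[path] = "ok"
        else:
            result[path] = "no access"
    return result


if __name__ == "__main__":
    # /dev/kfd is the compute node; renderD* are the DRM render nodes.
    nodes = ["/dev/kfd"] + sorted(glob.glob("/dev/dri/renderD*"))
    for path, status in check_device_access(nodes).items():
        print(f"{path}: {status}")
```

If a node shows "no access", checking the user's group membership (typically `video` and `render` on Ubuntu 20.04) with `groups` is a reasonable next step.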

JStrbg commented 2 years ago

I have the same issue on ROCm 4.5.2, as well as on all the earlier versions I've tried:

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.2 AMD-APP (3361.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               0

The motherboard is set to use PCIe 3.0 on the slot of the 5700 XT GPU. The motherboard is a P8Z77-V DELUXE.

From dmesg: kfd kfd: amdgpu: skipped device 1102:731f, PCI rejects atomics 142<145
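
The dmesg line above says that kfd skipped the GPU because the PCIe path rejected atomics, which would explain a platform with zero devices. A minimal sketch for spotting such lines in saved dmesg output follows; the regular expression is based only on the single message quoted here, and `find_atomics_rejections` is a hypothetical helper name.

```python
import re

# Pattern derived from the one message quoted in this thread; other kernel
# versions may word the message differently.
SKIP_RE = re.compile(
    r"skipped device ([0-9a-f]{4}):([0-9a-f]{4}), PCI rejects atomics",
    re.IGNORECASE,
)


def find_atomics_rejections(dmesg_text):
    """Return a list of (vendor_id, device_id) pairs that kfd skipped."""
    return [match.groups() for match in SKIP_RE.finditer(dmesg_text)]


sample = "kfd kfd: amdgpu: skipped device 1102:731f, PCI rejects atomics 142<145"
print(find_atomics_rejections(sample))
```

To check a live system, the text could come from `dmesg` output saved to a file and read into a string.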

myhrmans commented 1 year ago

Same issue here

uname -a

Linux test-7 5.15.85-1-pve #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal

clinfo

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3513.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0
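
The clinfo reports in this thread come in two layouts (with and without a colon after the field name). For comparing reports, a small parser that extracts the device count per platform can help. This is a sketch that assumes only the two layouts pasted in this thread; `device_counts` is a hypothetical helper name.

```python
import re

# Matches "Number of devices: 0" and "Number of devices 0", one per platform.
DEVICES_RE = re.compile(r"^\s*Number of devices:?\s+(\d+)\s*$", re.MULTILINE)


def device_counts(clinfo_text):
    """Return the device count reported for each platform, in order."""
    return [int(n) for n in DEVICES_RE.findall(clinfo_text)]


sample = """Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0
"""
counts = device_counts(sample)
print(counts, "-> no devices detected" if not any(counts) else "-> devices found")
```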

AndreV84 commented 1 year ago

Hello, is there any solution for Ubuntu 20.04 with kernel 5.15?

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3486.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

----------------------------------------------------

I tried installing the AMD drivers. Some installers work, others result in errors. The output listed above is from the last attempt, which was installed with:

sudo dpkg -i amdgpu-install_22.20.50200-1_all.deb

RadeonOpenCompute / ROCm_Documentation

No OpenCL Devices are detected #111