gfx1030 does not show up as OpenCL device

FluxusMagna commented 2 years ago

The card shows up in lshw and rocminfo, but clinfo shows 0 devices for the AMD platform. I previously had a discussion (https://github.com/rocm-arch/rocm-arch/discussions/768) with the Arch Linux package maintainers (@acxz), and we concluded that the problem seems to involve upstream code.

clinfo output:

```
$ /opt/rocm/bin/clinfo
Number of platforms: 3
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 3.0 CUDA 11.6.127
  Platform Name: NVIDIA CUDA
  Platform Vendor: NVIDIA Corporation
  Platform Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.1 LINUX
  Platform Name: Intel(R) CPU Runtime for OpenCL(TM) Applications
  Platform Vendor: Intel(R) Corporation
  Platform Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.1 AMD-APP (3423.0)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback

  Platform Name: NVIDIA CUDA
Number of devices: 1
  Device Type: CL_DEVICE_TYPE_GPU
  Vendor ID: 10deh
  Max compute units: 13
  Max work items dimensions: 3
    Max work items[0]: 1024
    Max work items[1]: 1024
    Max work items[2]: 64
  Max work group size: 1024
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 1
  Native vector width char: 1
  Native vector width short: 1
  Native vector width int: 1
  Native vector width long: 1
  Native vector width float: 1
  Native vector width double: 1
  Max clock frequency: 772Mhz
  Address bits: 64
  Max memory allocation: 2128281600
  Image support: Yes
  Max number of images read arguments: 256
  Max number of images write arguments: 16
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 4096
  Max image 3D height: 4096
  Max image 3D depth: 4096
  Max samplers within kernel: 32
  Max size of kernel argument: 4352
  Alignment (bits) of base address: 4096
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: Yes
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: Read/Write
  Cache line size: 128
  Cache size: 638976
  Global memory size: 8513126400
  Constant buffer size: 65536
  Max number of constant args: 9
  Local memory type: Scratchpad
  Local memory size: 49152
  Max pipe arguments: 0
  Max pipe active reservations: 0
  Max pipe packet size: 0
  Max global variable size: 0
  Max global variable preferred total size: 0
  Max read/write image args: 0
  Max on device events: 0
  Queue on device max size: 0
  Max on device queues: 0
  Queue on device preferred size: 0
  SVM capabilities:
    Coarse grain buffer: Yes
    Fine grain buffer: No
    Fine grain system: No
    Atomics: No
  Preferred platform atomic alignment: 0
  Preferred global atomic alignment: 0
  Preferred local atomic alignment: 0
  Kernel Preferred work group size multiple: 32
  Error correction support: 0
  Unified memory for Host and Device: 0
  Profiling timer resolution: 1000
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue on Host properties:
    Out-of-Order: Yes
    Profiling : Yes
  Queue on Device properties:
    Out-of-Order: No
    Profiling : No
  Platform ID: 0x562e4f9968c0
  Name: Quadro M4000
  Vendor: NVIDIA Corporation
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 510.60.02
  Profile: FULL_PROFILE
  Version: OpenCL 3.0 CUDA
  Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd

  Platform Name: Intel(R) CPU Runtime for OpenCL(TM) Applications
Number of devices: 1
  Device Type: CL_DEVICE_TYPE_CPU
  Vendor ID: 8086h
  Max compute units: 80
  Max work items dimensions: 3
    Max work items[0]: 8192
    Max work items[1]: 8192
    Max work items[2]: 8192
  Max work group size: 8192
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 1
  Native vector width char: 32
  Native vector width short: 16
  Native vector width int: 8
  Native vector width long: 4
  Native vector width float: 8
  Native vector width double: 4
  Max clock frequency: 2200Mhz
  Address bits: 64
  Max memory allocation: 16858395648
  Image support: Yes
  Max number of images read arguments: 480
  Max number of images write arguments: 480
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 480
  Max size of kernel argument: 3840
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: Yes
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: No
    Round to +ve and infinity: No
    IEEE754-2008 fused multiply-add: No
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 262144
  Global memory size: 67433582592
  Constant buffer size: 131072
  Max number of constant args: 480
  Local memory type: Global
  Local memory size: 32768
  Max pipe arguments: 16
  Max pipe active reservations: 3276
  Max pipe packet size: 1024
  Max global variable size: 65536
  Max global variable preferred total size: 65536
  Max read/write image args: 480
  Max on device events: 4294967295
  Queue on device max size: 4294967295
  Max on device queues: 4294967295
  Queue on device preferred size: 4294967295
  SVM capabilities:
    Coarse grain buffer: Yes
    Fine grain buffer: Yes
    Fine grain system: Yes
    Atomics: Yes
  Preferred platform atomic alignment: 64
  Preferred global atomic alignment: 64
  Preferred local atomic alignment: 0
  Kernel Preferred work group size multiple: 128
  Error correction support: 0
  Unified memory for Host and Device: 1
  Profiling timer resolution: 1
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: Yes
  Queue on Host properties:
    Out-of-Order: Yes
    Profiling : Yes
  Queue on Device properties:
    Out-of-Order: Yes
    Profiling : Yes
  Platform ID: 0x562e4f9661e0
  Name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Vendor: Intel(R) Corporation
  Device OpenCL C version: OpenCL C 2.0
  Driver version: 18.1.0.0920
  Profile: FULL_PROFILE
  Version: OpenCL 2.1 (Build 0)
  Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint

  Platform Name: AMD Accelerated Parallel Processing
Number of devices: 0
```
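For long listings like the one above, a small script can summarize which platforms report no devices. The following is only a sketch: the function name `devices_per_platform` is hypothetical, and it assumes the output keeps the `Platform Name:` / `Number of devices:` labels seen in this clinfo listing. The sample text is abbreviated from the output above.

```python
import re


def devices_per_platform(clinfo_text):
    """Map each platform name to the device count that clinfo reports for it."""
    counts = {}
    current = None
    for raw in clinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"Platform Name:\s*(.+)", line)
        if m:
            # Remember the most recent platform; the count line follows it.
            current = m.group(1).strip()
            continue
        m = re.match(r"Number of devices:\s*(\d+)", line)
        if m and current is not None:
            counts[current] = int(m.group(1))
    return counts


# Abbreviated sample taken from the listing above.
sample = """\
Number of platforms: 3
  Platform Name: NVIDIA CUDA
  Platform Name: AMD Accelerated Parallel Processing

  Platform Name: NVIDIA CUDA
Number of devices: 1
  Platform Name: AMD Accelerated Parallel Processing
Number of devices: 0
"""

counts = devices_per_platform(sample)
empty = [name for name, n in counts.items() if n == 0]
print("platforms without devices:", empty)
```

To use it on a real system, the text could come from `subprocess.run(["clinfo"], capture_output=True, text=True).stdout` instead of the sample string.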

rocminfo output:

```
$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Uuid: CPU-XX
  Marketing Name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 3600
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 40
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 32838116(0x1f511e4) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 32838116(0x1f511e4) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 32838116(0x1f511e4) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
*******
Agent 2
*******
  Name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Uuid: CPU-XX
  Marketing Name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 1
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 3600
  BDFID: 0
  Internal Node ID: 1
  Compute Unit: 40
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 33014992(0x1f7c4d0) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 33014992(0x1f7c4d0) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 33014992(0x1f7c4d0) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
*******
Agent 3
*******
  Name: gfx1030
  Uuid: GPU-XX
  Marketing Name: AMD Radeon RX 6800 XT
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 2
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
    L2: 4096(0x1000) KB
    L3: 131072(0x20000) KB
  Chip ID: 29631(0x73bf)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 2575
  BDFID: 1280
  Internal Node ID: 2
  Compute Unit: 72
  SIMDs per CU: 2
  Shader Engines: 8
  Shader Arrs. per Eng.: 2
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 32(0x20)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 32(0x20)
  Max Work-item Per CU: 1024(0x400)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 16760832(0xffc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx1030
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*** Done ***
```

The likely problem was traced back to

https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/bbdc87e08b322d349f82bdd7575c8ce94d31d276/tools/clinfo/clinfo.cpp#L124

and then

https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/bbdc87e08b322d349f82bdd7575c8ce94d31d276/tools/clinfo/clinfo.cpp#L115

vsytch commented 2 years ago

Make sure you have the latest ROCm installation. gfx1030 support is enabled in the 5.1 branch (see https://github.com/ROCm-Developer-Tools/ROCclr/blob/rocm-5.1.x/device/device.cpp#L183).

FluxusMagna commented 2 years ago

I am using 5.1.1 so that should not be the issue.

acxz commented 2 years ago

@vsytch it would be helpful if you could point us to the logic of platform.getDevices so we can trace this down. Specifically, how and where in the code are the devices queried from the hardware?

Mystro256 commented 2 years ago

Yeah I noticed this too. My Raven APU shows up, but I don't see the gfx1030.

Mystro256 commented 2 years ago

Can you run:

AMD_LOG_LEVEL=4 clinfo
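If the log is long, it may be easier to capture it to a file and attach that. A minimal sketch of doing so follows; the helper names `verbose_env` and `run_clinfo_verbose` and the log file name are hypothetical, and only the `AMD_LOG_LEVEL=4` setting comes from the command above. Standard output and standard error are merged so that nothing is lost regardless of which stream the runtime logs to.

```python
import os
import shutil
import subprocess


def verbose_env(level=4, base=None):
    """Return a copy of the environment with AMD_LOG_LEVEL set."""
    env = dict(os.environ if base is None else base)
    env["AMD_LOG_LEVEL"] = str(level)
    return env


def run_clinfo_verbose(clinfo="clinfo", log_path="clinfo-debug.log"):
    """Run clinfo with logging enabled and save everything it prints."""
    result = subprocess.run(
        [clinfo],
        env=verbose_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams into one log
        text=True,
        errors="replace",
        check=False,
    )
    with open(log_path, "w") as fh:
        fh.write(result.stdout)
    return result.returncode


if __name__ == "__main__":
    # Only run when a clinfo binary is actually available.
    if shutil.which("clinfo"):
        print("exit code:", run_clinfo_verbose())
    else:
        print("clinfo not found on PATH")
```

If the ROCm copy of clinfo is not on `PATH`, the full path used earlier in this thread (`/opt/rocm/bin/clinfo`) can be passed as the `clinfo` argument.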

I just realised that my install of the compiler stack (clang/comgr/llvm) was messed up: it was somehow trying to use comgr with an older clang/llvm, so it obviously failed. There might be an issue with the rocm-arch packages. By default comgr should statically link against clang, but it is possible to link it dynamically, which is what I did.
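One way to check for this kind of mismatch is to look at which LLVM/clang shared libraries comgr resolves at load time. The sketch below parses `ldd`-style output. The function name `dynamic_llvm_deps` is hypothetical, the sample text is made up for illustration and is not output from a real system, and the library location in the usage note is an assumption that may differ per distribution.

```python
import re


def dynamic_llvm_deps(ldd_text):
    """Return (library, resolved path) pairs for LLVM/clang entries in ldd output."""
    deps = []
    pattern = re.compile(r"\s*(\S+)\s+=>\s+(.*?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$")
    for line in ldd_text.splitlines():
        m = pattern.match(line)
        if m and re.match(r"lib(LLVM|clang)", m.group(1)):
            deps.append((m.group(1), m.group(2)))
    return deps


# Illustrative sample only; not taken from a real machine.
sample = (
    "\tlinux-vdso.so.1 (0x00007ffd00000000)\n"
    "\tlibLLVM-14.so => /usr/lib/libLLVM-14.so (0x00007f0000000000)\n"
    "\tlibclang-cpp.so.14 => /usr/lib/libclang-cpp.so.14 (0x00007f0000100000)\n"
    "\tlibc.so.6 => /usr/lib/libc.so.6 (0x00007f0000200000)\n"
)

for name, path in dynamic_llvm_deps(sample):
    print(name, "->", path)
```

An empty result would suggest that LLVM/clang are linked statically. Any entries show which clang/llvm copy is picked up at run time. Real input could come from running `ldd` on the comgr library, for example `/opt/rocm/lib/libamd_comgr.so` (an assumed path).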

FluxusMagna commented 2 years ago

That sounds very plausible. I got it to work with repackaged Ubuntu packages, so it is likely something related to the arch packages. I forgot to post that here though. At the moment I have no quick way to test the hypothesis as the machine is currently in use for relatively urgent work, but you can close the issue if you see fit.

ROCm / ROCm-OpenCL-Runtime

gfx1030 does not show up as OpenCL device #144