ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

OpenCL 2.0 support for Ellesmere (RX580) #127

Open sudden6 opened 3 years ago

sudden6 commented 3 years ago

Hi all,

is there OpenCL 2.0 support for the Ellesmere (RX580) device? Specifically I want to use enqueue_kernel(...) in my OpenCL kernel.

The card is advertised on the AMD Website as "OpenCL 2.0", but clinfo gives me the following (confusing) output:

Device Version                                  OpenCL 1.2 
Driver Version                                  3186.0 (HSA1.1,LC)
Device OpenCL C Version                         OpenCL C 2.0 
Device Type                                     GPU
Device Board Name (AMD)                         Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]

Additionally, when I try to compile an OpenCL kernel with -cl-std=CL2.0, the driver gives me error code -66 (CL_INVALID_COMPILER_OPTIONS), which according to the StreamHPC listing means "the compiler options specified by options are invalid." But clinfo says OpenCL C 2.0 should be supported (at least in my understanding).
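The two version lines in that clinfo output describe different things: `Device Version` is the OpenCL level the driver reports for the device, while `Device OpenCL C Version` is the kernel language level the compiler reports. Below is a minimal Python sketch, assuming the column-aligned clinfo format shown above, that extracts both so the mismatch is easy to spot. The function names are hypothetical. Treating a device version of 2.0 or higher as the precondition for device-side enqueue is an assumption, based on that feature having been introduced with OpenCL 2.0.

```python
import re

def parse_clinfo_versions(clinfo_text):
    """Extract (major, minor) tuples for the device and OpenCL C versions."""
    patterns = {
        "device": r"^\s*Device Version\s+OpenCL (\d+)\.(\d+)",
        "opencl_c": r"^\s*Device OpenCL C Version\s+OpenCL C (\d+)\.(\d+)",
    }
    versions = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, clinfo_text, flags=re.MULTILINE)
        # None marks a line that was not found in the text.
        versions[key] = (int(match.group(1)), int(match.group(2))) if match else None
    return versions

def advertises_opencl_2_runtime(versions):
    """Assumption: OpenCL 2.0 runtime features need a device version >= 2.0."""
    device = versions.get("device")
    return device is not None and device >= (2, 0)

# Lines copied from the clinfo output quoted in the post.
SAMPLE = """\
  Device Version                                  OpenCL 1.2
  Driver Version                                  3186.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0
"""

versions = parse_clinfo_versions(SAMPLE)
print(versions, advertises_opencl_2_runtime(versions))
```

For the output in the post, the device version parses as 1.2 and the OpenCL C version as 2.0, so the check returns False even though the compiler reports a 2.0 language level.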

To summarize:

gandryey commented 3 years ago
sudden6 commented 3 years ago

Thank you for your response!

I will try the amdgpu-pro stack with PAL and see if it works there.

sudden6 commented 3 years ago

I tried with amdgpu-pro version 20.40 and there's no support for OpenCL 2.0 on my GPU.

When I do amdgpu-pro-install --opencl=pal I get no devices with clinfo and when I try with amdgpu-pro-install --opencl=legacy,pal it detects my GPU, but doesn't show OpenCL 2.0 as supported.

Is there a specific version I need? Is there a better place to ask about amdgpu-pro problems?

It seems to work in Windows, but if possible I'd prefer to develop on Linux.

gandryey commented 3 years ago

There are some complications around Linux support, especially with the older generations.

clapbr commented 3 years ago

> There are some complications around Linux support, especially with the older generations.
>
> * Try "export GPU_ENABLE_PAL=1"
>
> * If that won't work, then try just PAL installation: amdgpu-pro-install --opencl=pal and then "export GPU_ENABLE_PAL=1"
>   I hope that will help.

Tried that with 19.50 up to 20.40. I was close to getting it working, but clinfo segfaults halfway through its output, although it does recognize my RX580 with a Device Version of OpenCL 2.0. Other drivers (pro OCL legacy/orca and ROCm) work fine, but only with OpenCL 1.2, so it seems we're SOL with regard to OCL2.

Where did you find out about the GPU_ENABLE_PAL env var? I didn't find any reference to it.

gandryey commented 3 years ago

https://github.com/ROCm-Developer-Tools/ROCclr/blob/5cefcaf62893fcd86c8feed6bb1ebb84850fcd2f/utils/flags.hpp#L146
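The suggestion discussed above amounts to setting `GPU_ENABLE_PAL=1` in the environment of whichever process loads the OpenCL runtime. Below is a minimal Python sketch, assuming `clinfo` is on `PATH`, that runs clinfo with the variable set and reports whether it exited normally or was killed by a signal, such as the segfault described earlier in the thread. The helper names are hypothetical.

```python
import os
import signal
import subprocess

def pal_environment(base_env=None):
    """Return a copy of the environment with GPU_ENABLE_PAL=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_ENABLE_PAL"] = "1"
    return env

def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode

def run_clinfo_with_pal(command=("clinfo",)):
    """Run clinfo with PAL enabled; return (status, captured stdout)."""
    result = subprocess.run(list(command), env=pal_environment(),
                            capture_output=True, text=True)
    return describe_exit(result.returncode), result.stdout
```

Calling `run_clinfo_with_pal()` returns the status string together with whatever clinfo printed before it stopped, so a crash partway through still leaves the partial output available for inspection.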

MathiasMagnus commented 2 years ago

This custom build operates OpenCL fine, I tested with OpenCL-OpenGL interop samples as well. I've absolutely no idea why APT packages don't sport a build with the off-by-default Polaris support enabled.

Ubuntu 20.04 with ROCm 5.0 works like this:

```
mate@GL702ZC:~$ sudo amdgpu-install --usecase=graphics,hiplibsdk,opencl --no-dkms --no-32
mate@GL702ZC:~$ wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/rocm-opencl_2.0.0-local_amd64.deb
mate@GL702ZC:~$ sudo dpkg -i ./rocm-opencl_2.0.0-local_amd64.deb
mate@GL702ZC:~$ /opt/rocm/bin/rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr   SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    65.0c  17.011W  608Mhz  2000Mhz  0%   auto  68.0W   19%    0%
================================================================================
WARNING: One or more commands failed
============================= End of ROCm SMI Log ==============================
mate@GL702ZC:~$ /opt/rocm/bin/rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name: AMD Ryzen 7 1700 Eight-Core Processor
  Uuid: CPU-XX
  Marketing Name: AMD Ryzen 7 1700 Eight-Core Processor
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 3000
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 16
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 32805700(0x1f49344) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 32805700(0x1f49344) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 32805700(0x1f49344) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
*******
Agent 2
*******
  Name: gfx803
  Uuid: GPU-XX
  Marketing Name: Radeon RX 580 Series
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 4096(0x1000)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
  Chip ID: 26591(0x67df)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 1077
  BDFID: 3072
  Internal Node ID: 1
  Compute Unit: 36
  SIMDs per CU: 4
  Shader Engines: 4
  Shader Arrs. per Eng.: 1
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 64(0x40)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 40(0x28)
  Max Work-item Per CU: 2560(0xa00)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 4194304(0x400000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx803
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*** Done ***
mate@GL702ZC:~$ /opt/rocm/opencl/bin/clinfo
Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.2 AMD-APP.dbg (3406.0)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback

  Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
  Device Type: CL_DEVICE_TYPE_GPU
  Vendor ID: 1002h
  Board name: Radeon RX 580 Series
  Device Topology: PCI[ B#12, D#0, F#0 ]
  Max compute units: 36
  Max work items dimensions: 3
    Max work items[0]: 1024
    Max work items[1]: 1024
    Max work items[2]: 1024
  Max work group size: 256
  Preferred vector width char: 4
  Preferred vector width short: 2
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 1
  Native vector width char: 4
  Native vector width short: 2
  Native vector width int: 1
  Native vector width long: 1
  Native vector width float: 1
  Native vector width double: 1
  Max clock frequency: 1077Mhz
  Address bits: 64
  Max memory allocation: 3650722200
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 8
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 16384
  Max image 3D height: 16384
  Max image 3D depth: 8192
  Max samplers within kernel: 26591
  Max size of kernel argument: 1024
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: No
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 16384
  Global memory size: 4294967296
  Constant buffer size: 3650722200
  Max number of constant args: 8
  Local memory type: Scratchpad
  Local memory size: 65536
  Max pipe arguments: 16
  Max pipe active reservations: 16
  Max pipe packet size: 3650722200
  Max global variable size: 3650722200
  Max global variable preferred total size: 4294967296
  Max read/write image args: 64
  Max on device events: 1024
  Queue on device max size: 8388608
  Max on device queues: 1
  Queue on device preferred size: 262144
  SVM capabilities:
    Coarse grain buffer: Yes
    Fine grain buffer: Yes
    Fine grain system: No
    Atomics: No
  Preferred platform atomic alignment: 0
  Preferred global atomic alignment: 0
  Preferred local atomic alignment: 0
  Kernel Preferred work group size multiple: 64
  Error correction support: 0
  Unified memory for Host and Device: 0
  Profiling timer resolution: 1
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue on Host properties:
    Out-of-Order: No
    Profiling : Yes
  Queue on Device properties:
    Out-of-Order: Yes
    Profiling : Yes
  Platform ID: 0x7f3cc2c2bc80
  Name: gfx803
  Vendor: Advanced Micro Devices, Inc.
  Device OpenCL C version: OpenCL C 2.0
  Driver version: 3406.0 (HSA1.1,LC)
  Profile: FULL_PROFILE
  Version: OpenCL 1.2
  Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
```

FelixSchwarz commented 2 years ago

> I've absolutely no idea why APT packages don't sport a build with the off-by-default Polaris support enabled.

Usually distro packagers try to follow upstream, as they don't have enough resources to support otherwise untested configurations. Also, there are some open rocBLAS bugs for gfx803 (aka "Polaris") which won't get fixed (at least not by AMD):

Granted, these bugs don't affect OpenCL but you might get the idea why distro packagers are not too keen on enabling otherwise unsupported configurations.

That being said, at least in Fedora we could enable OpenCL for Polaris (once we figure out how to package it in a sane way) if we had trusted users who could run test builds and some test suites to ensure everything keeps working. You could consider joining the debian-ai mailing list to provide feedback once they have finished packaging OpenCL for Debian.

MathiasMagnus commented 2 years ago

This "rant" wasn't aimed at Debian/Canonical or other distro maintainers. AMD offers their own APT repo which is their recommended way of installing their software. Note that they don't endorse using AUR or other 3rd party packaging efforts but documentation suggests using their repo on the supported platforms. Period.

gfx803 is still in a "partial support" state, but it isn't clear whether partial means things may not work as intended or if we disable entire APIs due to lack of testing/assets towards bug fixes.

All I'm saying is that this "partial support" technique doesn't seem to do much service to users. It's weird how AMD says they don't test the OpenCL runtime on Polaris.

That said, I can run occasional tests, but I don't have the bandwidth to do much more than that.