ROCm / aomp

AOMP is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples.
https://github.com/ROCm/aomp
Apache License 2.0

Failure to load binary image ... Error in hsa_executable_load_code_object: HSA_STATUS_ERROR_INCOMPATIBLE_ARGUMENTS: The arguments passed to a functions are not compatible #888

Open woodard opened 7 months ago

woodard commented 7 months ago

Problem Description

This problem actually happens on ROCm 6.1 but that was not available in the dropdown list.

When I try to compile for both the GFX906 and the GFX1100 it looks like the runtime tries to load the wrong GPU code in many cases.

[ben@darkstar Audit]$ make clean
rm -rf *.o *.x
[ben@darkstar Audit]$ make ARCH=gfx906 CC=/opt/rocm-6.1.0/bin/amdclang++
ARCH = gfx906
/opt/rocm-6.1.0/bin/amdclang++ -g -O2 -fopenmp --offload-arch=gfx906 ddot.c -O2 -o ddot.amdclang++.x
clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
[ben@darkstar Audit]$ ./ddot.amdclang++.x
Input vector length N:
20
Using N = 20
Success! Result = 4.940000e+03
Initialization took 0.000 seconds.
Computation took 0.000 seconds.

[ben@darkstar Audit]$ make ARCH=gfx1100 CC=/opt/rocm-6.1.0/bin/amdclang++
ARCH = gfx1100
/opt/rocm-6.1.0/bin/amdclang++ -g -O2 -fopenmp --offload-arch=gfx1100 ddot.c -O2 -o ddot.amdclang++.x
clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
[ben@darkstar Audit]$ ./ddot.amdclang++.x
Input vector length N:
20
Using N = 20
"PluginInterface" error: Failure to load binary image 0x13d46e0 on device 0: Error in hsa_executable_load_code_object: HSA_STATUS_ERROR_INCOMPATIBLE_ARGUMENTS: The arguments passed to a functions are not compatible.
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
ddot.c:32:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)

The same failure happens when building for both architectures, in either order: ARCH=gfx1100,gfx906 and ARCH=gfx906,gfx1100.

I haven't fully debugged the problem but it seems like the logic to select the code segment for the GPU gets confused when there are two different GPUs installed.
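To make the suspected selection step concrete, here is a minimal sketch of per-device image selection. It is not libomptarget's code. `struct device_image`, `processor_len`, and `select_image` are hypothetical names, and the sketch assumes each embedded image is tagged with the string given to `--offload-arch` and that each device reports a target ID such as `gfx906:sramecc+:xnack-` (the part after the triple in the rocminfo ISA name). It compares only the processor name and ignores feature flags such as `sramecc` and `xnack`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one embedded device image and the arch it was built for. */
struct device_image {
    const char *arch;  /* e.g. "gfx906", as passed to --offload-arch */
    const void *data;  /* image bytes; unused in this sketch */
};

/* Length of the processor part of a target ID, i.e. everything before the first ':'. */
static size_t processor_len(const char *target_id) {
    const char *colon = strchr(target_id, ':');
    return colon ? (size_t)(colon - target_id) : strlen(target_id);
}

/* Return the index of the first image whose processor name equals the
 * device's processor name, or -1 if no image was built for this device. */
int select_image(const struct device_image *images, int n,
                 const char *device_target_id) {
    assert(device_target_id != NULL);
    size_t dev_len = processor_len(device_target_id);
    for (int i = 0; i < n; i++) {
        size_t img_len = processor_len(images[i].arch);
        if (img_len == dev_len &&
            strncmp(images[i].arch, device_target_id, dev_len) == 0)
            return i;
    }
    return -1;
}
```

The point of the sketch is that the lookup is done once per device, so in a mixed system each device ends up with its own image, and a device with no matching image is skipped rather than handed an incompatible one.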

I found old bug reports of people having a similar problem all the way back to ROCm 1.0 when they had mixed GPU cards.

I believe that this is ultimately an important test case to support. Today a user who has both an APU and a dGPU can start up a small workload and run into problems. Where we would like to get to is having both units work on it.

I brought this up in https://github.com/ROCm/ROCR-Runtime/issues/198 thinking that it was a problem with the runtime but after troubleshooting it a bit there, I think that the problem is actually in the AMD OpenMP implementation for ROCm within clang.

By limiting the devices available to the program, we were able to make it run.

[ben@darkstar ddot]$ make ARCH=gfx906,gfx1100 CC=/opt/rocm-6.1.0/bin/amdclang
ARCH = gfx906,gfx1100
/opt/rocm-6.1.0/bin/amdclang -g -O2 -fopenmp --offload-arch=gfx906,gfx1100 ddot.c -O2 -o ddot.amdclang.x
[ben@darkstar ddot]$ ./ddot.amdclang.x
Input vector length N:
20
Using N = 20
"PluginInterface" error: Failure to load binary image 0x10b46f0 on device 0: Error in hsa_executable_load_code_object: HSA_STATUS_ERROR_INCOMPATIBLE_ARGUMENTS: The arguments passed to a functions are not compatible.
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
ddot.c:32:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)
[ben@darkstar ddot]$ ROCR_VISIBLE_DEVICES=0 !!
ROCR_VISIBLE_DEVICES=0 ./ddot.amdclang.x
Input vector length N:
20
Using N = 20
Success! Result = 4.940000e+03
Initialization took 0.000 seconds.
Computation took 0.000 seconds.
[ben@darkstar ddot]$ ROCR_VISIBLE_DEVICES=1 ./ddot.amdclang.x
Input vector length N:
20
Using N = 20
Success! Result = 4.940000e+03
Initialization took 0.000 seconds.
Computation took 0.000 seconds.

Since the code uses only OpenMP and contains no HSA code of its own, the problem must be in what the compiler's OpenMP implementation generates.
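A small per-device probe can help narrow down which device number fails without relying on ROCR_VISIBLE_DEVICES. The following is a sketch only: `probe_devices` is a hypothetical helper, not part of ddot.c, and it uses just the standard OpenMP routines `omp_get_num_devices` and `omp_is_initial_device`. The `_OPENMP` guard lets it compile without `-fopenmp`, in which case it reports no devices.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Run a trivial target region on every offload device in turn.
 * Returns the number of devices on which the region really ran. */
int probe_devices(void) {
    int ran_on_device = 0;
#ifdef _OPENMP
    int n = omp_get_num_devices();
    for (int d = 0; d < n; d++) {
        int on_host = 1;
        /* omp_is_initial_device() is nonzero when executing on the host. */
        #pragma omp target device(d) map(tofrom: on_host)
        { on_host = omp_is_initial_device(); }
        printf("device %d: %s\n", d,
               on_host ? "fell back to host" : "ran on device");
        if (!on_host)
            ran_on_device++;
    }
#endif
    return ran_on_device;
}
```

Called from `main` in a binary built with `--offload-arch=gfx906,gfx1100`, the loop would presumably abort at the first device whose image cannot be loaded, and the last line printed before the abort identifies the device that preceded it.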

Operating System

NAME="Red Hat Enterprise Linux" VERSION="9.3 (Plow)"

CPU

model name : AMD Ryzen Threadripper PRO 5955WX 16-Cores

GPU

AMD Radeon Pro VII, AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

aomp

Steps to Reproduce

see above

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

[ben@darkstar try1]$ /opt/rocm/bin/rocminfo --support
ROCk module version 6.3.6 is loaded
HSA System Attributes

Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

HSA Agents

Agent 1

Name: AMD Ryzen Threadripper PRO 5955WX 16-Cores
Uuid: CPU-XX
Marketing Name: AMD Ryzen Threadripper PRO 5955WX 16-Cores
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
  L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4000
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: FINE GRAINED
    Size: 131192084(0x7d1d514) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:4KB
    Alloc Alignment: 4KB
    Accessible by all: TRUE
  Pool 2
    Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
    Size: 131192084(0x7d1d514) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:4KB
    Alloc Alignment: 4KB
    Accessible by all: TRUE
  Pool 3
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 131192084(0x7d1d514) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:4KB
    Alloc Alignment: 4KB
    Accessible by all: TRUE
ISA Info:

Agent 2

Name: gfx906
Uuid: GPU-ac4210e173c71c04
Marketing Name: AMD Radeon (TM) Pro VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
  L1: 16(0x10) KB
  L2: 8192(0x2000) KB
Chip ID: 26273(0x66a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 25344
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
  x 1024(0x400)
  y 1024(0x400)
  z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
  x 4294967295(0xffffffff)
  y 4294967295(0xffffffff)
  z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 469
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 16760832(0xffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 2
    Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
    Size: 16760832(0xffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 3
    Segment: GROUP
    Size: 64(0x40) KB
    Allocatable: FALSE
    Alloc Granule: 0KB
    Alloc Recommended Granule:0KB
    Alloc Alignment: 0KB
    Accessible by all: FALSE
ISA Info:
  ISA 1
    Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
    Machine Models: HSA_MACHINE_MODEL_LARGE
    Profiles: HSA_PROFILE_BASE
    Default Rounding Mode: NEAR
    Default Rounding Mode: NEAR
    Fast f16: TRUE
    Workgroup Max Size: 1024(0x400)
    Workgroup Max Size per Dimension:
      x 1024(0x400)
      y 1024(0x400)
      z 1024(0x400)
    Grid Max Size: 4294967295(0xffffffff)
    Grid Max Size per Dimension:
      x 4294967295(0xffffffff)
      y 4294967295(0xffffffff)
      z 4294967295(0xffffffff)
    FBarrier Max Size: 32

Agent 3

Name: gfx1100
Uuid: GPU-32718d9ad8c12635
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
  L1: 32(0x20) KB
  L2: 6144(0x1800) KB
  L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2304
BDFID: 17152
Internal Node ID: 2
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
  x 1024(0x400)
  y 1024(0x400)
  z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
  x 4294967295(0xffffffff)
  y 4294967295(0xffffffff)
  z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 550
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 25149440(0x17fc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 2
    Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
    Size: 25149440(0x17fc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 3
    Segment: GROUP
    Size: 64(0x40) KB
    Allocatable: FALSE
    Alloc Granule: 0KB
    Alloc Recommended Granule:0KB
    Alloc Alignment: 0KB
    Accessible by all: FALSE
ISA Info:
  ISA 1
    Name: amdgcn-amd-amdhsa--gfx1100
    Machine Models: HSA_MACHINE_MODEL_LARGE
    Profiles: HSA_PROFILE_BASE
    Default Rounding Mode: NEAR
    Default Rounding Mode: NEAR
    Fast f16: TRUE
    Workgroup Max Size: 1024(0x400)
    Workgroup Max Size per Dimension:
      x 1024(0x400)
      y 1024(0x400)
      z 1024(0x400)
    Grid Max Size: 4294967295(0xffffffff)
    Grid Max Size per Dimension:
      x 4294967295(0xffffffff)
      y 4294967295(0xffffffff)
      z 4294967295(0xffffffff)
    FBarrier Max Size: 32
Done

Additional Information

See https://github.com/ROCm/ROCR-Runtime/issues/198 for the HSA enumeration API and example code showing how to fix it.

woodard commented 7 months ago

ddot.tar.gz Example code - this affects all the OpenMP test programs that I've tried since I added the 2nd GPU to my system.

gregrodgers commented 7 months ago

Try setting ROCR_VISIBLE_DEVICES=0 to use first GPU or ROCR_VISIBLE_DEVICES=1 for the 2nd GPU.

woodard commented 7 months ago

I tried that. Yes that works. Note that I mentioned that in my notes starting this issue. Right after “By limiting the devices available to the program, we were able to make it run.”

My expectation is that an OpenMP program would at least try to launch the correct code on the correct GPU. I think it would be better if it launched on the most powerful GPU or on an unused GPU. And it would be great if it launched on all available GPUs.
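As an illustration of the "all available GPUs" case, here is a sketch of a dot product split across every device OpenMP reports. It is not the code from ddot.tar.gz, and `ddot_all_devices` is a hypothetical name. It assumes the runtime can load the right image on each device, which is exactly what fails in this issue. For simplicity the chunks run one after another, so this shows the partitioning but not concurrent execution.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Dot product of x and y, with the index range split evenly across devices.
 * Falls back to a single host chunk when no offload device is available. */
double ddot_all_devices(const double *x, const double *y, int n) {
    int ndev = 1;
#ifdef _OPENMP
    ndev = omp_get_num_devices();
    if (ndev < 1)
        ndev = 1;  /* device(0) is then the host device */
#endif
    double total = 0.0;
    for (int d = 0; d < ndev; d++) {
        /* Chunk [begin, end) handled by device d. */
        int begin = (int)((long long)n * d / ndev);
        int end = (int)((long long)n * (d + 1) / ndev);
        double partial = 0.0;
        #pragma omp target teams distribute parallel for reduction(+:partial) \
            device(d) map(to: x[begin:end - begin], y[begin:end - begin]) \
            map(tofrom: partial)
        for (int i = begin; i < end; i++)
            partial += x[i] * y[i];
        total += partial;
    }
    return total;
}
```

Built with `--offload-arch=gfx906,gfx1100`, a program like this needs the gfx906 image on one device and the gfx1100 image on the other, so it would exercise the mixed-GPU path directly.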