[Issue]: CO lookup in fatbin should only fail when none of the GPUs have matching CO

GZGavinZhao commented 5 months ago

Problem Description

I noticed that on a multi-GPU system (in my case, a gfx90c iGPU and a gfx1032 dGPU), a fat binary must contain code objects for every GPU architecture present in order to run and produce correct output, even though most of the time I only want to run on one architecture. This is a particular problem for users with an integrated GPU, because binaries are rarely compiled for the iGPU: I have to set HIP_VISIBLE_DEVICES to expose only the dGPU every time I run a ROCm binary or a library such as PyTorch.

I believe the problem is with this line, where we set hip_status to hipErrorNoBinaryForGpu even if only one device has unmatched CO. We should only set hip_status to hipErrorNoBinaryForGpu if none of the devices have matching CO.
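To make the difference between the two policies concrete, here is a minimal sketch in plain C++. It is not the clr code: `Status`, `hasCodeObject`, `lookupStrict`, and `lookupLenient` are hypothetical names chosen for illustration, and the exact string comparison is a simplification that ignores target feature matching (for example xnack).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the HIP status codes; not the clr type.
enum class Status { Success, NoBinaryForGpu };

// True if the bundle contains a code object (CO) for the given ISA name.
// Assumption: exact string match, which is simpler than real ISA matching.
bool hasCodeObject(const std::vector<std::string>& bundleIsas,
                   const std::string& deviceIsa) {
  return std::find(bundleIsas.begin(), bundleIsas.end(), deviceIsa) !=
         bundleIsas.end();
}

// Behavior as described in this issue: one device without a CO is an error.
Status lookupStrict(const std::vector<std::string>& bundleIsas,
                    const std::vector<std::string>& deviceIsas) {
  for (const auto& isa : deviceIsas) {
    if (!hasCodeObject(bundleIsas, isa)) return Status::NoBinaryForGpu;
  }
  return Status::Success;
}

// Proposed behavior: an error only when no device has a matching CO.
Status lookupLenient(const std::vector<std::string>& bundleIsas,
                     const std::vector<std::string>& deviceIsas) {
  for (const auto& isa : deviceIsas) {
    if (hasCodeObject(bundleIsas, isa)) return Status::Success;
  }
  return Status::NoBinaryForGpu;
}
```

With a bundle containing only `amdgcn-amd-amdhsa--gfx1032` and devices reporting `amdgcn-amd-amdhsa--gfx1032` and `amdgcn-amd-amdhsa--gfx90c:xnack-`, the strict variant reports `NoBinaryForGpu` while the lenient variant reports `Success`.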

Operating System

Solus 4.5 Resilience

CPU

AMD Ryzen 7 5800H with Radeon Graphics

GPU

AMD Radeon RX6600M

ROCm Version

ROCm 6.0.0

ROCm Component

clr

Steps to Reproduce

The following assumes one has two GPUs with incompatible architectures. In my case, I have a gfx1032 (device index 0) and gfx90c (device index 1). Please adjust the arch names accordingly.

1. Use the official vectorAdd example.
2. Compile against only the architecture with device index 0: `hipcc --offload-arch=gfx1032 -o vectoradd_hip vectoradd_hip.cpp`
3. Run `AMD_LOG_LEVEL=1 ./vectoradd_hip`

Now I get the following error:


:1:hip_fatbin.cpp           :256 : 1271514880 us: [pid:7468  tid:0x7f6318e9ca80] Cannot find CO in the bundle for ISA: amdgcn-amd-amdhsa--gfx90c:xnack-
:1:hip_fatbin.cpp :109 : 1271514917 us: [pid:7468 tid:0x7f6318e9ca80] Missing CO for these ISAs -
:1:hip_fatbin.cpp :112 : 1271514929 us: [pid:7468 tid:0x7f6318e9ca80] amdgcn-amd-amdhsa--gfx90c:xnack-
:1:hip_fatbin.cpp :302 : 1271514949 us: [pid:7468 tid:0x7f6318e9ca80] Releasing COMGR data failed with status 2
System minor 3
System major 10
agent prop name AMD Radeon RX 6600M
hip Device prop succeeded
FAILED: 1048576 errors
:1:hip_fatbin.cpp :83 : 1271770378 us: [pid:7468 tid:0x7f6318e9ca80] All Unique FDs are closed

4. However, if I hide the GPU with device index 1 by running `HIP_VISIBLE_DEVICES=0 AMD_LOG_LEVEL=1 ./vectoradd_hip`, I get:

System minor 3
System major 10
agent prop name AMD Radeon RX 6600M
hip Device prop succeeded
PASSED!
:1:hip_fatbin.cpp :83 : 1584583122 us: [pid:7749 tid:0x7fe017517a80] All Unique FDs are closed


### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

<details>
<summary>rocminfo output</summary>

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========

Agent 1

Name: AMD Ryzen 7 5800H with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 5800H with Radeon Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3201
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 61576860(0x3ab969c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 61576860(0x3ab969c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 61576860(0x3ab969c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:

Agent 2

Name: gfx1032
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6600M
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 29695(0x73ff)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2720
BDFID: 768
Internal Node ID: 1
Compute Unit: 28
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 115
SDMA engine uCode:: 76
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1032
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32

Agent 3

Name: gfx90c
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 1024(0x400) KB
Chip ID: 5688(0x1638)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 2048
Internal Node ID: 2
Compute Unit: 8
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 469
SDMA engine uCode:: 40
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 4194304(0x400000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 4194304(0x400000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90c:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done



</details>

### Additional Information

_No response_

cjatin commented 5 months ago

As far as I remember, this change was added to fix some issues we had. Before it, the behavior was somewhat similar to what you are proposing.

The first thing a HIP program does is load the code objects in the ELF. If code objects are not present for all devices, the user can end up in a situation where hipGetDeviceCount reports two devices but kernels cannot be launched on one of them after selecting it with hipSetDevice. Several libraries do exactly this to maximize GPU throughput: they iterate over the available GPUs and offload work to each of them.
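The failure mode described above can be modeled without the HIP runtime. The following is a rough sketch under stated assumptions: `LenientRuntime`, `deviceCount`, `launchOn`, and `offloadToAll` are hypothetical names that only mimic the roles of hipGetDeviceCount, hipSetDevice plus a kernel launch, and a library's device loop. They are not HIP APIs.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a runtime that loads a fat binary even when some
// visible devices have no matching code object.
struct LenientRuntime {
  std::vector<std::string> deviceIsas;  // ISA name of each visible device
  std::set<std::string> bundleIsas;     // ISAs that have a CO in the bundle

  // Plays the role of hipGetDeviceCount: every visible device is counted.
  int deviceCount() const { return static_cast<int>(deviceIsas.size()); }

  // Plays the role of hipSetDevice followed by a kernel launch: it can only
  // succeed when the bundle holds a CO for that device's ISA.
  bool launchOn(int index) const {
    return bundleIsas.count(deviceIsas.at(static_cast<std::size_t>(index))) > 0;
  }
};

// Library-style loop: offload work to every reported device and collect the
// indices of the devices where the launch failed.
std::vector<int> offloadToAll(const LenientRuntime& rt) {
  std::vector<int> failed;
  for (int i = 0; i < rt.deviceCount(); ++i) {
    if (!rt.launchOn(i)) failed.push_back(i);
  }
  return failed;
}
```

In this model, a binary built only for gfx1032 on the system from this issue reports two devices, yet the loop fails on device index 1 (gfx90c). Failing at load time instead, or hiding the device, avoids that mismatch.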

HIP_VISIBLE_DEVICES solves this issue, but I understand the problem it creates for iGPU users. I still recommend setting HIP_VISIBLE_DEVICES as a global environment variable to work around this inconvenience.

GZGavinZhao commented 5 months ago

I see, so this is intended behavior then. Thank you for your explanation!
