Open woodard opened 7 months ago
ddot.tar.gz Example code - this affects all the OpenMP test programs that I've tried since I added the 2nd GPU to my system.
Try setting ROCR_VISIBLE_DEVICES=0 to use first GPU or ROCR_VISIBLE_DEVICES=1 for the 2nd GPU.
I tried that, and yes, it works. Note that I mentioned this in the notes that open this issue, at “By limiting the devices available to the program, we were able to make it run.”
My expectation is that an OpenMP program would at least try to launch the correct code on the correct GPU. I think it would be better if it launched on the most powerful GPU or on an unused GPU. And it would be great if it launched on all available GPUs.
Problem Description
This problem actually happens on ROCm 6.1 but that was not available in the dropdown list.
When I try to compile for both the GFX906 and the GFX1100 it looks like the runtime tries to load the wrong GPU code in many cases.
The same thing happens with either ordering of the targets: ARCH=gfx1100,gfx906 fails, and so does ARCH=gfx906,gfx1100.
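For anyone reproducing this without the attached build files, here is a minimal sketch of how a comma-separated ARCH value could be turned into clang offload flags. The helper name offload_arch_flags and the file name ddot.c are assumptions for illustration, and the attached example may pass ARCH to the compiler differently.

```shell
# Hypothetical helper: expand a comma-separated list such as the ARCH
# values above into one --offload-arch flag per GPU target.
offload_arch_flags() {
    printf '%s\n' "$1" | sed -e 's/,/ --offload-arch=/g' -e 's/^/--offload-arch=/'
}

# Possible usage (ddot.c is an assumed source file name):
# clang -O2 -fopenmp $(offload_arch_flags "gfx906,gfx1100") ddot.c -o ddot
```

A binary built this way should carry one device image per listed target, which is the situation where the wrong image appears to get picked at run time.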
I haven't fully debugged the problem but it seems like the logic to select the code segment for the GPU gets confused when there are two different GPUs installed.
I found old bug reports of people having a similar problem all the way back to ROCm 1.0 when they had mixed GPU cards.
I believe that this is ultimately an important test case to support. Where we would like to get to is a place where a user who has both an APU and a dGPU can start up a small workload and have both units working on it, rather than running into problems.
I brought this up in https://github.com/ROCm/ROCR-Runtime/issues/198 thinking that it was a problem with the runtime but after troubleshooting it a bit there, I think that the problem is actually in the AMD OpenMP implementation for ROCm within clang.
By limiting the devices available to the program, we were able to make it run.
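The workaround can be sketched as a small loop that runs the same binary once per GPU, exposing only one device to the runtime each time. The function name run_per_gpu and the binary name ./ddot are assumptions for illustration; the GPU count of 2 matches the system described in this issue.

```shell
# Run a command once per GPU index, exposing only that GPU to the ROCm
# runtime via ROCR_VISIBLE_DEVICES.
# $1 is the number of GPUs; the remaining arguments are the command to run.
run_per_gpu() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "== ROCR_VISIBLE_DEVICES=$i =="
        ROCR_VISIBLE_DEVICES=$i "$@" || echo "run with GPU $i failed (exit $?)"
        i=$((i + 1))
    done
}

# Possible usage (./ddot is an assumed binary name):
# run_per_gpu 2 ./ddot
```

If each single-GPU run passes while the unrestricted run fails, that points at device or image selection rather than at the kernels themselves.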
Since the code uses only OpenMP and contains no HSA calls of its own, the problem must be in the code that the compiler's OpenMP implementation generates.
Operating System
NAME="Red Hat Enterprise Linux" VERSION="9.3 (Plow)"
CPU
model name : AMD Ryzen Threadripper PRO 5955WX 16-Cores
GPU
AMD Radeon Pro VII, AMD Radeon RX 7900 XTX
ROCm Version
ROCm 6.0.0
ROCm Component
aomp
Steps to Reproduce
see above
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
[ben@darkstar try1]$ /opt/rocm/bin/rocminfo --support
ROCk module version 6.3.6 is loaded
HSA System Attributes
Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
HSA Agents
Agent 1
  Name: AMD Ryzen Threadripper PRO 5955WX 16-Cores
  Uuid: CPU-XX
  Marketing Name: AMD Ryzen Threadripper PRO 5955WX 16-Cores
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 4000
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 32
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 131192084(0x7d1d514) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 131192084(0x7d1d514) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 131192084(0x7d1d514) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
Agent 2
  Name: gfx906
  Uuid: GPU-ac4210e173c71c04
  Marketing Name: AMD Radeon (TM) Pro VII
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
    L2: 8192(0x2000) KB
  Chip ID: 26273(0x66a1)
  ASIC Revision: 1(0x1)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 1700
  BDFID: 25344
  Internal Node ID: 1
  Compute Unit: 60
  SIMDs per CU: 4
  Shader Engines: 4
  Shader Arrs. per Eng.: 1
  WatchPts on Addr. Ranges:4
  Coherent Host Access: FALSE
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 64(0x40)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 40(0x28)
  Max Work-item Per CU: 2560(0xa00)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Packet Processor uCode:: 469
  SDMA engine uCode:: 145
  IOMMU Support:: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 16760832(0xffc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size: 16760832(0xffc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 3
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
Agent 3
  Name: gfx1100
  Uuid: GPU-32718d9ad8c12635
  Marketing Name: Radeon RX 7900 XTX
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 2
  Device Type: GPU
  Cache Info:
    L1: 32(0x20) KB
    L2: 6144(0x1800) KB
    L3: 98304(0x18000) KB
  Chip ID: 29772(0x744c)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 2304
  BDFID: 17152
  Internal Node ID: 2
  Compute Unit: 96
  SIMDs per CU: 2
  Shader Engines: 6
  Shader Arrs. per Eng.: 2
  WatchPts on Addr. Ranges:4
  Coherent Host Access: FALSE
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 32(0x20)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 32(0x20)
  Max Work-item Per CU: 1024(0x400)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Packet Processor uCode:: 550
  SDMA engine uCode:: 19
  IOMMU Support:: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 25149440(0x17fc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size: 25149440(0x17fc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 3
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx1100
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
Done
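As a quick sanity check, the installed GPU targets can be pulled out of rocminfo output and compared against the list the binary was built for. This is only a sketch: the function names list_gfx_targets and missing_targets are made up for illustration, and the pattern simply matches any gfx followed by hex digits. In this issue both targets were built, so the check would come back empty, but it rules out the simpler explanation of a missing device image.

```shell
# Print the distinct gfx targets found in rocminfo-style text on stdin.
list_gfx_targets() {
    grep -o 'gfx[0-9a-f][0-9a-f]*' | sort -u
}

# Read targets on stdin (one per line) and print those that are absent
# from the comma-separated build list given as $1.
missing_targets() {
    build_list=",$1,"
    while read -r target; do
        case "$build_list" in
            *",$target,"*) ;;          # target was built, nothing to report
            *) echo "$target" ;;       # installed GPU without a device image
        esac
    done
}

# Possible usage:
# /opt/rocm/bin/rocminfo | list_gfx_targets | missing_targets "gfx906,gfx1100"
```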
Additional Information
See https://github.com/ROCm/ROCR-Runtime/issues/198 for the HSA enumeration API and example code showing how to fix it.