chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Compiler error UTI-MIS-1041 while trying to compile AMD GPU code #22754

Closed: wjhorne closed this issue 1 year ago

wjhorne commented 1 year ago

Summary of Problem

I currently get the following error when attempting to compile GPU programs using ROCm on an AMD workstation:

internal error: UTI-MIS-1041 chpl version 1.32.0 pre-release (048e735b27)

Compiling works using CHPL_GPU=cpu. I checked the LLVM clang install that is used and verified that it can compile and run HIP programs without issue.

Steps to Reproduce

Source Code: Any Chapel source code, regardless of whether any GPU-related modules are used.

Compile command: CHPL_GPU=amd CHPL_GPU_ARCH=gfx1035 chpl jacobi.chpl
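For concreteness, even a minimal program like the sketch below (an illustrative stand-in, not the actual jacobi.chpl) hits the same internal error when compiled with the command above:

// repro.chpl -- a minimal, hypothetical stand-in; any Chapel program triggers the error
config const n = 10;

on here.gpus[0] {
  var A: [1..n] real;
  foreach i in 1..n do   // order-independent loop, eligible for GPU code generation
    A[i] = i;
  writeln(A);
}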

Configuration Information

Output of chpl --version

warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
chpl version 1.32.0 pre-release (048e735b27)
  built with LLVM version 15.0.7
  available LLVM targets: amdgcn, r600, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2023 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.

Output of printchplenv

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm +
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native +
CHPL_LOCALE_MODEL: gpu +
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled +
CHPL_AUX_FILESYS: none

Output of clang --version

clang version 15.0.7 (https://github.com/chapel-lang/chapel.git 048e735b27ad7895d7a54d832585bb7570c660e2)
Target: x86_64-unknown-linux-gnu
Thread model: posix
e-kayrakli commented 1 year ago

@wjhorne -- thanks for filing this bug report. It looks like something in AMD binary generation is going wrong, probably because something in your system's installation is not quite what our scripts expect to see. But I can't pinpoint it without further information. Could you:

  1. Run printchplenv --all --internal and paste the result
  2. Run which hipcc and paste the result
  3. Compile with --devel and paste the error message you get from that
  4. Run ls $CHPL_ROCM_PATH/amdgcn/bitcode/*bc and paste the result. Note that $CHPL_ROCM_PATH will not be set in your environment, but you'll see its value in the output of (1).

I am also tagging @stonea, as he knows the AMD quirks much better than I do.

wjhorne commented 1 year ago

printchplenv

machine info: Linux chameleon 6.4.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 11 Jul 2023 05:13:39 +0000 x86_64
CHPL_HOME: /home/nix/Code/languages/chapel
script location: /home/nix/Code/languages/chapel/util/chplenv
CHPL_HOST_PLATFORM: linux64
CHPL_HOST_COMPILER: gnu
CHPL_HOST_CC: gcc
CHPL_HOST_CXX: g++
CHPL_HOST_BUNDLED_COMPILE_ARGS: -I/home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/include -std=c++14 -fno-exceptions -fno-rtti -D_GNU_SOURCE -DSTDC_CONSTANT_MACROS -DSTDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Wno-comment -DHAVE_LLVM -I/home/nix/Code/languages/chapel/third-party/jemalloc/install/host/linux64-x86_64-gnu/include
CHPL_HOST_SYSTEM_COMPILE_ARGS:
CHPL_HOST_BUNDLED_LINK_ARGS: -L/home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/lib -Wl,-rpath,/home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/lib -lclangFrontend -lclangSerialization -lclangDriver -lclangCodeGen -lclangParse -lclangSema -lclangAnalysis -lclangEdit -lclangASTMatchers -lclangAST -lclangLex -lclangBasic -lclangSupport -L/home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/lib -lLLVM-15 -L/home/nix/Code/languages/chapel/third-party/jemalloc/install/host/linux64-x86_64-gnu/lib -ljemalloc
CHPL_HOST_SYSTEM_LINK_ARGS: -lm -lpthread
CHPL_HOST_ARCH: x86_64
CHPL_HOST_CPU: none
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm +
CHPL_TARGET_CC: /home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/bin/clang
CHPL_TARGET_CXX: /home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/bin/clang++
CHPL_TARGET_COMPILER_PRGENV: none
CHPL_TARGET_BUNDLED_COMPILE_ARGS: -I/home/nix/Code/languages/chapel/runtime/include/localeModels/gpu -I/home/nix/Code/languages/chapel/runtime/include/localeModels -I/home/nix/Code/languages/chapel/runtime/include/comm/none -I/home/nix/Code/languages/chapel/runtime/include/comm -I/home/nix/Code/languages/chapel/runtime/include/tasks/qthreads -I/home/nix/Code/languages/chapel/runtime/include -I/home/nix/Code/languages/chapel/runtime/include/qio -I/home/nix/Code/languages/chapel/runtime/include/atomics/cstdlib -I/home/nix/Code/languages/chapel/runtime/include/mem/jemalloc -I/home/nix/Code/languages/chapel/third-party/utf8-decoder -DHAS_GPU_LOCALE -I/home/nix/Code/languages/chapel/runtime/include/gpu/amd -DCHPL_JEMALLOC_PREFIX=chplje -I/home/nix/Code/languages/chapel/third-party/gmp/install/linux64-x86_64-native-llvm-none/include -I/home/nix/Code/languages/chapel/third-party/hwloc/install/linux64-x86_64-native-llvm-none-gpu/include -I/home/nix/Code/languages/chapel/third-party/qthread/install/linux64-x86_64-native-llvm-none-gpu-jemalloc-bundled/include -I/home/nix/Code/languages/chapel/third-party/jemalloc/install/target/linux64-x86_64-native-llvm-none/include -I/home/nix/Code/languages/chapel/third-party/re2/install/linux64-x86_64-native-llvm-none/include
CHPL_TARGET_SYSTEM_COMPILE_ARGS: -isystem/opt/hip/include -isystem/opt/hsa/include
CHPL_TARGET_LD: /home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/bin/clang++
CHPL_TARGET_BUNDLED_LINK_ARGS: -L/home/nix/Code/languages/chapel/lib/linux64/llvm/x86_64/cpu-native/loc-gpu/gpu-amd/gpu_mem-unified_memory/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none -lchpl -L/home/nix/Code/languages/chapel/third-party/gmp/install/linux64-x86_64-native-llvm-none/lib -lgmp -Wl,-rpath,/home/nix/Code/languages/chapel/third-party/gmp/install/linux64-x86_64-native-llvm-none/lib -L/home/nix/Code/languages/chapel/third-party/hwloc/install/linux64-x86_64-native-llvm-none-gpu/lib -lhwloc -Wl,-rpath,/home/nix/Code/languages/chapel/third-party/hwloc/install/linux64-x86_64-native-llvm-none-gpu/lib -L/home/nix/Code/languages/chapel/third-party/qthread/install/linux64-x86_64-native-llvm-none-gpu-jemalloc-bundled/lib -Wl,-rpath,/home/nix/Code/languages/chapel/third-party/qthread/install/linux64-x86_64-native-llvm-none-gpu-jemalloc-bundled/lib -lqthread -lchpl -L/home/nix/Code/languages/chapel/third-party/jemalloc/install/target/linux64-x86_64-native-llvm-none/lib -ljemalloc -L/home/nix/Code/languages/chapel/third-party/re2/install/linux64-x86_64-native-llvm-none/lib -lre2 -Wl,-rpath,/home/nix/Code/languages/chapel/third-party/re2/install/linux64-x86_64-native-llvm-none/lib
CHPL_TARGET_SYSTEM_LINK_ARGS: -L/opt/lib -Wl,-rpath,/opt/lib -lamdhip64 -lhsa-runtime64 -lnuma -lm -lpthread
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native +
CHPL_RUNTIME_CPU: native
CHPL_TARGET_CPU_FLAG: arch
CHPL_TARGET_BACKEND_CPU: native
CHPL_LOCALE_MODEL: gpu +
CHPL_GPU: amd +
CHPL_GPU_ARCH: gfx1035
CHPL_GPU_MEM_STRATEGY: unified_memory
CHPL_ROCM_PATH: /opt
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_HOST_MEM: jemalloc
CHPL_HOST_JEMALLOC: bundled
CHPL_MEM: jemalloc
CHPL_TARGET_MEM: jemalloc
CHPL_TARGET_JEMALLOC: bundled
CHPL_MAKE: make
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_GMP_IS_OVERRIDDEN: False
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_RE2_IS_OVERRIDDEN: False
CHPL_LLVM: bundled +
CHPL_LLVM_SUPPORT: bundled
CHPL_LLVM_CONFIG: /home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/bin/llvm-config
CHPL_LLVM_VERSION: 15
CHPL_LLVM_CLANG_C: /home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/bin/clang
CHPL_LLVM_CLANG_CXX: /home/nix/Code/languages/chapel/third-party/llvm/install/linux64-x86_64/bin/clang++
CHPL_LLVM_STATIC_DYNAMIC: static
CHPL_LLVM_TARGET_CPU: native
CHPL_AUX_FILESYS: none
CHPL_LIB_PIC: none
CHPL_SANITIZE: none
CHPL_SANITIZE_EXE: none
CHPL_RUNTIME_SUBDIR: linux64/llvm/x86_64/cpu-native/loc-gpu/gpu-amd/gpu_mem-unified_memory/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none
CHPL_LAUNCHER_SUBDIR: linux64/gnu/x86_64/loc-gpu/comm-none/tasks-qthreads/launch-none/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/lib_pic-none/san-none
CHPL_COMPILER_SUBDIR: linux64/gnu/x86_64/hostmem-jemalloc/llvm-bundled/15/san-none
CHPL_HOST_BIN_SUBDIR: linux64-x86_64
CHPL_TARGET_BIN_SUBDIR: linux64-x86_64-native
CHPL_SYS_MODULES_SUBDIR: linux64-x86_64-llvm
CHPL_LLVM_UNIQ_CFG_PATH: linux64-x86_64
CHPL_GASNET_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none/substrate-none/seg-none
CHPL_GMP_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none
CHPL_HWLOC_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none-gpu
CHPL_HOST_JEMALLOC_UNIQ_CFG_PATH: host/linux64-x86_64-gnu
CHPL_TARGET_JEMALLOC_UNIQ_CFG_PATH: target/linux64-x86_64-native-llvm-none
CHPL_LIBFABRIC_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none
CHPL_LIBUNWIND_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none
CHPL_QTHREAD_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none-gpu-jemalloc-bundled
CHPL_RE2_UNIQ_CFG_PATH: linux64-x86_64-native-llvm-none
CHPL_PE_CHPL_PKGCONFIG_LIBS:

which hipcc: /opt/rocm/bin/hipcc

--devel flag: internal error: seg fault [util/misc.cpp:1041]

rocm path: CHPL_ROCM_PATH = /opt  <---- This seems wrong, I think it should be /opt/rocm

Output of ls /opt/rocm/amdgcn/bitcode/*bc:

/opt/rocm/amdgcn/bitcode/asanrtl.bc /opt/rocm/amdgcn/bitcode/hip.bc /opt/rocm/amdgcn/bitcode/ockl.bc /opt/rocm/amdgcn/bitcode/oclc_abi_version_400.bc /opt/rocm/amdgcn/bitcode/oclc_abi_version_500.bc /opt/rocm/amdgcn/bitcode/oclc_correctly_rounded_sqrt_off.bc /opt/rocm/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc /opt/rocm/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm/amdgcn/bitcode/oclc_daz_opt_on.bc /opt/rocm/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm/amdgcn/bitcode/oclc_finite_only_on.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1010.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1011.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1012.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1013.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1030.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1031.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1032.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1033.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1034.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1035.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1036.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1100.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1101.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1102.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_1103.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_600.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_601.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_602.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_700.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_701.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_702.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_703.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_704.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_705.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_801.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_802.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_803.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_805.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_810.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_900.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_902.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_904.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_906.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_908.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_909.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_90c.bc /opt/rocm/amdgcn/bitcode/oclc_isa_version_940.bc /opt/rocm/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm/amdgcn/bitcode/oclc_unsafe_math_on.bc /opt/rocm/amdgcn/bitcode/oclc_wavefrontsize64_off.bc /opt/rocm/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm/amdgcn/bitcode/ocml.bc /opt/rocm/amdgcn/bitcode/opencl.bc

e-kayrakli commented 1 year ago

OK this is unfortunate and your assessment is correct -- CHPL_ROCM_PATH is clearly wrong there.

Luckily, you should be able to set it manually via export CHPL_ROCM_PATH=/opt/rocm and you should be good to go. This is documented under https://chapel-lang.org/docs/main/technotes/gpu.html#vendor-portability, but if you have any suggestions on improving that we'd appreciate it.

My theory about the problem is that the path you get from which hipcc is a symlink that eventually points at something more complicated. realpath $(which hipcc) works on my system and follows symlinks to the end. If realpath doesn't work, you can use ls -l to follow the links manually. I'd be interested in seeing that path.

Our current heuristic peels off 3 components from the real path of hipcc to find the ROCm root. Maybe your hipcc is not that deep for some reason? This heuristic has worked on the several machines we've tested so far, but apparently it isn't very universal.

One of the recent discussions we've been having was about including CHPL_ROCM_PATH/CHPL_CUDA_PATH in printchplenv output without any flags. It might make this issue more visible. We could probably improve our heuristics to find ROCm path correctly in your case, but there'll always be another system with a unique installation. It is to our benefit to expose these knobs to the users.

wjhorne commented 1 year ago

Setting the path is simple enough now that I know I have to export it when compiling, not just when building Chapel. Unfortunately, it looks like that only got me past one error and into another. I now get

lld: error: undefined symbol: __oclc_ABI_version

referenced by /tmp/chpl-nix.deleteme-t9hXMj/chplgpu.o:(ockl_hostcall_preview)
referenced by /tmp/chpl-nix.deleteme-t9hXMj/chplgpu.o:(ockl_hostcall_preview)

wjhorne commented 1 year ago

It looks like it might be an issue with clang/hipcc itself. I found a comment in a different repo recommending adding "-Xclang -mlink-bitcode-file -Xclang /rocm/install/path/amdgcn/bitcode/oclc_abi_version_400.bc" to the clang invocations. Is there a way I can test this out while compiling Chapel code?

e-kayrakli commented 1 year ago

That's interesting -- we indeed don't link against that bitcode library.

Could you try passing --ccflags "-Xclang -mlink-bitcode-file -Xclang /rocm/install/path/amdgcn/bitcode/oclc_abi_version_400.bc"? If the quotes don't work, you should be able to pass each "word" individually, each preceded by --ccflags. This should normally pass the flags through to clang directly.

There's special handling in our compiler for bitcode library linking, and I am unsure whether --ccflags can do something equivalent. If it doesn't work, could you try patching your Chapel with:

diff --git a/compiler/llvm/clangUtil.cpp b/compiler/llvm/clangUtil.cpp
index fa0bef8b43..92bcc12950 100644
--- a/compiler/llvm/clangUtil.cpp
+++ b/compiler/llvm/clangUtil.cpp
@@ -4259,6 +4259,7 @@ static void linkGpuDeviceLibraries() {
     linkBitCodeFile((libPath + "/oclc_finite_only_off.bc").c_str());
     linkBitCodeFile((libPath + "/oclc_correctly_rounded_sqrt_on.bc").c_str());
     linkBitCodeFile((libPath + "/oclc_wavefrontsize64_on.bc").c_str());
+    linkBitCodeFile((libPath + "/oclc_abi_version_400.bc").c_str());
     linkBitCodeFile(determineOclcVersionLib(libPath).c_str());
   }

and rebuild the compiler?

@stonea -- do you know what kind of bitcode libraries end up in that path? Should we just link them all (especially if this solution works)? At the very least, we should check for some other files, maybe.

wjhorne commented 1 year ago

With your patch I was able to get past that error, and I now hit one that reads as follows:

internal error: gpu-amd.c:62: Error calling HIP function: no kernel image is available for execution on the device (Code: 209)

From what I know this usually means that the correct architecture isn't being targeted. For my system I need something akin to "--offload-arch=gfx1032,gfx1035" due to the presence of a discrete GPU (gfx1035) and one attached to the processor (gfx1032). When I attempt to do CHPL_GPU_ARCH=gfx1032,gfx1035 I get

/opt/rocm/llvm/bin/clang-offload-bundler: warning: -inputs is deprecated, use -input instead
/opt/rocm/llvm/bin/clang-offload-bundler: warning: -outputs is deprecated, use -output instead
/opt/rocm/llvm/bin/clang-offload-bundler: error: number of input files and targets should match in bundling mode
error: .out file to fatbin file

Targeting only gfx1032 or only gfx1035 produces the first error.

stonea commented 1 year ago

@stonea -- do you know what kind of bitcode libraries end up in that path? Should we just link them all (especially so, if this solution works)? At the very least we should check for some other files, maybe.

'ocml.bc' and 'ockl.bc' are the main things. This is configured by linking to a number of other .bc files to turn various features on/off. Documented here:

https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/doc/OCML.md#controls

So we wouldn't want to link to all .bc files in that directory (some of them have contradictory meanings).

The "abi version" one is new to me. The linkBitCodeFile(determineOclcVersionLib(libPath).c_str()); line should be linking to an oclc_isa_version_XYZ.amdgcn.bc, but given the linker error it seems like we should be linking to one of them (just strange we haven't encountered this ourselves yet).

Edit: also note if you run hipcc with -### you can see what it's linking against.

e-kayrakli commented 1 year ago

Hmm, that's a setup that we haven't tested on before. I am guessing this is a personal system as it has both integrated and discrete GPUs?

The fact that multiple architectures don't work with CHPL_GPU_ARCH isn't very surprising -- we don't handle that today, though we have considered it in the past from a portability standpoint.

Since you have already patched your Chapel, I am going to suggest what my next step would be here :). I am curious what your rocm-smi --showproductname shows. My guess is that the integrated GPU is device 0 and the discrete is device 1. Assuming that is the case, here's a hack to make the runtime ignore device 0:

diff --git a/runtime/src/gpu/amd/gpu-amd.c b/runtime/src/gpu/amd/gpu-amd.c
index 6f6afe5403..515a9ba8d4 100644
--- a/runtime/src/gpu/amd/gpu-amd.c
+++ b/runtime/src/gpu/amd/gpu-amd.c
@@ -138,7 +138,7 @@ void chpl_gpu_impl_init(int* num_devices) {
   deviceClockRates = chpl_malloc(sizeof(int)*loc_num_devices);

   int i;
-  for (i=0 ; i<loc_num_devices ; i++) {
+  for (i=1 ; i<loc_num_devices ; i++) {
     hipDevice_t device;
     hipCtx_t context;

Note that you'll need to rebuild your runtime.

The error you're getting occurs during runtime initialization where we iterate over devices and do necessary initialization, including loading the binary for the device into memory. In your case, there's no binary for device 0 (I presume). If this hack works, we can consider handling (i.e. ignoring for now) integrated GPUs in a nicer way via an environment variable of sorts.
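If the hack works, a quick sanity check along these lines should get through initialization and run a trivial kernel. This is just a sketch (not your jacobi.chpl), and which index ends up mapping to the discrete GPU may still need some experimentation:

writeln("GPU sublocales reported: ", here.gpus.size);
on here.gpus[0] {
  var A: [1..1024] int;
  foreach i in A.domain do   // should compile down to a GPU kernel
    A[i] = 2*i;
  writeln("A[1024] = ", A[1024]);
}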


Also, I'd be up for arranging a screenshare session to get to the bottom of the issues if this doesn't help and/or you'd find it helpful. You can reach out to me at engin@hpe.com.

wjhorne commented 1 year ago

My integrated GPU is indeed device 0, and the additional patch got me one step closer.

I am currently stuck on another error given as

internal error: gpu-amd.c:72: Error calling HIP function: named symbol not found (Code: 500)

Using the jacobi.chpl test case and printing out the kernel name right before the failure yields chpl_gpu_kernel_jacobi_line_37.
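For reference, the loop that kernel comes from is roughly of this shape (paraphrased as a sketch; not the exact source of my jacobi.chpl, and the config const names here are just placeholders):

config const n = 1000, nSteps = 100;

on here.gpus[0] {
  var A, B: [0..n+1] real;
  A[0] = 1.0;
  for 1..nSteps {
    foreach i in 1..n do   // a foreach like this is what becomes a chpl_gpu_kernel_*_line_NN kernel
      B[i] = 0.5 * (A[i-1] + A[i+1]);
    A <=> B;
  }
}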

My time is pretty sporadic on this so not sure if a screen share is going to work out soon, but it would be nice to get this worked out. My goal of all of this is to turn on an actual GPU on some code I worked on using CHPL_GPU=cpu before trying to move to actual clusters.

e-kayrakli commented 1 year ago

My time is pretty sporadic on this so not sure if a screen share is going to work out soon, but it would be nice to get this worked out. My goal of all of this is to turn on an actual GPU on some code I worked on using CHPL_GPU=cpu before trying to move to actual clusters.

No worries. Let me know if your plans change. I think we'll incorporate what we learn from here into our code. The only problem is the lack of a system where we can nightly-test these soon-to-be features. Your path of starting with the cpu-as-device mode makes sense. It's just that the intermediate step you're wrestling with at the moment has different parameters than actual clusters. IOW, I certainly hope things will be smoother on your final target.

On to the problem: I think we are generating the kernel, but not setting up the "ignored GPU" correctly. So, when you do on here.gpus[0], you're still targeting the integrated chip. Here's a more advanced patch that's closer to a feature than a hack. You'll need to set CHPL_RT_NUM_IGNORED_GPUS=1 when launching an application; it'll skip that many leading GPUs when initializing the runtime. It'll set the number of devices correctly this time, though. As you can see, the patch is larger this time, and my confidence in it is low. If it doesn't work, I would consider using the clusters directly, if I am being frank.

ignoregpus.patch -- you need to revert the previous runtime patch.

This passes a quick test with writeln(here.gpus.size), and I think it'll help in your case. Let me know how it goes.
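In case it helps, the quick test was essentially the following (a sketch), compiled normally and launched with CHPL_RT_NUM_IGNORED_GPUS=1 set in the environment:

// expect this to report one fewer GPU than the physical count when
// CHPL_RT_NUM_IGNORED_GPUS=1 is set at launch time
writeln("here.gpus.size = ", here.gpus.size);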

wjhorne commented 1 year ago

I am happy to report that everything works if I use here.gpus[1] rather than here.gpus[0], which indicates the problem is what you suspected. I'll go through with the larger patch you provided to produce something a bit easier to work with in general, but I think everything is solved here.
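Concretely, with the simple runtime hack in place, code along these lines runs correctly (a sketch, not my actual code):

on here.gpus[1] {   // the discrete GPU; here.gpus[0] still hits the "named symbol not found" error
  var A: [1..1024] real;
  foreach i in A.domain do
    A[i] = sqrt(i:real);
  writeln("A[1024] = ", A[1024]);
}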

e-kayrakli commented 1 year ago

Phew, that's great to hear. I will summarize our observations and the different issues we tackled before closing this issue. We can probably merge improved versions of the hacks here as well.

You may be doing this already, but the following can make your life easier when you move to clusters in case you have a ton of on here.gpus[N]:

config const nIgnored = 0;

on here.gpus[N+nIgnored] { ... } // run on GPU N

coforall gpu in here.gpus[nIgnored..] do on gpu { ... } // run on all GPUs except the first `nIgnored`

You can set --nIgnored=1 when running your application on your current system, and drop that argument when running on an actual cluster. Once you're done porting to the cluster, it should be relatively easy to rip out nIgnored compared to hard-coding magic numbers in your code. (Or keep it in if that's helpful, obviously.)

e-kayrakli commented 1 year ago

OK, I think I've distilled several issues that came up here.

  1. Incorrect path was the first issue: https://github.com/chapel-lang/chapel/issues/22780
  2. Then, there was a missing bc linkage: https://github.com/chapel-lang/chapel/issues/22781
  3. If we had the ability to compile for multiple architectures, we might have gotten further in the experiment: https://github.com/chapel-lang/chapel/issues/22783
  4. Though that doesn't mean we would handle a setup with integrated+discrete GPUs nicely: https://github.com/chapel-lang/chapel/issues/22782

@wjhorne, am I missing anything here? My intention is to close this issue, as there's no further action needed here and the discussion has sprawled quite a bit. All of the above link back to this one, as I believe the context will be important going forward. Does that make sense?

wjhorne commented 1 year ago

The only thing I would add is that the current method /util/chplenv/chpl_gpu.py uses to determine the ROCm version has the same /opt/rocm issue I encountered, even when I attempted to set CHPL_ROCM_PATH. I ended up hacking in the correct version to get past the issue, but ideally it would find /opt/rocm/.info/version correctly for cases like mine.

Thanks for all the effort and quick replies here. I am having a largely positive experience with Chapel so far and am glad to see the support side is so strong.

bradcray commented 1 year ago

Thanks for saying so, Wyatt, and thanks for the quick responses and actions, Engin!

e-kayrakli commented 1 year ago

The only thing I would add is that the current method /util/chplenv/chpl_gpu.py uses to determine the ROCm version has the same /opt/rocm issue I encountered, even when I attempted to set CHPL_ROCM_PATH. I ended up hacking in the correct version to get past the issue, but ideally it would find /opt/rocm/.info/version correctly for cases like mine.

Posted a comment about this here: https://github.com/chapel-lang/chapel/issues/22780#issuecomment-1644698169

Please feel free to subscribe to those issues or comment under them if I mischaracterized anything.

Thanks again for the bug report! Closing this issue.

e-kayrakli commented 1 year ago

Hi @wjhorne -- In case you missed it, we were able to solve 2 of the problems I listed above in version 1.32 (released last week).

The main blocker for you is the mix of integrated+discrete GPUs, which still remains unresolved. But I was wondering whether you have been able to make progress in your experiments and to run on an HPC system where the problem hopefully won't arise.

Meanwhile, we have also received requests that are not identical to your case but would probably require a solution that helps with yours, too. I captured those issues and some ideas going forward in https://github.com/chapel-lang/chapel/issues/23535. Feel free to comment under it if you have any thoughts.

wjhorne commented 1 year ago

I have been watching as things have moved along and am glad that there has been so much progress! I was able to run on my discrete + integrated setup using the hacks discussed here. I was also able to run on a cluster, but ran into cluster teething issues that are not at all Chapel related.

I'll continue to watch as more changes come in. From my end, any work that makes the GPU capability more portable across machines, desktops and clusters alike, is greatly appreciated. That kind of portability is one of the strong advantages of C++ right now (along with a very healthy dose of inertia) when mixed with the likes of Kokkos or RAJA for HPC.