chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Chapel GPU segmentation fault on correct code #24025

Open ahazi327 opened 11 months ago

ahazi327 commented 11 months ago

Summary of Problem

Hi, I am noticing that when I try to run any Chapel code on my Nvidia GPU (RTX 3080) I encounter segmentation faults, even when running the jacobi.chpl code from the Chapel documentation on GPU programming. I have tested my CUDA installation and confirmed that it works, yet any Chapel code I run on a GPU always hits a segmentation fault.

Steps to Reproduce

Source Code: The source code I am using is 'test/gpu/native/jacobi/jacobi.chpl' from the Chapel repository.

Compile command:

chpl jacobi.chpl

Execution command:

./jacobi

Associated Future Test(s): https://github.com/chapel-lang/chapel/blob/main/test/gpu/native/jacobi/jacobi.chpl

Configuration Information

I have set my environment variables to the following:

source util/setchplenv.bash
export CHPL_LLVM=bundled
export CHPL_LOCALE_MODEL=gpu
export CHPL_GPU=nvidia
export CUDA_PATH=/usr/local/cuda-11.8
export CHPL_CUDA_PATH=/usr/local/cuda-11.8
export CHPL_GPU_MEM_STRATEGY=unified_memory

When running $CHPL_HOME/util/printchplenv --anonymize:

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu
CHPL_GPU: nvidia
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none

gcc version is 11.4.0, clang version is 14.0.0, nvcc version is 11.8.

Output of chpl --version:

warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
chpl version 1.32.0
  built with LLVM version 15.0.7
  available LLVM targets: amdgcn, r600, nvptx64, nvptx, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2023 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc. (See LICENSE file for more details)

e-kayrakli commented 11 months ago

Thanks for the bug report @ahazi327.

All that you have posted above looks OK to me. You are building your Chapel with this config, correct? In other words, you aren't just setting CHPL_LOCALE_MODEL=gpu and running chpl right away without rebuilding? I just wanted to rule that out.

To collect more data:

  1. I am interpreting this as an execution-time segfault, right? So, the compiler runs fine?
  2. Could you run ./jacobi --debugGpu and put the output in a file and post it here?
  3. Could you run printchplenv --internal --all --anonymize and post the output?
  4. export CHPL_GPU_MEM_STRATEGY=unified_memory is a non-default mode. It should work, since we nightly-test it, but the performance may not be ideal, and arguably it is less tested than the default array_on_device. It might be interesting to see whether using array_on_device makes any difference in behavior. You can do that by just running unset CHPL_GPU_MEM_STRATEGY -- you don't have to set it to anything explicitly. But then you have to rebuild your runtime; make -C runtime clean && make -C runtime should do it (see the sketch just after this list).
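
For reference, a minimal sketch of the sequence from item 4, assuming a from-source build with $CHPL_HOME pointing at the Chapel checkout; the test path and --debugGpu flag are the ones from this issue:

  cd $CHPL_HOME
  unset CHPL_GPU_MEM_STRATEGY                # fall back to the default, array_on_device
  make -C runtime clean && make -C runtime   # rebuild the runtime for that strategy
  chpl test/gpu/native/jacobi/jacobi.chpl -o jacobi
  ./jacobi --debugGpu                        # rerun, collecting the debug output from item 2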
ahazi327 commented 11 months ago
  1. Yes, this is an execution-time segfault; the compiler works fine when compiling the Chapel code.

  2. jacobi output.txt

  3. printchplenv.txt

  4. So when using CHPL_GPU_MEM_STRATEGY=array_on_device, or just not setting it explicitly, it will run the GPU code properly, but if I initially launch my terminal and set my Chapel environment variables as in my initial post, I get the segmentation fault. The interesting thing is that if I start with array_on_device and then set it to unified_memory, I sometimes do not get segmentation faults. When testing this I was getting inconsistent outputs, where I sometimes got segmentation faults and sometimes did not. test output.txt The file I am testing with has exactly the same contents as the jacobi test file.

e-kayrakli commented 11 months ago

jacobi output.txt

This one looks like correct execution to me.

printchplenv.txt

Nothing awkward in this one either.

So when using CHPL_GPU_MEM_STRATEGY=array_on_device, or just not setting it explicitly, it will run the GPU code properly, but if I initially launch my terminal and set my Chapel environment variables as in my initial post, I get the segmentation fault. The interesting thing is that if I start with array_on_device and then set it to unified_memory, I sometimes do not get segmentation faults. When testing this I was getting inconsistent outputs, where I sometimes got segmentation faults and sometimes did not.

I think this is the issue here. When swapping memory strategies, you must rebuild your runtime. The memory strategy is not something we see as an option you should be switching back and forth frequently. Based on the system you're running on, you might want unified_memory, but if that's the case, you must build Chapel with that environment variable set, and keep it set while using chpl. We probably should provide a better error message for the case where the runtime was not built for the environment that's currently set (we do have similar error messages in other cases).

My guess is that you built your Chapel without setting this environment variable, so it defaulted to array_on_device, but your subsequent uses of chpl had it set to unified_memory. I can't say exactly what happens in that case, but a segfault is definitely likely.

If you do want to use unified_memory, the easiest solution for you is make -C runtime clean && make -C runtime to rebuild the runtime with the environment you currently have (see the sketch below). But as I said, array_on_device is the default and is expected to perform better than unified_memory in many of the cases that we are aware of and test.
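
A rough sketch of that path, again assuming a from-source build with $CHPL_HOME set and using the jacobi test from this issue:

  cd $CHPL_HOME
  export CHPL_GPU_MEM_STRATEGY=unified_memory   # keep this set for the rebuild and for every later chpl invocation
  make -C runtime clean && make -C runtime
  chpl jacobi.chpl && ./jacobi                  # recompile and rerun in the same environment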

bradcray commented 11 months ago

We probably should provide a better error message for the case where the runtime was not built for the environment that's currently set (we do have similar error messages in other cases).

Given the behavior when getting it wrong, this sounds attractive to me. Presumably, this would be a "simple" switch to the chplenv scripts that compute the library paths, is that right? And/or, maybe in the short-term it would be easy to create some sort of mismatch message when the wires are crossed? (but maybe that's almost as much work as the real fix).

e-kayrakli commented 11 months ago

Presumably, this would be a "simple" switch to the chplenv scripts that compute the library paths, is that right?

That sounds right. Admittedly, I don't have a good intuition as to what should be a "path" variable and what shouldn't. We do differentiate nvidia/amd in paths, for example. But this feels like a bit of overkill to me. My argument is that it is a slippery slope where we might end up with a ton of different runtime paths due to combinatorial explosion. I am not sure that's a defensible standpoint, though.

As more context for where I am: I see unified_memory as an experimental thing (even though that's where we started). It performs worse across the board, and probably will not get along well with comm layers when we start working on GPU-driven communication. We still keep it because the landscape may shift with newer architectures where the line between host and device memories is blurrier, and unified_memory might represent something closer to what the actual hardware looks like.

And/or, maybe in the short-term it would be easy to create some sort of mismatch message when the wires are crossed? (but maybe that's almost as much work as the real fix).

This is probably easy to do if we're OK with doing it at application launch time. Currently, only the runtime binary can know what it was built with, so we need to start running it before we can raise the flag.
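
For illustration, a minimal sketch of what such a launch-time check could look like; the macro and function names here are hypothetical, not the actual Chapel runtime API:

  /* Hypothetical sketch: the runtime bakes in the strategy it was built with
     (e.g. via -DBUILT_GPU_MEM_STRATEGY=... when the runtime is compiled) and
     compares it at startup against what the generated program expects,
     aborting with an actionable message on a mismatch. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #ifndef BUILT_GPU_MEM_STRATEGY
  #define BUILT_GPU_MEM_STRATEGY "array_on_device"
  #endif

  static void check_gpu_mem_strategy(const char* compiledFor) {
    if (strcmp(compiledFor, BUILT_GPU_MEM_STRATEGY) != 0) {
      fprintf(stderr,
              "error: runtime built with CHPL_GPU_MEM_STRATEGY=%s, but this "
              "program was compiled expecting %s; rebuild the runtime with "
              "'make -C runtime clean && make -C runtime'\n",
              BUILT_GPU_MEM_STRATEGY, compiledFor);
      exit(1);
    }
  }

  int main(void) {
    /* Stand-in for program startup: the generated code would pass in the
       strategy that was in effect when the program was compiled with chpl. */
    check_gpu_mem_strategy("unified_memory");
    return 0;
  }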

bradcray commented 11 months ago

This is probably easy to do if we're OK with doing it at application launch time. Currently, only the runtime binary can know what it was built with.

That seems acceptable (and definitely better than the status quo).

As a potential alternative, I think the GASNet team has a strategy where they do something like put different static variables into different libraries and then rely on them at link time to move such errors from execution-time to link-time. But I can't recall offhand whether they generate elegant error messages or just rely on the variable being named something like _Runtime_built_for_unified_memory_Rebuild_to_link_against_array_on_device to convey the message.
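
For illustration, a toy sketch of that link-time technique (not what GASNet or Chapel actually ships; the symbol names are made up): each runtime build defines a symbol that encodes its configuration, the generated code references the symbol it expects, and a mismatch surfaces as an "undefined reference" whose name spells out the fix.

  /* runtime_unified.c -- part of a runtime library built for unified_memory */
  int Runtime_built_for_unified_memory = 1;

  /* program.c -- generated code that expects an array_on_device runtime */
  extern int Runtime_built_for_array_on_device;

  int main(void) {
    /* Reading the symbol forces the linker to resolve it; linking this against
       the unified_memory runtime above fails with
       "undefined reference to 'Runtime_built_for_array_on_device'",
       which names the mismatch instead of segfaulting at execution time. */
    return Runtime_built_for_array_on_device ? 1 : 0;
  }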