Closed milthorpe closed 11 months ago
Thanks for taking the time to file this, Josh!
I am trying to figure out what the right course of action is here. With the GPU locale model, I would expect such reductions to work if the forall executes on the host, so a compiler error might be too big of a hammer. We could instead issue a compiler warning, plus an execution-time error if the loop ends up running on a GPU sublocale. Would that be too subtle, or potentially frustrating if you ignore the compiler warning only to hit an execution-time error?
Or would we want to always run forall loops with reduce intents on the CPU until such time as they are working properly on GPUs?
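For context, a reduce intent gives each task in the forall its own private copy of the variable and combines the per-task copies when the loop finishes. A minimal host-side sketch (variable names are illustrative, not from the issue's reproducer):

```chapel
// Each task accumulates into a private copy of `total`; the copies are
// combined with `+` when the forall completes.
var total = 0;
forall i in 1..10 with (+ reduce total) do
  total += i;
writeln(total);  // 55
```

The question above is whether a loop like this should silently drop back to CPU execution when it appears inside a GPU-eligible region, rather than erroring out.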
Feels like an obvious thing to do that I apparently couldn't think of...
I didn't either, until you reminded me on another issue yesterday of GPU code falling back to the CPU when it isn't possible to GPUize it.
Or would we want to always run forall loops with reduce intents on the CPU until such time as they are working properly on GPUs?
I've merged a quick PR that changes things so that this is now what we'll do: https://github.com/chapel-lang/chapel/pull/23931
Of course, long term we'll want to support this on the GPU.
Summary of Problem
As documented in #23324, reduce intents are not currently supported for GPU loops. However, rather than producing a compile-time error, they generate code that behaves incorrectly: in the example below, a sum reduction returns 0 instead of the expected result.
Steps to Reproduce
Source Code: gpusum.chpl
Compile command:
for CPU:
chpl gpusum.chpl
for GPU on NVIDIA: CHPL_LOCALE_MODEL=gpu CHPL_GPU=nvidia chpl gpusum.chpl
Execution command:
./gpusum
On CPU, this prints "45"; on GPU, it prints "0".
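The source of gpusum.chpl is not inlined in this issue. A hypothetical reconstruction consistent with the description (a sum reduction that yields 45 on the CPU, so summing 0 through 9) might look like:

```chapel
// Hypothetical reproducer; the actual gpusum.chpl is not shown in the issue.
var sum = 0;
on here.gpus[0] {                      // run on a GPU sublocale
  forall i in 0..9 with (+ reduce sum) do
    sum += i;
}
writeln(sum);  // prints 45 on CPU, but 0 on GPU before the fix
```

With CHPL_LOCALE_MODEL=gpu, the forall body is eligible for GPU execution, which is the path on which the reduce intent was being silently miscompiled.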
Configuration Information
Output of chpl --version:
Output of $CHPL_HOME/util/printchplenv --anonymize: