[Bug]: Calling procedure passed by parameter fails in assertOnGpu()

Summary of Problem

Description: Calling the procedure passed by paremters make assertOnGpu() fail. In the following codes, test1() directly calls procedure increment() , while test2() calls it via the procedure parameter. However, assertOnGpu() in test2() fails as shown in the following compilation outputs.

The output also prints GPUFunctionCall.chpl:17: note: call to a primitive that is not fast and local. I was wondering how to make the procedure passed by parameter "fast and local".

Steps to Reproduce

Source Code:

module GPUFUnctionCall {
    use GpuDiagnostics;

    proc increment(x: real): real {
        return x + 1;
    }

    proc test1(ref A: [] real, n: int)
    {
        @assertOnGpu foreach i in 0..#n {
            A[i] = increment(A[i]);
        }
    }

    proc test2(ref A: [] real, n: int, f: proc(x: real): real) {
        @assertOnGpu foreach i in 0..#n {
            A[i] = f(A[i]);
        }
    }

    proc main() {
        on here.gpus[0] {
            const n = 128;
            var A: [0..#n] real;

            test1(A, n);
            test2(A, n, increment);

            writeln(A[5]);
        }
    }
}

Compile command: chpl --fast GPUFUnctionCall.chpl

Compilation output

GPUFunctionCall.chpl:15: In function 'test2':
GPUFunctionCall.chpl:16: error: Loop is marked with @assertOnGpu but is not eligible for execution on a GPU
GPUFunctionCall.chpl:17: note: call to a primitive that is not fast and local
  GPUFunctionCall.chpl:27: called as test2(A: [domain(1,int(64),one)] real(64), n: int(64), f: proc(x: real): real)
note: generic instantiations are underlined in the above callstack

Configuration Information

Chapel version

warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
chpl version 2.1.0 pre-release (0198d1f1ea)
  built with LLVM version 17.0.6
  available LLVM targets: amdgcn, r600, nvptx64, nvptx, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

Chapel configuration

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia
CHPL_COMM: none
CHPL_TASKS: qthreads *
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: none
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none

Thanks for filing this bug report @xianghao-wang!

The root cause of this is that we don't support virtual dispatch on GPU kernels today. Typically virtual dispatch is used for overridden methods to support polymorphism, but it is also used for first-class procedures as you use in your test2. Virtual calls involve finding a function in a table during execution time based on the class (here introduced by the compiler to wrap the first-class procedure) being used. These calls are compiler primitives at the time when we do GPU transforms, and that specific primitive is not safe for GPU execution because we need a symmetric of that table that contain __device__ versions of functions. In the short term, we can provide a better error message here. The longer term solution is to codegen all the functions in the virtual call table, and make the codegen for the primitive be aware of the GPU's virtual call table.

Meanwhile, for this case, there's a workaround that can be relatively acceptable. It relies on record-wrapped functions which we have been using before first-class procedure implementation was improved significantly:

    record incrementer {
      type t;
      proc this(x: t): t {
        return x+1;
      }
    }

    proc test3(ref A: [] real, n: int, f: incrementer(real)) {
        @assertOnGpu foreach i in 0..#n {
            A[i] = f(A[i]);
        }
    }

which can be then called with

test3(A, n, new incrementer(real));

Clearly, less than ideal (more coding, abuse of this maybe?) but hopefully manageable.

I'll also note here that even when we have support for virtual calls from GPU kernels, I expect them to be significantly slower because of the dynamic nature of the call. Function calls are already costlier on the GPU, I expect virtual calls to be even costlier.

chapel-lang / chapel