JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Device function pointers #2450

Open leios opened 1 month ago

leios commented 1 month ago

Right, so simply put: I want the following code to work:

using CUDA

f(x) = x+1

g(x) = x*2

function call_fxs!(fxs)
    x = 1
    for i = 1:length(fxs)
        x = fxs[1](x)
        @cuprintf("%g\n",x)
    end
end

@cuda threads = 1 call_fxs!((f, g))

This is what the code looks like in CUDA C:

#include <stdio.h>
typedef double (*func)(double x);

__device__ double func1(double x)
{
    return x + 1.0f;
}

__device__ double func2(double x)
{
    return x * 2.0f;
}

__device__ func pfunc1 = func1;
__device__ func pfunc2 = func2;

__global__ void test_kernel(func* f, int n)
{
  double x = 1.0;

  for(int i=0;i<n;++i){
   x=f[i](x);
   printf("%g\n",x);
  }
}

int main(void)
{
  int N = 2;

  func* h_f;
  func* d_f;

  h_f = (func*)malloc(N*sizeof(func));

  cudaMalloc((void**)&d_f,N*sizeof(func));

  cudaMemcpyFromSymbol( &h_f[0], pfunc1, sizeof(func));
  cudaMemcpyFromSymbol( &h_f[1], pfunc2, sizeof(func));

  cudaMemcpy(d_f,h_f,N*sizeof(func),cudaMemcpyHostToDevice);

  test_kernel<<<1,1>>>(d_f,N);

  cudaFree(d_f);
  free(h_f);

  return 0;
}
[jars@node0024 ~]$ nvcc check.cu 
[jars@node0024 ~]$ ./a.out 
2
4

I've been banging my head against it for a long time (a few months before this post: https://github.com/leios/Fable.jl/pull/64#issuecomment-1501858745)

My current solution involves @generated loops on loops, which ends up generating functions that are quite large and take a significant amount of time to compile (sometimes up to 70 s for a kernel that runs in 0.0001 s). Mentioned here: https://discourse.julialang.org/t/is-there-any-good-way-to-call-functions-from-a-set-of-functions-in-a-cuda-kernel/102051/3?u=leios
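
Roughly, that workaround has the following shape (a deliberately simplified, hypothetical sketch of the idea; the real kernels unroll far more code, which is where the compile time goes):

using CUDA

# The generated body hard-codes one call per element of the tuple, so every new
# combination of functions forces a full recompile of the kernel.
@generated function apply_all(fxs::Tuple, x)
    N = length(fxs.parameters)
    calls = [:(x = fxs[$i](x)) for i in 1:N]
    quote
        $(calls...)
        return x
    end
end

function call_fxs_generated!(fxs)
    x = apply_all(fxs, 1)
    @cuprintf("%g\n", Float64(x))
    return
end

# @cuda threads = 1 call_fxs_generated!((f, g))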

Solutions that exist in other languages:

  1. GLSL / OpenCL: The user compiles shaders / kernels at runtime, so they can be spun up in the background relatively quickly. Somehow, this is much faster than doing essentially the same thing in Julia.
  2. CUDA: just use function pointers (though I did have to do my own AST handling for certain workflows)

I have had this discussion throughout the years with @vchuravy , @jpsamaroo , and @maleadt, but never documented it because I'm apparently the only one actually hitting the issue.

To be honest, I think we are approaching something that might not be fundamentally possible with Julia, but I would like to be able to pass in arbitrary functions to a kernel without forcing recompilation of any kind.

I am not sure if it is best to put this here or in GPUCompiler.

related discussions:

  1. https://forums.developer.nvidia.com/t/consistency-of-functions-pointer/29325/6
  2. https://github.com/KhronosGroup/Vulkan-Docs/issues/2232
leios commented 1 month ago

Errors:

julia> @cuda threads = 1 call_fxs!((f, g))
ERROR: InvalidIRError: compiling MethodInstance for call_fxs!(::Tuple{typeof(f), typeof(g)}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:31
 [2] call_fxs!
   @ ./REPL[7]:4
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:460 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:459 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:103
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:97 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:136
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:115 [inlined]
  [9] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:111
 [10] compile
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:103 [inlined]
 [11] #1145
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/compilation.jl:254 [inlined]
 [12] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:52
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:42
 [14] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/compilation.jl:253
 [15] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:128
 [16] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:103
 [17] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:369 [inlined]
 [18] macro expansion
    @ ./lock.jl:267 [inlined]
 [19] cufunction(f::typeof(call_fxs!), tt::Type{Tuple{Tuple{typeof(f), typeof(g)}}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:364
 [20] cufunction(f::typeof(call_fxs!), tt::Type{Tuple{Tuple{typeof(f), typeof(g)}}})
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:361
 [21] top-level scope
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:112
Some type information was truncated. Use `show(err)` to see complete types.

Error for similar code failing on AMDGPU:

ERROR: InvalidIRError: compiling MethodInstance for gpu_check(::KernelAbstractions.CompilerMetadata{…}, ::AMDGPU.Device.ROCDeviceVector{…}, ::Tuple{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:31
 [2] macro expansion
   @ ./REPL[6]:4
 [3] gpu_check
   @ ~/.julia/packages/KernelAbstractions/MAxUm/src/macros.jl:95
 [4] gpu_check
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{…}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:460 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:459 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:103
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:97 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:136
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:115 [inlined]
  [9] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:111
 [10] compile
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:103 [inlined]
 [11] #40
    @ ~/.julia/packages/AMDGPU/WqMSe/src/compiler/codegen.jl:170 [inlined]
 [12] JuliaContext(f::AMDGPU.Compiler.var"#40#41"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:52
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:42
 [14] hipcompile(job::GPUCompiler.CompilerJob)
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/WqMSe/src/compiler/codegen.jl:169
 [15] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(AMDGPU.Compiler.hipcompile), linker::typeof(AMDGPU.Compiler.hiplink))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:128
 [16] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:103
 [17] macro expansion
    @ ~/.julia/packages/AMDGPU/WqMSe/src/compiler/codegen.jl:137 [inlined]
 [18] macro expansion
    @ ./lock.jl:267 [inlined]
 [19] hipfunction(f::typeof(gpu_check), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/WqMSe/src/compiler/codegen.jl:131
 [20] hipfunction(f::typeof(gpu_check), tt::Type{Tuple{…}})
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/WqMSe/src/compiler/codegen.jl:130
 [21] macro expansion
    @ ~/.julia/packages/AMDGPU/WqMSe/src/highlevel.jl:172 [inlined]
 [22] (::KernelAbstractions.Kernel{…})(::ROCArray{…}, ::Vararg{…}; ndrange::Int64, workgroupsize::Nothing)
    @ AMDGPU.ROCKernels ~/.julia/packages/AMDGPU/WqMSe/src/ROCKernels.jl:86
 [23] top-level scope
    @ REPL[11]:1
Some type information was truncated. Use `show(err)` to see complete types.
vchuravy commented 1 month ago

Just quickly documenting my recommendation here:

julia> struct VTable{T}
          funcs::T
       end

julia> @generated function (VT::VTable{T})(fidx, args...) where T
           N = length(T.parameters)
           quote
              Base.Cartesian.@nif $(N+1) d->fidx==d d->return VT.funcs[d](args...) d->error("fidx oob")
           end
       end

julia> VT = VTable(((x)->x+1, (x)->x+2))
VTable{Tuple{var"#2#4", var"#3#5"}}((var"#2#4"(), var"#3#5"()))

julia> VT(1, 2)
3

julia> VT(2, 2)
4

julia> VT(3, 2)
ERROR: fidx oob
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] macro expansion
   @ ./REPL[3]:4 [inlined]
 [3] (::VTable{Tuple{var"#2#4", var"#3#5"}})(fidx::Int64, args::Int64)
   @ Main ./REPL[3]:1
 [4] top-level scope
   @ REPL[7]:1

julia> @code_typed VT(3, 2)
CodeInfo(
1 ─ %1 = (fidx === 1)::Bool
└──      goto #3 if not %1
2 ─ %3 = Core.getfield(args, 1)::Int64
│   %4 = Base.add_int(%3, 1)::Int64
└──      return %4
3 ─ %6 = (fidx === 2)::Bool
└──      goto #5 if not %6
4 ─ %8 = Core.getfield(args, 1)::Int64
│   %9 = Base.add_int(%8, 2)::Int64
└──      return %9
5 ─      invoke Main.error("fidx oob"::String)::Union{}
└──      unreachable
) => Int64

If this leads to function blowup, then one might need to use:

@generated function (VT::VTable{T})(fidx, args...) where T
    N = length(T.parameters)
    quote
        Base.Cartesian.@nif $(N+1) d->fidx==d d->begin; f = VT.funcs[d]; @noinline f(args...); end d->error("fidx oob")
    end
end
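
For reference, a minimal (untested) sketch of how this could be wired into the original MWE, assuming the VTable definition above and the f / g from the issue:

using CUDA

f(x) = x + 1
g(x) = x * 2

function call_fxs!(vt, n)
    x = 1.0
    for i in 1:n
        x = vt(i, x)          # integer-indexed dispatch, no dynamic tuple getindex
        @cuprintf("%g\n", x)
    end
    return
end

# @cuda threads = 1 call_fxs!(VTable((f, g)), 2)

Note that this does not remove the recompilation issue: all candidate functions still get compiled into the one kernel, it only sidesteps the dynamic getindex.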
leios commented 1 month ago

@maleadt said the following on Slack, and I would like to repeat it here:

Function pointers in CUDA are tricky, because the module you compile needs to contain all code. You can't compile a kernel and pass it an arbitrary function pointer to execute; the function pointer has to refer to something in the module (hence the module lookup shenanigans). So in that sense it isn't a real function pointer, it's more of a symbol you look up in a previously compiled module. In addition though, if you have C source code like:

__device__ int foo(int x);
__global__ void kernel(int *ptr());

... compiling those two together gives you addressable entities just by looking at the declarations, whereas in Julia:

foo(x::Int)::Int;
kernel(ptr::Function);

... the foo function is unadorned, so not taken into account when compiling GPU code, but also passing foo by doing @cuda kernel(foo) doesn't give the compiler enough information to compile an addressable version of foo together with kernel, because in Julia foo only refers to a generic function, and not to a specialized method foo(::Int)::Int. That requires inspecting the kernel for how you invoke foo, which quickly runs into other issues (what if you invoke foo with two differently-typed arguments? That would mean we need 2 versions of foo, but you're only passing a single function pointer...). In summary, lots of issues that will probably prevent us from ever supporting this fully like in C.

So, my understanding: there seem to be three (related) issues here:

  1. Julia gives each Function its own type, and it's not possible to iterate through a Tuple of mixed type on the GPU. This is the error we get when iterating on the Tuple (see the CPU-side sketch after this list). I think this is one of those Julia-specific issues (because no one else would "accidentally" send a Tuple of mixed type to the GPU to begin with).
  2. Julia cannot introspect a Tuple / Array of functions being passed in, which results in the function pointers not being callable (even if we did bypass the error).
  3. On the CUDA side, it's not a great idea to send functions across CUDA modules in C. There are workarounds for this, but the main one is to statically compile the code in stages (there are even CMake commands for this, e.g. `set(CUDA_SEPARABLE_COMPILATION ON)`).
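
A small CPU-side illustration of point 1 (hypothetical snippet, reusing f and g from the MWE):

f(x) = x + 1
g(x) = x * 2
fxs = (f, g)

fxs[1]   # literal index: inferred as typeof(f), compiles to a direct call
i = 2
fxs[i]   # runtime index: only inferred as Union{typeof(f), typeof(g)}; on the
         # GPU this is what lowers to the unsupported ijl_get_nth_field_checked
         # call shown in the errors above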

To solve these issues, we would basically need Julia to change: either to be more generic with functions / function pointers, or to be more clever with type introspection. If that were possible, then we could get around the compilation issue by allowing more flexibility in when certain code is compiled (for example, we could compile code from the DSL into "__device__" functions that are then called after static compilation).

Anyway, long story short: no way this is going to be fixed any time soon, but it was good to at least finally document the issue.

It seems like some people are working on this from the Vulkan side (as an extension).

maleadt commented 1 month ago

also x-ref https://github.com/JuliaGPU/CUDA.jl/pull/1853

maleadt commented 1 month ago

Realistically, we can only truly fix this (i.e., without having to re-specialize the entire module, which wouldn't save any compile time) if we ever get proper cross-module function pointers, which is up to NVIDIA. Lacking that, we can only make the ergonomics slightly better, but I'm not sure it's going to be much better than just passing a tuple of functions. As noted, that doesn't entirely work because of https://github.com/JuliaGPU/GPUCompiler.jl/issues/607, but with some Julia-level unrolling of the for loop it should be possible to get specialized (GPU-compatible) code for that.

Note that the situation in C isn't much better; the entire GPU module contains all host and device functions, so you don't really get function pointers.

maleadt commented 1 month ago

with some Julia-level unrolling of the for loop it should be possible to get specialized (GPU-compatible) code for that

For example, to make the example from https://github.com/JuliaGPU/GPUCompiler.jl/issues/607 work:

using Metal, Unrolled

# Unrolled.jl expands the loop over the tuple at compile time, so every element
# is accessed with a constant index and each assignment is statically typed.
@unroll function kernel(a, t)
    @unroll for x in t
        @inbounds a[1] = x
    end
    return
end

function main()
    a = Metal.ones(1)
    @metal kernel(a, (1, 1f0))
end

I'd apply that to the MWE posted here, but that one already works fine...

julia> using CUDA

julia> f(x) = x+1
julia> g(x) = x*2

julia> function call_fxs!(fxs)
           x = 1
           for i = 1:length(fxs)
               x = fxs[1](x)
               @cuprintf("%g\n",x)
           end
       end

julia> @cuda threads = 1 call_fxs!((f, g))
9.88131e-324
1.4822e-323
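
(The values printed here are just the Int results 2 and 3 with their bits read as a Float64 by %g; an untested tweak that converts before printing gives readable output:)

function call_fxs!(fxs)
    x = 1
    for i = 1:length(fxs)
        x = fxs[1](x)
        @cuprintf("%g\n", Float64(x))   # convert so %g actually reads a Float64
    end
    return
end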

@leios Does that sufficiently cover your needs?

leios commented 1 month ago

Yeah, loop unrolling was another thing I tried for my "real" application, but I really needed something more general.

That said, I think we have enough information here for anyone who stumbles across these errors to find a solution / workaround for their problem.