Open leios opened 3 months ago
julia> @cuda threads = 1 call_fxs!((f, g))
ERROR: InvalidIRError: compiling MethodInstance for call_fxs!(::Tuple{typeof(f), typeof(g)}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
[1] getindex
@ ./tuple.jl:31
[2] call_fxs!
@ ./REPL[7]:4
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Some type information was truncated. Use `show(err)` to see complete types.
Error for similar code failing on AMDGPU:
ERROR: InvalidIRError: compiling MethodInstance for gpu_check(::KernelAbstractions.CompilerMetadata{…}, ::AMDGPU.Device.ROCDeviceVector{…}, ::Tuple{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
[1] getindex
@ ./tuple.jl:31
[2] macro expansion
@ ./REPL[6]:4
[3] gpu_check
@ ~/.julia/packages/KernelAbstractions/MAxUm/src/macros.jl:95
[4] gpu_check
@ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Just quickly documenting my recommendation here:
julia> struct VTable{T}; funcs::T; end
julia> @generated function (VT::VTable{T})(fidx, args...) where T
           N = length(T.parameters)
           quote
               Base.Cartesian.@nif $(N+1) d->fidx==d d->return VT.funcs[d](args...) d->error("fidx oob")
           end
       end
julia> VT = VTable(((x)->x+1, (x)->x+2))
VTable{Tuple{var"#2#4", var"#3#5"}}((var"#2#4"(), var"#3#5"()))
julia> VT(1, 2)
julia> VT(2, 2)
julia> VT(3, 2)
ERROR: fidx oob
[1] error(s::String)
@ Base ./error.jl:35
[2] macro expansion
@ ./REPL[3]:4 [inlined]
[3] (::VTable{Tuple{var"#2#4", var"#3#5"}})(fidx::Int64, args::Int64)
@ Main ./REPL[3]:1
[4] top-level scope
@ REPL[7]:1
julia> @code_typed VT(3, 2)
CodeInfo(
1 ─ %1 = (fidx === 1)::Bool
└── goto #3 if not %1
2 ─ %3 = Core.getfield(args, 1)::Int64
│ %4 = Base.add_int(%3, 1)::Int64
└── return %4
3 ─ %6 = (fidx === 2)::Bool
└── goto #5 if not %6
4 ─ %8 = Core.getfield(args, 1)::Int64
│ %9 = Base.add_int(%8, 2)::Int64
└── return %9
5 ─ invoke Main.error("fidx oob"::String)::Union{}
└── unreachable
) => Int64
If inlining every callee makes the generated function too large, then one might need to keep the calls out of line:
@generated function (VT::VTable{T})(fidx, args...) where T
    N = length(T.parameters)
    quote
        Base.Cartesian.@nif $(N+1) d->fidx==d d->begin; f = VT.funcs[d]; @noinline f(args...); end d->error("fidx oob")
    end
end
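For readers who would rather avoid `@generated`, the same if/else chain over a runtime index can be sketched with plain recursion over the tuple. This is only an illustration: `FuncTable` and `select_call` are hypothetical names, and the sketch has only been reasoned about on the CPU, so whether it compiles cleanly for a given GPU backend is an assumption.

```julia
# Hypothetical stand-in for the VTable above.
struct FuncTable{T<:Tuple}
    funcs::T
end

# Base case: ran out of functions, so the index was out of bounds.
@inline select_call(::Tuple{}, fidx, args...) = error("fidx oob")

# Peel one function off the tuple per step. The index is compared at run
# time, but every call site sees a concretely typed function.
@inline function select_call(fs::Tuple, fidx, args...)
    fidx == 1 && return first(fs)(args...)
    return select_call(Base.tail(fs), fidx - 1, args...)
end

(ft::FuncTable)(fidx, args...) = select_call(ft.funcs, fidx, args...)

ft = FuncTable((x -> x + 1, x -> x + 2))
ft(1, 2)  # calls the first function
ft(2, 2)  # calls the second function
```

As with the `@nif` version, every function in the tuple is compiled into the caller, so this does not avoid recompilation when the set of functions changes.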
@maleadt said the following on Slack and I would like to repeat it here:
Function pointers in CUDA are tricky, because the module you compile needs to contain all code. You can't compile a kernel and pass it an arbitrary function pointer to execute; the function pointer has to refer to something in the module (hence the module lookup shenanigans). So in that sense it isn't a real function pointer, it's more of a symbol you look up in a previously compiled module. In addition though, if you have C source code like:
__device__ int foo(int x); __global__ void kernel(int *ptr());
... compiling those two together gives you addressable entities by just looking at the declarations, whereas in Julia:
foo(x::Int)::Int; kernel(ptr::Function);
... the foo function is unadorned, so not taken into account when compiling GPU code, but also passing foo by doing
@cuda kernel(foo)
doesn't give the compiler enough information to compile an addressable version of foo together with kernel, because in Julia foo only refers to a generic function, and not to a specialized method foo(::Int)::Int. That requires inspecting the kernel for how you invoke foo, which quickly runs into other issues (what if you invoke foo with two differently-typed arguments? That would mean we need 2 versions of foo, but you're only passing a single function pointer...). In summary, lots of issues that will probably prevent us from ever supporting this fully like in C.
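The point about generic functions versus specialized methods can be seen on the CPU without any GPU package. The following is a minimal sketch; `apply` is a hypothetical helper standing in for a kernel that receives a function argument.

```julia
foo(x) = x + 1

# `foo` names a generic function. Its type is a field-less singleton, so the
# value itself carries neither an address nor an argument signature.
@show isbitstype(typeof(foo))
@show fieldcount(typeof(foo))

# The same `foo` is compiled separately for each argument type it is called with.
@show foo(1)       # uses a specialization for Int
@show foo(1.0f0)   # uses a specialization for Float32

# A caller is specialized on typeof(foo), and only the call inside it
# determines which specialization of foo is needed.
apply(f, x) = f(x)
@show apply(foo, 2)
```

Because the function travels in the type rather than as a pointer, changing which function is passed means compiling a new specialization of the caller.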
So, my understanding: there seem to be three (related) issues here:
To solve these issues, we would basically need Julia to change, either by being more generic with functions / function pointers or by being more clever with type introspection. If that were possible, then we could get around the compilation issue by allowing more flexibility in when certain code is compiled (for example, we could compile code from the DSL into "__device__" functions that are then called after static compilation).
Anyway, long story short: there is no way this is going to be fixed any time soon, but it was good to at least finally document the issue.
It seems like some people are working on this from the Vulkan side (as an extension).
Realistically, we can only truly fix this (i.e., without having to re-specialize the entire module and thus not save any compile time) if we ever get proper cross-module function pointers, which is up to NVIDIA. Lacking that, we can only make the ergonomics slightly better, but I'm not sure it's going to be much better than just passing a tuple of functions. As noted, that doesn't entirely work because of, but with some Julia-level unrolling of the for loop it should be possible to get specialized (GPU-compatible) code for that.
Note that the situation in C isn't much better; the entire GPU module contains all host and device functions, so you don't really get function pointers.
with some Julia-level unrolling of the for loop it should be possible to get specialized (GPU-compatible) code for that
For example, to make the example from work:
using Metal, Unrolled
@unroll function kernel(a, t)
    @unroll for x in t
        @inbounds a[1] = x
    end
    return
end
function main()
    a = Metal.ones(1)
    @metal kernel(a, (1, 1f0))
end
I'd apply that to the MWE posted here, but that one already works fine...
julia> using CUDA
julia> f(x) = x+1
julia> g(x) = x*2
julia> function call_fxs!(fxs)
           x = 1
           for i = 1:length(fxs)
               x = fxs[1](x)
           end
       end
julia> @cuda threads = 1 call_fxs!((f, g))
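The Julia-level unrolling mentioned above can also be sketched without Unrolled.jl by recursing on the tail of the tuple. This assumes the intent of the loop is to apply each function in turn (that is, to index with the loop variable); `apply_all` is a hypothetical helper, and the sketch is shown on the CPU only.

```julia
f(x) = x + 1
g(x) = x * 2

# Base case: no functions left, return the accumulated value.
apply_all(x, ::Tuple{}) = x

# Apply the first function, then recurse on the rest of the tuple. Each
# recursion level is specialized on the remaining tuple type, so no call
# needs a runtime index into the tuple.
apply_all(x, fxs::Tuple) = apply_all(first(fxs)(x), Base.tail(fxs))

apply_all(1, (f, g))  # evaluates g(f(1))
```

Like the tuple-of-functions approach in general, this still specializes the caller on the exact set of functions, so it does not address the recompilation cost.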
@leios Does that sufficiently cover your needs?
Yeah, loop unrolling was another thing I tried for my "real" application, but I really needed something more general.
That said, I think we have enough information here for anyone who stumbles across these errors to find a solution / workaround for their problem.
Right, so simply put: I want the following code to work:
This is what the code looks like in CUDA C:
I've been banging my head against it for a long time (a few months before this post:
My current solution involves loops on loops, which ends up generating functions that are quite large and take a significant amount of time to compile (sometimes up to 70 s for a kernel that runs in 0.0001 s). Mentioned here: that exist in other languages:
I have had this discussion throughout the years with @vchuravy , @jpsamaroo , and @maleadt, but never documented it because I'm apparently the only one actually hitting the issue.
To be honest, I think we are approaching something that might not be fundamentally possible with Julia, but I would like to be able to pass in arbitrary functions to a kernel without forcing recompilation of any kind.
I am not sure if it is best to put this here or in GPUCompiler.
related discussions: