I just ran into a case where bad codegen, not incomplete inference, resulted in calls to the runtime that prevent us from using Cassette for GPU codegen. Pasting my notes here:
```julia
using Cassette
using InteractiveUtils # for code_llvm when run as a script

Cassette.@context Noop

function main()
    a = [0]

    function kernel(T, ptr)
        unsafe_store!(ptr, 1)
        return
    end
    kernel(Int, pointer(a))
    code_llvm(kernel, Tuple{Type{Int}, Ptr{Int}}; debuginfo=:none)
    # good code; T and ptr end up in different slots, T is marked constant
    # https://github.com/JuliaLang/julia/blob/592878623d376c71e5452dc2775fa2f7a4e097ca/src/codegen.cpp#L6116
    #
    # define void @julia_kernel_12936(%jl_value_t*, i64) #0 {
    # top:
    #   %2 = inttoptr i64 %1 to i64*
    #   store i64 1, i64* %2, align 1
    #   ret void
    # }

    Cassette.overdub(Noop(), kernel, Int, pointer(a))
    code_llvm(Cassette.overdub, Tuple{typeof(Noop()), typeof(kernel), Type{Int}, Ptr{Int}}; debuginfo=:none)
    # bad code; T and ptr live in the vaSlot, which is neither constant nor unused.
    # the varargs slot isn't concretely typed, so we get a call to jl_f_tuple
    # and calls to jl_f_getfield to access the values
    # https://github.com/JuliaLang/julia/blob/592878623d376c71e5452dc2775fa2f7a4e097ca/src/codegen.cpp#L6198-L6201
    #
    # define void @"julia_#4_12937"(%jl_value_t* nonnull, i64) #0 {
    # top:
    #   ...
    #   %19 = call nonnull %jl_value_t* @jl_f_tuple(%jl_value_t* null, %jl_value_t** %2, i32 2)
    #   ...
    #   %23 = call nonnull %jl_value_t* @jl_f_getfield(%jl_value_t* null, %jl_value_t** %2, i32 2)
    #   %24 = bitcast %jl_value_t* %23 to i64**
    #   %25 = load i64*, i64** %24, align 8
    #   store i64 1, i64* %25, align 1
    #   ...
    #   ret void
    # }

    function kernel(ptr)
        unsafe_store!(ptr, 1)
        return
    end
    Cassette.overdub(Noop(), kernel, pointer(a))
    code_llvm(Cassette.overdub, Tuple{typeof(Noop()), typeof(kernel), Ptr{Int}}; debuginfo=:none)
    # good code; we still have a vaSlot, but it's concretely typed.
    # https://github.com/JuliaLang/julia/blob/592878623d376c71e5452dc2775fa2f7a4e097ca/src/codegen.cpp#L6193-L6196
end
isinteractive() || main()
```
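To check the good/bad distinction mechanically rather than by eyeballing the IR, one can capture the `code_llvm` output as a string and look for the runtime intrinsics. This is a sketch of such a check (not part of the original notes; the standalone `kernel` here is illustrative), using the same `jl_f_tuple`/`jl_f_getfield` markers discussed above:

```julia
using InteractiveUtils

# A fully specialized kernel, analogous to the single-argument case above.
kernel(ptr) = (unsafe_store!(ptr, 1); nothing)

# Capture the generated LLVM IR as a string and scan it for runtime calls.
ir = sprint(code_llvm, kernel, Tuple{Ptr{Int}})
@assert !occursin("jl_f_tuple", ir)     # no boxed tuple construction
@assert !occursin("jl_f_getfield", ir)  # no boxed field access
```

The same `occursin` scan applied to the `Cassette.overdub` signature would flag the bad case, since its IR contains the `jl_f_tuple` call.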
The splat comes from how arguments are passed to/by `overdub`. It would be possible to try to change that, or to add the ability to force specialization of varargs (e.g. https://github.com/JuliaLang/julia/issues/34365#issuecomment-573989610).
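For the second option, the existing surface-level trick (documented in the Julia performance tips; shown here as a standalone sketch, not as Cassette's actual signature) is to constrain the vararg count with `Vararg{Any,N}`, which forces the compiler to specialize on the number and concrete types of the arguments:

```julia
# Plain varargs may be left unspecialized when the arguments are only
# passed through; constraining the count with `where N` forces a
# specialization per argument tuple.
unspecialized(args...) = last(args)
specialized(args::Vararg{Any,N}) where {N} = last(args)

# Both behave identically at the call site:
@assert specialized(Int, C_NULL) === C_NULL
@assert unspecialized(Int, C_NULL) === specialized(Int, C_NULL)
```

Whether that is enough here depends on the codegen side actually using the specialized signature for the vaSlot, which is what the linked issue discusses.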