pxl-th closed this issue 3 months ago
I think this is due to the EnzymeRules for KernelAbstractions not supporting reverse mode yet
Oh, I see. I saw tests in KernelAbstractions for reverse mode and thought that it worked.
the KA custom rule is implemented for any backend in forward mode, and the CPU backend in reverse
I don't actually remember what was needed for reverse GPU support
We needed to precompute the GPU-relevant/interpreted tape size from outside the kernel.
So we need a variant of thunk tape computation that allows for a different device
Actually, is this also the case if I want to differentiate just the kernel (no host code involved)?
nope that would be fine
I see there are tests for reverse for CUDA.jl: https://github.com/EnzymeAD/Enzyme.jl/blob/7d99eec57328329eba693f04aefcdd45f9420e3e/test/cuda.jl#L14
But when I try the same with KA, it errors:
ERROR: return type is Union{}, giving up.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] autodiff_deferred
@ Main ~/.julia/packages/Enzyme/0SYwj/src/Enzyme.jl:456 [inlined]
[3] autodiff_deferred
@ Main ~/.julia/packages/Enzyme/0SYwj/src/Enzyme.jl:442 [inlined]
[4] main2()
@ Main ~/code/t.jl:110
[5] top-level scope
@ REPL[3]:1
[6] top-level scope
@ ~/.julia/packages/CUDA/35NC6/src/initialization.jl:190
using CUDA
using KernelAbstractions
using Enzyme
import KernelAbstractions as KA

@kernel function ker(x)
    i = @index(Global)
    x[i] *= x[i]
end

function main()
    kab = CUDABackend()
    x = KA.ones(kab, Float32, 16)
    dx = KA.ones(kab, Float32, 16)
    Enzyme.autodiff_deferred(Reverse, ker(kab), Duplicated(x, dx))
    return
end
main()
I'm probably doing things incorrectly, but I haven't found the example with KA with just a single kernel... :/
Actually, the test for CUDA.jl also gives this error:
function mul_kernel(A)
    i = threadIdx().x
    if i <= length(A)
        A[i] *= A[i]
    end
    return nothing
end

function main()
    A = CUDA.ones(64)
    dA = CUDA.ones(64)
    autodiff_deferred(Reverse, mul_kernel, Const, Duplicated(A, dA))
    return
end
I'm using CUDA 4.4.1, Enzyme 0.11.7 and Julia 1.10-beta2
So I got confused, but with CUDA.jl, if you wrap it in

function mul_kernel(A)
    i = threadIdx().x
    A[i] *= A[i]
    return nothing
end

function grad(A, dA)
    autodiff_deferred(Reverse, mul_kernel, Duplicated(A, dA))
    return nothing
end

and call @cuda threads=length(A) grad(A, dA), then it works (which is still a bit confusing).
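Assembled into a single sketch (assuming CUDA.jl and Enzyme are loaded; this mirrors the snippets above, not an official example), the pattern is: the `autodiff_deferred` call lives inside a wrapper kernel, and only that wrapper is launched with `@cuda`:

```julia
using CUDA
using Enzyme

# Plain device function to differentiate.
function mul_kernel(A)
    i = threadIdx().x
    A[i] *= A[i]
    return nothing
end

# Wrapper kernel: the reverse-mode call happens entirely in device code.
function grad(A, dA)
    Enzyme.autodiff_deferred(Reverse, mul_kernel, Const, Duplicated(A, dA))
    return nothing
end

A = CUDA.ones(64)
dA = CUDA.ones(64)   # shadow buffer accumulating the adjoint of A
@cuda threads = length(A) grad(A, dA)
```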
But with KernelAbstractions I cannot figure out how to do this. The only example involves host code: https://github.com/JuliaGPU/KernelAbstractions.jl/blob/3165d35b9b707e73d19e7f8fc9f442bafaf415ac/test/extensions/enzyme.jl#L10
Is there a way to AD just the kernel?
@wsmoses, sorry for spamming, but are there any examples with KA not involving host code (just the kernel)?
You should be able to use autodiff_deferred inside the kernel itself (like your grad case). The KA example you showed is for the nicer custom-rules support, but that's only enabled for forward mode in KA.jl right now.
For reverse mode, you'll have to set it up manually, like your mul_kernel above, where the autodiff call is inside the device code entirely.
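In KernelAbstractions terms, that device-side setup might look like the following minimal sketch (kernel and function names here are illustrative, not from the KA test suite; the inner function must be a plain function, not a `@kernel`):

```julia
using KernelAbstractions
using Enzyme
import KernelAbstractions as KA

# Plain device function to differentiate.
@inline function apply!(x, i)
    x[i] *= x[i]
    return nothing
end

# Gradient kernel: autodiff_deferred is called inside device code.
@kernel function grad_kernel!(x, dx)
    i = @index(Global)
    Enzyme.autodiff_deferred(Reverse, apply!, Const, Duplicated(x, dx), Const(i))
end

function run_grad(backend)
    x = KA.ones(backend, Float32, 16)
    dx = KA.ones(backend, Float32, 16)  # shadow accumulating the adjoint of x
    grad_kernel!(backend)(x, dx; ndrange = length(x))
    KA.synchronize(backend)
    return dx
end
```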
autodiff call is inside the device code entirely
Oh, I see! Now it works! A note somewhere in the docs might be useful (unless I missed one). Thanks for the help!
It works for the mul_kernel, however it fails with more complex kernels, for example one using the sin function.
Error:
ERROR: InvalidIRError: compiling MethodInstance for gpu_gker(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::AMDGPU.Device.ROCDeviceVector{Float32, 1}, ::AMDGPU.Device.ROCDeviceVector{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
[1] #sin
@ ~/.julia/dev/AMDGPU/src/device/gcn/math.jl:32
[2] ker
@ ~/code/ZipNerf.jl/t.jl:7
[3] ker
@ ~/code/ZipNerf.jl/t.jl:0
[4] diffejulia_ker_5228_inner_1wrap
@ ~/code/ZipNerf.jl/t.jl:0
[5] macro expansion
@ ~/.julia/packages/Enzyme/VS5jo/src/compiler.jl:9774
[6] enzyme_call
@ ~/.julia/packages/Enzyme/VS5jo/src/compiler.jl:9452
[7] CombinedAdjointThunk
@ ~/.julia/packages/Enzyme/VS5jo/src/compiler.jl:9415
[8] autodiff_deferred
@ ~/.julia/packages/Enzyme/VS5jo/src/Enzyme.jl:372
[9] autodiff_deferred
@ ~/.julia/packages/Enzyme/VS5jo/src/Enzyme.jl:459
[10] autodiff_deferred
@ ~/.julia/packages/Enzyme/VS5jo/src/Enzyme.jl:442
[11] macro expansion
@ ~/code/ZipNerf.jl/t.jl:18
[12] gpu_gker
@ ~/.julia/packages/KernelAbstractions/cWlFz/src/macros.jl:90
[13] gpu_gker
@ ./none:0
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
[1] #sin
@ ~/.julia/dev/AMDGPU/src/device/gcn/math.jl:32
[2] ker
@ ~/code/ZipNerf.jl/t.jl:7
[3] ker
@ ~/code/ZipNerf.jl/t.jl:0
[4] diffejulia_ker_5228_inner_1wrap
@ ~/code/ZipNerf.jl/t.jl:0
...
Code:
using AMDGPU
using KernelAbstractions
using Enzyme
import KernelAbstractions as KA

@inline function ker(x, i)
    x[i] *= sin(x[i])
    return
end

@kernel function fker(x)
    i = @index(Global)
    ker(x, i)
end

@kernel function gker(x, dx)
    i = @index(Global)
    Enzyme.autodiff_deferred(Reverse, ker, Duplicated(x, dx), i)
end

function main()
    kab = ROCBackend()
    x = KA.ones(kab, Float32, 16)
    dx = KA.ones(kab, Float32, 16)
    fker(kab)(x; ndrange=length(x))
    @show x
    gker(kab)(x, dx; ndrange=length(x))
    @show dx
    return
end
Yeah that's the same as https://github.com/EnzymeAD/Enzyme.jl/issues/683
Just curious: is the fix coming relatively soon, or is it more involved?
It's unfortunately more involved.
@aviatesk do you have cycles to help us with the nested abstract interpreter issues?
cc @ChrisRackauckas
@pxl-th the AMDGPU issues are resolved by https://github.com/EnzymeAD/Enzyme.jl/pull/1537
Hi! I'm trying to use a fused kernel compute_α_fused to compute alpha-compositing weights, and use Enzyme to generate the gradient kernel in Reverse mode instead of compute_α. But the compilation fails. Is this an issue with CUDA.jl?
Error:
Code: