Open luciano-drozda opened 2 years ago
I left it in this comment.
I am a bit confused by that code snippet. What is the quantity `z` in the example? In my understanding, we want to take the Jacobian $\frac{\partial s_i}{\partial a_j}$, which would have shape 4×4?
Also, why is the issue closed? The code is not in the introduction notebook.
Hi @renatobellotti ! I'm reopening the issue, as the example indeed remains to be added in the notebook.

`z` is some scalar value that Enzyme expects to find at the end of your computational graph. The kernel `f!(s, a, b)` computing the sum of the two arrays `a` and `b` is just one part of this graph. The adjoint code then computes all the sensitivities $$\overline{v} := \frac{\partial \texttt{z}}{\partial v},$$ where $v$ is any variable of the computational graph.

As an example, below is a full computational graph that ends by computing $$\texttt{z} := ||s||_{2}\ ,$$ where $s \equiv a + b$ was computed by the kernel `f!(s, a, b)`.
You can refer to Baydin et al. (2018) for more on automatic differentiation.
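To make the adjoints concrete, here is a plain-Julia sketch of that graph, with the reverse pass written out by hand from the chain rule (no Enzyme involved; the variable names just mirror the graph above):

```julia
# Forward pass: s = a + b, then z = ||s||₂.
a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
s = a .+ b
z = sqrt(sum(abs2, s))

# Reverse pass: seed z̄ = 1 at the end of the graph, then
# s̄ = ∂z/∂s = s / ||s||₂, and since s = a + b, ā = s̄ and b̄ = s̄.
z̄ = 1.0
s̄ = z̄ .* s ./ z
ā = s̄
b̄ = s̄
```

Every intermediate of the graph (`s`, `a`, `b`) gets its sensitivity with respect to the final scalar `z`, which is exactly what Enzyme's shadow copies hold.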
The best place to add this would be https://github.com/EnzymeAD/Enzyme.jl/tree/main/examples
Contributions welcome!
Thanks for the explanations, @luciano-drozda ! I think I now understand what is going on. The following YouTube video helped me create a simple example: https://www.youtube.com/watch?v=gFfePK44ICk
```julia
using Enzyme

function example!(x, y, z)
    z[1] = 2 * x[1] + y[1]
    z[2] = 3 * x[2] + y[2]
    z[3] = 4 * x[3] + y[3]
    z[4] = 5 * x[4] + y[4]
    return nothing
end

x = [1., 2, 3, 4]
y = [5., 6, 7, 8]
z = zeros(4)

for i ∈ 1:4
    r = zeros(4)
    r[i] = 1.
    dz_dx = zeros(length(x))
    dx = Duplicated(x, dz_dx)
    dy = Const(y)
    dz = Duplicated(z, r)
    autodiff(example!, Const, dx, dy, dz)
    # Seeding the i-th output adjoint makes the shadow copy contain
    # the derivatives ∂z_i/∂x, i.e. the i-th row of the Jacobian.
    @show dx.dval
    # The values contain the actual result of the computation.
    # @show dz
end
```
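As a sanity check on what the seeded shadow should hold: `example!` has the diagonal Jacobian diag(2, 3, 4, 5), and reverse mode with the seed `r = eᵢ` returns the vector-Jacobian product `Jᵀ eᵢ`, i.e. the i-th row of `J`. A pure-Julia sketch of that computation (no Enzyme; `c` is just a name I introduce for the kernel's coefficients):

```julia
# example! computes z_i = c_i * x_i + y_i with c = (2, 3, 4, 5),
# so ∂z_i/∂x_j = c_i when i == j and 0 otherwise.
c = [2.0, 3.0, 4.0, 5.0]
J = zeros(4, 4)
for i in 1:4
    r = zeros(4)
    r[i] = 1.0        # seed the i-th output adjoint, as in the loop above
    x̄ = c .* r        # vector-Jacobian product Jᵀ r
    J[i, :] = x̄
end
```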
Next we convert this code to a kernel as described in the video at about 26:00:
```julia
using KernelAbstractions

@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end

x = [1., 2, 3, 4]
y = [5., 6, 7, 8]
z = zeros(4)

cpu_kernel = example_kernel(CPU(), 4)
event = cpu_kernel(x, y, z, ndrange=4)
wait(event)
@show z;
```
And we use Enzyme to create the autodiff kernel and launch it:
```julia
using KernelGradients
using CUDAKernels
using CUDA

for i ∈ 1:4
    x = cu([1., 2, 3, 4])
    y = cu([5., 6, 7, 8])
    z = cu(zeros(4))
    dz_dx = similar(x)
    fill!(dz_dx, 0)
    dz_dy = similar(y)
    fill!(dz_dy, 0)
    r = zeros(4)
    r[i] = 1.
    r = cu(r)
    dx = Duplicated(x, dz_dx)
    dy = Duplicated(y, dz_dy)
    dz = Duplicated(z, r)
    # Code adapted from: https://www.youtube.com/watch?v=gFfePK44ICk (~26:00)
    gpu_kernel_autodiff = autodiff(example_kernel(CUDADevice()))
    event = gpu_kernel_autodiff(dx, dy, dz, ndrange=4)
    wait(event)
    @show dx.dval
end
```
Feel free to use the example in the docs. Perhaps it helps somebody!
Now I have a question about how to combine this with array programming. I know that I can do a simple reduction on the GPU using `reduce(+, z)`. Is it possible to chain autodiff through a custom kernel and an array operation? For example, can I sum up all entries of `z`, where `z` is the result of the `example_kernel()` described in my previous post?
You can use ChainRules.jl to define a rule for your custom kernel using Enzyme and then use Zygote to differentiate through the array operations.
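For this toy case, the chained reverse pass can be worked by hand, which shows what the composed rule should produce. A sketch (plain Julia, no Zygote or ChainRules; `c` again stands in for the kernel's coefficients):

```julia
# loss(x, y) = sum(z) with z_i = c_i * x_i + y_i.
c = [2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]
z = c .* x .+ y
loss = sum(z)

# Reverse pass: the adjoint of sum seeds z̄ = ones(4);
# the kernel's pullback then maps z̄ to x̄ = c .* z̄ and ȳ = z̄.
z̄ = ones(4)
x̄ = c .* z̄
ȳ = copy(z̄)
```

A custom `rrule` for the kernel would supply exactly the `z̄ → (x̄, ȳ)` step, while Zygote handles the `sum` reduction.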
Thanks for your answer. I still run into problems even with my toy example:
```julia
using ChainRulesCore
using CUDA
using CUDAKernels
using Enzyme
using KernelAbstractions
using KernelGradients
using Zygote

@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end

function calculate_z(x, y; Δx=cu(ones(4)))
    z = cu(zeros(4))
    dz_dx = similar(x)
    fill!(dz_dx, 0)
    r = Δx
    dx = Duplicated(x, dz_dx)
    dy = Const(y)
    dz = Duplicated(z, r)
    gpu_kernel_autodiff = autodiff(example_kernel(CUDADevice()))
    event = gpu_kernel_autodiff(dx, dy, dz, ndrange=4)
    wait(event)
    return dz.val, dx.dval
end

function frule((_, Δx, Δy), ::typeof(calculate_z), x, y)
    return calculate_z(x, y, Δx=Δx)
end

function calculate_loss(x, y)
    z, dz_dx = calculate_z(x, y)
    loss = reduce(+, z)
    return loss
end

x = cu([1., 2, 3, 4])
y = cu([5., 6, 7, 8])

# This works:
# calculate_loss(x, y)

# This does not work:
Zygote.gradient(calculate_loss, x, y)
```
The error message below refers to try/catch statements. However, I am not using any. I assume this is somehow related to the way in which I'm calling the kernel? But ChainRules shouldn't even try to go inside that function because I've defined a forward rule. Might there be a problem with the optional argument?
```
Compiling Tuple{CUDA.var"##context!#63", Bool, typeof(context!), CUDA.var"#194#195"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, UInt32, DataType}, CuContext}: try/catch is not supported.
Refer to the Zygote documentation for fixes.
https://fluxml.ai/Zygote.jl/latest/limitations

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] instrument(ir::IRTools.Inner.IR)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:121
  [3] #Primal#23
    @ ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:205 [inlined]
  [4] Zygote.Adjoint(ir::IRTools.Inner.IR; varargs::Nothing, normalise::Bool)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:322
  [5] _generate_pullback_via_decomposition(T::Type)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/emit.jl:101
  [6] #s2812#1078
    @ ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:28 [inlined]
  [7] var"#s2812#1078"(::Any, ctx::Any, f::Any, args::Any)
    @ Zygote ./none:0
  [8] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core ./boot.jl:580
  [9] _pullback
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:161 [inlined]
 [10] _pullback(::Zygote.Context, ::typeof(context!), ::CUDA.var"#194#195"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, UInt32, DataType}, ::CuContext)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [11] _pullback
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:615 [inlined]
 [12] _pullback(::Zygote.Context, ::typeof(fill!), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Int64)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [13] _pullback
    @ ./In[3]:29 [inlined]
 [14] _pullback(::Zygote.Context, ::var"##calculate_z#2", ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::typeof(calculate_z), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [15] _pullback
    @ ./In[3]:26 [inlined]
 [16] _pullback(::Zygote.Context, ::typeof(calculate_z), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [17] _pullback
    @ ./In[3]:51 [inlined]
 [18] _pullback(::Zygote.Context, ::typeof(calculate_loss), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [19] _pullback(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:34
 [20] pullback(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:40
 [21] gradient(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:75
 [22] top-level scope
    @ In[3]:64
 [23] eval
    @ ./boot.jl:373 [inlined]
 [24] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196
```
You should be defining an `rrule`, since you are using reverse-mode AD.
I replaced the frule with an rrule and get the same error:
```julia
function ChainRulesCore.rrule(::typeof(calculate_z), x, y, z)
    z, ∂z_∂x = calculate_z(x, y)
    function calculate_z_pullback(z̄)
        f̄ = NoTangent()
        x̄ = Tangent(∂z_∂x)
        ȳ = NoTangent()
        z̄ = Tangent(cu([1., 1, 1, 1]))
        return f̄, x̄, ȳ, z̄
    end
    return z, calculate_z_pullback
end
```
You need to define it over `calculate_z`.
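For reference, the shape such a rule needs: `calculate_z(x, y)` takes two array arguments, so the pullback must return exactly three tangents (one for the function itself, one per argument) and should not overwrite its own input `z̄`. A plain-Julia sketch of that structure, with `nothing` standing in for `NoTangent()` and the VJP written by hand for this diagonal kernel (`c` is an assumed name for its coefficients):

```julia
c = [2.0, 3.0, 4.0, 5.0]   # coefficients of the example kernel

function calculate_z_sketch(x, y)
    z = c .* x .+ y
    function calculate_z_pullback(z̄)
        f̄ = nothing       # stands in for NoTangent()
        x̄ = c .* z̄        # Jᵀ z̄ for z_i = c_i * x_i + y_i
        ȳ = copy(z̄)       # ∂z/∂y is the identity
        return f̄, x̄, ȳ
    end
    return z, calculate_z_pullback
end

z, pb = calculate_z_sketch([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
f̄, x̄, ȳ = pb(ones(4))
```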
Thanks for the hint, I adjusted my previous post accordingly. The error persists.
I wrote a complete example: https://github.com/JuliaDiff/ChainRules.jl/issues/665#issuecomment-1220772450
Feel free to take it for the docs!
Could you please add an example with a simple CUDA kernel in the Julia introduction notebook?