EnzymeAD / Enzyme-Tutorial


Add example with CUDA.jl #2

luciano-drozda opened 2 years ago

luciano-drozda commented 2 years ago

Could you please add an example with a simple CUDA kernel to the Julia introduction notebook?

luciano-drozda commented 2 years ago

I've left it in this comment.

renatobellotti commented 2 years ago

> I've left it in this comment.

I am a bit confused by that code snippet. What is the quantity z in the example? In my understanding, we want to take the Jacobian $\frac{\partial s_i}{\partial a_j}$, which would have shape 4×4?

Also, why is the issue closed? The code is not in the introduction notebook.

luciano-drozda commented 2 years ago

Hi @renatobellotti! I'm reopening the issue, since the example indeed still needs to be added to the notebook.

z is some scalar value that Enzyme expects to find at the end of your computational graph.

The kernel f!(s, a, b), which computes the sum of two arrays a and b, is just one part of this computational graph.

The adjoint code then computes all the sensitivities $$\overline{v} := \frac{\partial \texttt{z}}{\partial v},$$ where $v$ is any variable of the computational graph.

As an example, below is a full computational graph that ends by computing $$\texttt{z} := \|s\|_{2},$$ where $s \equiv a + b$ is computed by the kernel f!(s, a, b).

[Figure bwd_diff: backward (adjoint) differentiation through this computational graph]

You can refer to Baydin et al. (2018) for more on automatic differentiation.
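To make this concrete, here is a minimal CPU sketch of such a graph (f! and graph! are illustrative names; the autodiff call follows the same old-style Enzyme signature used in the snippets below):

using Enzyme

# Kernel-like step of the graph: s = a + b, in place.
function f!(s, a, b)
    s .= a .+ b
    return nothing
end

# Full graph: run the kernel, then reduce to the scalar z = ||s||₂.
function graph!(s, a, b)
    f!(s, a, b)
    return sqrt(sum(abs2, s))
end

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
s = zeros(4)

# Shadow arrays receive the sensitivities ā = ∂z/∂a and b̄ = ∂z/∂b.
da, db, ds = zero(a), zero(b), zero(s)
autodiff(graph!, Active, Duplicated(s, ds), Duplicated(a, da), Duplicated(b, db))

@show da  # == s ./ sqrt(sum(abs2, s)), since ∂‖s‖₂/∂a = s / ‖s‖₂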

vchuravy commented 2 years ago

The best place to add this would be https://github.com/EnzymeAD/Enzyme.jl/tree/main/examples

Contributions welcome!

renatobellotti commented 2 years ago

Thanks for the explanations, @luciano-drozda ! I think I now understand what is going on. The following YouTube video helped me to create a simple example: https://www.youtube.com/watch?v=gFfePK44ICk

using Enzyme

function example!(x, y, z)
    z[1] = 2 * x[1] + y[1]
    z[2] = 3 * x[2] + y[2]
    z[3] = 4 * x[3] + y[3]
    z[4] = 5 * x[4] + y[4]
    return nothing
end

x = [1., 2, 3, 4]
y = [5., 6, 7, 8]
z = zeros(4)

for i ∈ 1:4
    r = zeros(4)
    r[i] = 1.
    dz_dx = zeros(length(x));

    dx = Duplicated(x, dz_dx)
    dy = Const(y)
    dz = Duplicated(z, r)

    autodiff(example!, Const, dx, dy, dz)

    # The shadow dx.dval now holds the derivatives ∂z_i/∂x_j, i.e. the i-th
    # row of the Jacobian (diagonal here, so rows and columns coincide).
    @show dx.dval;
    # dz.val contains the actual result of the computation.
    # @show dz
end
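Since the Jacobian of example! is diag(2, 3, 4, 5), the loop should print dx.dval = [2.0, 0.0, 0.0, 0.0], then [0.0, 3.0, 0.0, 0.0], and so on: one row per iteration.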

Next we convert this code to a kernel as described in the video at about 26:00:

using KernelAbstractions

@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end

x = [1., 2, 3, 4]
y = [5., 6, 7, 8]
z = zeros(4)

cpu_kernel = example_kernel(CPU(), 4)
event = cpu_kernel(x, y, z, ndrange=4)
wait(event)
@show z;
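(For these inputs this should print z = [7.0, 12.0, 19.0, 28.0].)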

And we use Enzyme to create the autodiff kernel and launch it:

using KernelGradients
using CUDAKernels
using CUDA

for i ∈ 1:4
    x = cu([1., 2, 3, 4])
    y = cu([5., 6, 7, 8])
    z = cu(zeros(4))

    dz_dx = similar(x)
    fill!(dz_dx, 0)

    dz_dy = similar(y)
    fill!(dz_dy, 0)

    r = zeros(4)
    r[i] = 1.
    r = cu(r)

    dx = Duplicated(x, dz_dx)
    dy = Duplicated(y, dz_dy)
    dz = Duplicated(z, r)

    # Code adapted from: https://www.youtube.com/watch?v=gFfePK44ICk (~26:00)
    gpu_kernel_autodiff = autodiff(example_kernel(CUDADevice()))
    event = gpu_kernel_autodiff(dx, dy, dz, ndrange=4)
    wait(event)
    @show dx.dval;
end

Feel free to use the example in the docs. Perhaps it helps somebody!

renatobellotti commented 2 years ago

Now I have a question about how to combine this with array programming. I know that I can do a simple reduction on the GPU using reduce(+, z). Is it possible to chain autodiff through a custom kernel and an array operation? For example, can I sum up all entries of z, where z is the result of the example_kernel() described in my previous post?

vchuravy commented 2 years ago

You can use ChainRules.jl to define a rule for your custom kernel using Enzyme and then use Zygote to differentiate through the array operations.
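On the CPU version of the example above, the pattern could look roughly like this (a sketch; apply_kernel is an illustrative wrapper around the mutating example! from the earlier post, and the autodiff call uses the same old-style signature as before):

using ChainRulesCore, Enzyme

# Wrap the mutating kernel in an ordinary function that returns its output.
function apply_kernel(x, y)
    z = zero(x)
    example!(x, y, z)
    return z
end

# Register a reverse rule so Zygote treats apply_kernel as a primitive and
# never traces into the kernel internals.
function ChainRulesCore.rrule(::typeof(apply_kernel), x, y)
    z = apply_kernel(x, y)
    function apply_kernel_pullback(z̄)
        x̄, ȳ = zero(x), zero(y)
        # Re-run the kernel under Enzyme with the output shadow seeded by z̄
        # (collect copies it, since Enzyme zeroes the output shadow in place);
        # Enzyme accumulates the VJPs into x̄ and ȳ.
        autodiff(example!, Const,
                 Duplicated(x, x̄), Duplicated(y, ȳ),
                 Duplicated(zero(x), collect(z̄)))
        return NoTangent(), x̄, ȳ
    end
    return z, apply_kernel_pullback
end

# Zygote.gradient((x, y) -> sum(apply_kernel(x, y)), x, y)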

renatobellotti commented 2 years ago

Thanks for your answer. I still run into problems even with my toy example:

using ChainRulesCore
using CUDA
using CUDAKernels
using Enzyme
using KernelAbstractions
using KernelGradients
using Zygote

@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end

function calculate_z(x, y; Δx=cu(ones(4)))
    z = cu(zeros(4))

    dz_dx = similar(x)
    fill!(dz_dx, 0)

    r = Δx

    dx = Duplicated(x, dz_dx)
    dy = Const(y)
    dz = Duplicated(z, r)

    gpu_kernel_autodiff = autodiff(example_kernel(CUDADevice()))
    event = gpu_kernel_autodiff(dx, dy, dz, ndrange=4)
    wait(event)

    return dz.val, dx.dval
end

# The rule must be a method of ChainRulesCore.frule; a bare local function
# named frule would never be seen by any AD system.
function ChainRulesCore.frule((_, Δx, Δy), ::typeof(calculate_z), x, y)
    return calculate_z(x, y, Δx=Δx)
end

function calculate_loss(x, y)
    z, dz_dx = calculate_z(x, y)
    loss = reduce(+, z)
    return loss
end

x = cu([1., 2, 3, 4])
y = cu([5., 6, 7, 8])

# This works:
# calculate_loss(x, y)

# This does not work:
Zygote.gradient(calculate_loss, x, y)

The error message below refers to try/catch statements. However, I am not using any. I assume this is somehow related to the way in which I'm calling the kernel? But ChainRules shouldn't even try to go inside that function because I've defined a forward rule. Might there be a problem with the optional argument?

Compiling Tuple{CUDA.var"##context!#63", Bool, typeof(context!), CUDA.var"#194#195"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, UInt32, DataType}, CuContext}: try/catch is not supported.
Refer to the Zygote documentation for fixes.
https://fluxml.ai/Zygote.jl/latest/limitations

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] instrument(ir::IRTools.Inner.IR)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:121
  [3] #Primal#23
    @ ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:205 [inlined]
  [4] Zygote.Adjoint(ir::IRTools.Inner.IR; varargs::Nothing, normalise::Bool)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/reverse.jl:322
  [5] _generate_pullback_via_decomposition(T::Type)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/emit.jl:101
  [6] #s2812#1078
    @ ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:28 [inlined]
  [7] var"#s2812#1078"(::Any, ctx::Any, f::Any, args::Any)
    @ Zygote ./none:0
  [8] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core ./boot.jl:580
  [9] _pullback
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:161 [inlined]
 [10] _pullback(::Zygote.Context, ::typeof(context!), ::CUDA.var"#194#195"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, UInt32, DataType}, ::CuContext)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [11] _pullback
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:615 [inlined]
 [12] _pullback(::Zygote.Context, ::typeof(fill!), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Int64)
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [13] _pullback
    @ ./In[3]:29 [inlined]
 [14] _pullback(::Zygote.Context, ::var"##calculate_z#2", ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::typeof(calculate_z), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [15] _pullback
    @ ./In[3]:26 [inlined]
 [16] _pullback(::Zygote.Context, ::typeof(calculate_z), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [17] _pullback
    @ ./In[3]:51 [inlined]
 [18] _pullback(::Zygote.Context, ::typeof(calculate_loss), ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface2.jl:0
 [19] _pullback(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:34
 [20] pullback(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:40
 [21] gradient(::Function, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
    @ Zygote ~/.julia/packages/Zygote/D7j8v/src/compiler/interface.jl:75
 [22] top-level scope
    @ In[3]:64
 [23] eval
    @ ./boot.jl:373 [inlined]
 [24] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196
vchuravy commented 2 years ago

You should be defining an rrule, since you are using reverse-mode AD.

renatobellotti commented 2 years ago

I replaced the frule with an rrule and get the same error:

function ChainRulesCore.rrule(::typeof(calculate_z), x, y, z)
    z, ∂z_∂x = calculate_z(x, y)

    function calculate_z_pullback(z̄)
        f̄ = NoTangent()
        x̄ = Tangent(∂z_∂x)
        ȳ = NoTangent()
        z̄ = Tangent(cu([1., 1, 1, 1]))

        return f̄, x̄, ȳ, z̄
    end

    return z, calculate_z_pullback
end
vchuravy commented 2 years ago

You need to define it over calculate_z
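For reference, a minimal sketch of such a rule (assuming a wrapper that returns only z, since calculate_z above returns the tuple (z, dz_dx); the elementwise product is valid because example_kernel's Jacobian is diagonal, and y is left untracked because it was passed as Const to Enzyme):

using ChainRulesCore

z_only(x, y) = first(calculate_z(x, y))  # illustrative wrapper

function ChainRulesCore.rrule(::typeof(z_only), x, y)
    # With the default seed Δx = ones(4), dz_dx[i] = ∂z[i]/∂x[i].
    z, dz_dx = calculate_z(x, y)
    function z_only_pullback(z̄)
        x̄ = dz_dx .* z̄  # (∂z/∂x)ᵀ z̄, elementwise for a diagonal Jacobian
        return NoTangent(), x̄, NoTangent()
    end
    return z, z_only_pullback
end

# Zygote.gradient((x, y) -> sum(z_only(x, y)), x, y) then uses this rule.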

renatobellotti commented 2 years ago

Thanks for the hint, I adjusted my previous post accordingly. The error persists.

renatobellotti commented 2 years ago

I wrote a complete example: https://github.com/JuliaDiff/ChainRules.jl/issues/665#issuecomment-1220772450

Feel free to take it for the docs!