Is it currently possible to chain AD calls over multiple GPU kernels? I'm trying to implement this on a toy problem and can't figure out how to do it.
```julia
using ChainRulesCore
using CUDA
using CUDAKernels
using Enzyme
using KernelAbstractions
using KernelGradients
using Zygote
@kernel function example_kernel(x, y, z)
    i = @index(Global)
    if i == 1
        z[i] = 2 * x[i] + y[i]
    elseif i == 2
        z[i] = 3 * x[i] + y[i]
    elseif i == 3
        z[i] = 4 * x[i] + y[i]
    elseif i == 4
        z[i] = 5 * x[i] + y[i]
    end
    nothing
end

@kernel function example_kernel2(z, a, result)
    i = @index(Global)
    result[i] = 3 * z[i] + a[i]
    nothing
end
# Launch the two kernels back to back, synchronising between them.
function my_call!(x, y, a)
    z = cu(zeros(Float32, 4))
    result = cu(zeros(Float32, 4))
    kernel = example_kernel(CUDADevice())
    kernel2 = example_kernel2(CUDADevice())
    event = kernel(x, y, z, ndrange=4)
    wait(event)
    event = kernel2(z, a, result, ndrange=4)
    wait(event)
    return result
end
# Example inputs for the calls below:
x = cu(ones(Float32, 4))
y = cu(ones(Float32, 4))
a = cu(ones(Float32, 4))

dx = Duplicated(x, cu(zeros(Float32, 4)))
dy = Const(y)
dz = Duplicated(cu(zeros(Float32, 4)), cu(ones(Float32, 4)))  # seed for z, but never passed to the call below
da = Const(a)

Enzyme.autodiff_deferred(my_call!, Const, dx, dy, da)
```
Running the code above fails and crashes the Jupyter kernel. Here is the error message:
How can I run AD over multiple kernels? My goal is to call multiple kernels to calculate a vector, then calculate a scalar loss function and optimise it with gradient descent.
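For concreteness, this is the kind of chaining I have in mind: differentiate each kernel on its own and launch the adjoints in reverse order, seeding the shadow of the final output. Below is a minimal sketch of that idea, assuming KernelGradients provides a differentiated kernel via `Enzyme.autodiff(kernel)` that is launched with `Duplicated`/`Const` arguments (I have not managed to get this working end to end):

```julia
# Sketch only: assumes Enzyme.autodiff(kernel) from KernelGradients returns a
# differentiated kernel that is launched like a normal kernel, with
# Duplicated/Const arguments.
fwd_kernel = example_kernel(CUDADevice())
grad_kernel = Enzyme.autodiff(example_kernel(CUDADevice()))
grad_kernel2 = Enzyme.autodiff(example_kernel2(CUDADevice()))

z = cu(zeros(Float32, 4))
dz = cu(zeros(Float32, 4))          # shadow of the intermediate
result = cu(zeros(Float32, 4))
dresult = cu(ones(Float32, 4))      # seed with dL/dresult
dx_shadow = cu(zeros(Float32, 4))   # will receive dL/dx

# Forward pass for the first kernel so z holds its primal values.
wait(fwd_kernel(x, y, z, ndrange=4))

# Reverse passes in reverse kernel order: the second kernel's adjoint
# accumulates dL/dz into dz, then the first kernel's adjoint turns dz
# into dL/dx.
wait(grad_kernel2(Duplicated(z, dz), Const(a), Duplicated(result, dresult), ndrange=4))
wait(grad_kernel(Duplicated(x, dx_shadow), Const(y), Duplicated(z, dz), ndrange=4))

# dx_shadow should now hold dL/dx.
```

If this per-kernel approach is not the intended pattern, a pointer to the right way to compose these (for example via ChainRulesCore rules so Zygote can handle the scalar loss) would be much appreciated.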