wsmoses closed this 1 week ago
I don't get why this is needed. We construct Broadcasted
objects in other places, so why is this one problematic?
The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object]. As a result it's impossible to mark some arguments as differentiable and others as non-differentiable [leading to errors when differentiating x .+ y, where x is constant and y is differentiable].
Preserving the separate arguments remedies this.
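To make the activity mismatch concrete, here is a minimal CPU-side sketch (a hypothetical example, not the PR's code; Const and Duplicated are Enzyme's activity wrappers) of what keeping the arguments separate allows:

```julia
using Enzyme

# A map!-like kernel written with separate arguments rather than a
# single captured broadcasted object.
f!(out, x, y) = (out .= x .+ y; nothing)

x    = ones(4)    # constant input
y    = ones(4)    # differentiable input
dy   = zeros(4)   # shadow for y
out  = zeros(4)
dout = ones(4)    # seed for the output

# Each argument carries its own activity annotation:
Enzyme.autodiff(Reverse, f!,
                Duplicated(out, dout),
                Const(x),            # x marked non-differentiable
                Duplicated(y, dy))   # y marked differentiable

# dy now holds the pullback contribution for y (all ones here).
# If x and y were instead bundled into one broadcasted object,
# a single activity annotation would have to cover both of them.
```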
The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object].
Right, and we do that for all broadcasts. So why is this only a problem for map!?
As you said in https://github.com/JuliaGPU/GPUArrays.jl/issues/535:
The implementation of map! (https://github.com/JuliaGPU/GPUArrays.jl/blob/ec9fe5b6f7522902e444c95a0c9248a4bc55d602/src/host/broadcast.jl#L120C46-L120C59) creates a broadcasted object which captures all of the arguments to map!. This is then passed to the kernel.
FWIW, if this kind of pattern is an issue for Enzyme, you'll run into it a lot, because passing arguments through structures is how closures work in Julia.
It may not just be a problem for map, but map is for sure at a critical juncture.
In particular, Enzyme often doesn't care if things are put into closures, if it gets optimized out (and potentially separated/etc).
One significant issue here, however, is that the custom rule for the kernel call forces us to see the entire closure at the Julia level [and thus we can't separate the args out even if we wanted to]. Since this is what creates the kernel that gets called, I'm not sure how to resolve the issue without this change.
It also isn't a problem if all the variables are differentiable [but in the context of a broadcast it is much more common to write x .+ y where one isn't differentiable].
Doing something like this might have other downstream benefits as well. For example, arrays being passed separately may permit alias information to be propagated, whereas indirection via a closure would likely drop alias info.
bump @maleadt for thoughts?
CI failures are related, so I figured you were still working on this.
Enzyme often doesn't care if things are put into closures
I don't get your subsequent reasoning. Enzyme obviously seems to care here, and I don't understand what the difference is between the broadcast kernel, and the following pattern we use all over CUDA.jl:
function outer(x::CuArray)
    function kernel()
        # do something with x, captured through the kernel closure
        return
    end
    @cuda kernel()
end
The kernel closure, capturing x, is pretty similar to a Broadcasted object capturing arguments and being passed to a function.
Oh yeah, sorry, I wanted to finish discussing before continuing work, to make sure the strategy was fine with you.
Ah, I was assuming the closure was within the CUDA kernel. Your case, however, doesn't present an issue.
The problem Enzyme has is creating a data structure with one variable being a differentiable CuArray and a second variable being a non-differentiable CuArray, and that data structure being passed into a custom rule (specifically the custom rule for the CUDA kernel launch).
Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray. It'll be only one or the other.
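For illustration, a hypothetical CPU-side sketch of the problematic shape (the Captured struct is invented here, standing in for the broadcasted object): once both arrays live in one structure, a single activity annotation must cover the pair.

```julia
using Enzyme

# Stand-in for a broadcasted object capturing both kernel inputs.
struct Captured{A,B}
    x::A   # intended to be constant
    y::B   # intended to be differentiable
end

g!(out, c) = (out .= c.x .+ c.y; nothing)

c    = Captured(ones(4), ones(4))
out  = zeros(4)
dout = ones(4)    # seed for the output

# The whole structure gets one activity. To differentiate c.y we must
# shadow the entire object, so c.x is shadowed too, even though it was
# meant to be constant:
dc = Captured(zeros(4), zeros(4))
Enzyme.autodiff(Reverse, g!, Duplicated(out, dout), Duplicated(c, dc))
# dc.y holds the wanted gradient, but dc.x was dragged along as well.
```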
Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray.
Yeah, that was just for simplicity of the example; we often capture more than one input. So while I'm not particularly opposed to this workaround, even though I do dislike it, you'll be running into other situations like this. I would assume the situation will also arise when, e.g., passing tuples of CuArrays?
Partially it depends on context. It's really common for one to differentiate an active x .+ an inactive y [e.g. one is constant, the other is not], but doing a tuple kernel where these are user-specified different activities feels sufficiently uncommon that I'm willing to figure that out when we get there.
@maleadt mind giving permission to run CI on the PR?
Had to revert:
julia> using ForwardDiff, CUDA
julia> x = CUDA.ones(4,4)
4×4 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
julia> y = ForwardDiff.Dual.(x, true)
ERROR: GPU compilation of MethodInstance for (::GPUArrays.var"#35#37")(::CUDA.CuKernelContext, ::CuDeviceMatrix{…}, ::Int64, ::CUDA.CuArrayStyle{…}, ::Type{…}, ::Tuple{…}, ::Base.Broadcast.Extruded{…}, ::Bool) failed
KernelError: passing and using non-bitstype argument
Argument 6 to your kernel function is of type Type{ForwardDiff.Dual}, which is not isbits
Attempt fix for https://github.com/JuliaGPU/GPUArrays.jl/issues/535
I think this should be cost-free (in fact, it should ideally slightly improve alias analysis etc.).
@maleadt @vchuravy