JuliaGPU / GPUArrays.jl

Reusable array functionality for Julia's various GPU backends.

Splat gpu launch #539

Closed · wsmoses closed 1 week ago

wsmoses commented 1 month ago

Attempted fix for https://github.com/JuliaGPU/GPUArrays.jl/issues/535

I think this should be cost-free (in fact, it should ideally slightly improve alias analysis etc.).

@maleadt @vchuravy

maleadt commented 1 month ago

I don't get why this is needed. We construct Broadcasted objects in other places, so why is this one problematic?

wsmoses commented 3 weeks ago

The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object]. As a result it's impossible to mark some arguments as differentiable and others as non-differentiable [leading to errors when differentiating x .+ y, where x is constant and y is differentiable].

Preserving the separate arguments remedies this.
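
For concreteness, a rough sketch of the mixed-activity case (the names and the Enzyme call below are illustrative assumptions, not code from this PR):

using Enzyme, CUDA

f!(dest, x, y) = (dest .= x .+ y; nothing)

x     = CUDA.rand(16)    # constant input: no derivative requested
y     = CUDA.rand(16)    # differentiated input
dy    = CUDA.zeros(16)   # shadow of y
dest  = CUDA.zeros(16)
ddest = CUDA.ones(16)    # seed for the output

# Activity is declared per argument at the call boundary ...
Enzyme.autodiff(Reverse, f!, Const,
                Duplicated(dest, ddest), Const(x), Duplicated(y, dy))
# ... but inside the broadcast machinery x and y get captured together in a
# single Broadcasted object, so the kernel launch only sees one argument
# mixing constant and duplicated data.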

maleadt commented 3 weeks ago

The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object].

Right, and we do that for all broadcasts. So why is this only a problem for map!, as you said in https://github.com/JuliaGPU/GPUArrays.jl/issues/535:

The implementation of map! (https://github.com/JuliaGPU/GPUArrays.jl/blob/ec9fe5b6f7522902e444c95a0c9248a4bc55d602/src/host/broadcast.jl#L120C46-L120C59) creates a broadcasted object which captures all of the arguments to map!. This is then passed to the kernel.

FWIW, if this kind of pattern is an issue for Enzyme, you'll run into it a lot, because passing arguments through structures is how closures work in Julia.
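
(For reference, a tiny CPU-only illustration of that point: a closure is lowered to a callable struct whose fields are the captured variables.)

function make_closure(x)
    inner() = x .+ 1   # `inner` captures `x`
    return inner
end

c = make_closure([1.0, 2.0])
fieldnames(typeof(c))  # (:x,) -- the captured variable becomes a struct field
c.x                    # [1.0, 2.0]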

wsmoses commented 3 weeks ago

It may not be just a problem for map, but map is definitely at a critical juncture.

In particular, Enzyme often doesn't care if things are put into closures, as long as the closure gets optimized away (and its captured values potentially separated out, etc.).

One significant issue here, however, is that the custom rule for the kernel call forces us to see the entire closure at the Julia level [and thus we can't separate the args out even if we wanted to]. Since this is what creates the kernel that gets called, I'm not sure of a way to resolve the issue without this change.

The issue also doesn't arise if all the variables are differentiable [but in the context of a broadcast it is much more common to write x .+ y where one of them isn't differentiable].

Doing something like this might also have other downstream benefits. For example, arrays being passed separately may permit alias information to be propagated, whereas indirection via a closure would likely drop it.
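
To make the intent concrete, here's a rough, untested sketch of the general idea (a hypothetical standalone kernel, not the actual change in this PR): the function and the broadcast arguments travel as separate kernel parameters, and the Broadcasted object is only rebuilt on the device.

using CUDA

function splat_kernel(dest, f, args...)
    # rebuild the Broadcasted object from the separately passed arguments
    bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(f, args...))
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dest)
        @inbounds dest[i] = bc[i]
    end
    return
end

x = CUDA.ones(32); y = CUDA.rand(32); dest = similar(x)
@cuda threads=32 splat_kernel(dest, +, x, y)

With that shape, dest, x and y are each their own kernel argument, so a rule can annotate their activities individually.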

wsmoses commented 2 weeks ago

bump @maleadt for thoughts?

maleadt commented 2 weeks ago

CI failures are related, so I figured you were still working on this.

Enzyme often doesn't care if things are put into closures

I don't get your subsequent reasoning. Enzyme obviously seems to care here, and I don't understand what the difference is between the broadcast kernel, and the following pattern we use all over CUDA.jl:

function outer(x::CuArray)
    function kernel()
        # do something with x, captured through the kernel closure
        return
    end
    @cuda kernel()
end

The kernel closure, capturing x, is pretty similar to a Broadcasted object capturing arguments and being passed to a function.
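
For comparison, the Broadcasted object is indeed just such a capturing struct:

bc = Base.Broadcast.broadcasted(+, [1.0, 2.0], [3.0, 4.0])
bc.f     # +
bc.args  # ([1.0, 2.0], [3.0, 4.0]) -- the inputs stored as fields of one object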

wsmoses commented 2 weeks ago

Oh yeah, sorry, I wanted to finish discussing before continuing work, to make sure the strategy was fine with you.

Ah, I was assuming the closure was within the CUDA kernel. However, your case doesn't present an issue.

The problem Enzyme has is a data structure where one variable is a differentiable CuArray and a second variable is a non-differentiable CuArray, with that data structure being passed into a custom rule (specifically, the custom rule for the CUDA kernel launch).

Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray. It'll be only one or the other.

maleadt commented 2 weeks ago

Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray.

Yeah, that was just for simplicity of the example; we often capture more than one input. So while I'm not particularly opposed to this workaround, even though I do dislike it, you'll be running into other situations like this. I would assume the situation will also arise when, e.g., passing tuples of CuArrays?

wsmoses commented 2 weeks ago

Partially it depends on context. It's really common for one to differentiate an active x .+ an inactive y [e.g. one is constant, the other is not], but a kernel taking a tuple whose elements have user-specified different activities feels sufficiently uncommon that I'm willing to figure that out when we get there.

wsmoses commented 2 weeks ago

@maleadt mind giving permission to run CI on the PR?

maleadt commented 1 week ago

Had to revert:

julia> using ForwardDiff, CUDA

julia> x = CUDA.ones(4,4)
4×4 CuArray{Float32, 2, CUDA.DeviceMemory}:
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0

julia> y = ForwardDiff.Dual.(x, true)
ERROR: GPU compilation of MethodInstance for (::GPUArrays.var"#35#37")(::CUDA.CuKernelContext, ::CuDeviceMatrix{…}, ::Int64, ::CUDA.CuArrayStyle{…}, ::Type{…}, ::Tuple{…}, ::Base.Broadcast.Extruded{…}, ::Bool) failed
KernelError: passing and using non-bitstype argument

Argument 6 to your kernel function is of type Type{ForwardDiff.Dual}, which is not isbits