JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

[FR] Mixed eltype dot products #982

Open marius311 opened 3 years ago

marius311 commented 3 years ago

These don't seem to currently work (CUDA 3.3):

using CUDA, LinearAlgebra

CuVector{Float32}(undef,10) ⋅ CuVector{Complex{Float32}}(undef,10)
CuVector{Float32}(undef,10) ⋅ CuVector{Float64}(undef,10)
# other mixed eltypes, etc...

as they fall back to the generic implementation, which triggers scalar indexing. It would be nice to have these implemented, even as a simple sum(conj.(x) .* y) or something similar, which at least works on the GPU.
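For concreteness, here's a minimal sketch of such a fallback (the helper name is made up; this is not the actual CUDA.jl implementation), just a broadcast plus a reduction so everything stays on the device:

using CUDA, LinearAlgebra

# hypothetical helper, not a CUDA.jl API: broadcast then reduce,
# i.e. two kernels and one temporary array, but no scalar indexing
mixed_dot(x::CuVector, y::CuVector) = sum(conj.(x) .* y)

CUDA.allowscalar(false)
mixed_dot(CuVector{Float32}(undef,10), CuVector{ComplexF32}(undef,10))  # runs entirely on the GPU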

Maybe worth mentioning: I didn't really care about these until upgrading Zygote 0.6.11 -> 0.6.12, since a recent change (bisected down to https://github.com/FluxML/Zygote.jl/pull/973) seems to make Zygote emit such dot products where previously it didn't. Here's an example which triggers scalar indexing after that commit but not before:

using CUDA, Zygote, LinearAlgebra
CUDA.allowscalar(false)

x = cu(rand(10))
y = complex(cu(rand(10)))

Zygote.gradient(1) do A
    norm(Diagonal(A .* x) * y)
end

Anyway, that part isn't really relevant for CUDA.jl, but I figured it might provide some context.

marius311 commented 3 years ago

(with a little guidance on the strategy, I could probably hazard a PR myself if it'd be easier)

maleadt commented 3 years ago

Couple of things:

In addition, with dot we could add a fallback method (i.e. without element-type constraints) that just does a'*b, falling back to the well-optimized GEMM implementation. On the other hand, when GEMM fails to select a fast implementation it'll use the horribly slow GPUArrays.jl one, in which case sum(a.*b) might be a better fallback (that uses two kernels and an intermediate allocation, so there's quite some overhead).
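Purely as an illustration of those two strategies (names and signatures made up, not CUDA.jl's actual code; note that for plain vectors a'*b itself lowers back to dot, so the GEMM route is spelled out with explicit reshapes here):

using CUDA, LinearAlgebra

# Route 1: express the dot product as a 1×n times n×1 matrix product so it
# goes through the GEMM dispatch; with matching supported eltypes that hits
# CUBLAS, otherwise the slow generic GPUArrays.jl kernel.
function gemm_dot(a::CuVector, b::CuVector)
    c = reshape(conj.(a), 1, :) * reshape(b, :, 1)  # 1×1 result, stays on the GPU
    sum(c)                                          # extract the scalar without scalar indexing
end

# Route 2: broadcast + reduction, two kernels plus an intermediate array,
# but it always runs on the GPU regardless of the element types.
bcast_dot(a::CuVector, b::CuVector) = sum(conj.(a) .* b)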