JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

[FR] Mixed eltype dot products #982

Open marius311 opened 3 years ago

marius311 commented 3 years ago

These don't seem to currently work (CUDA 3.3):

using CUDA, LinearAlgebra

CuVector{Float32}(undef,10) ⋅ CuVector{Complex{Float32}}(undef,10)
CuVector{Float32}(undef,10) ⋅ CuVector{Float64}(undef,10)
# other mixed eltypes, etc...

as they fall back to the generic implementation, which triggers scalar indexing. It would be nice to have these implemented, even as a simple sum(conj.(x) .* y) or something similar, which at least works on the GPU.
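For concreteness, here's a minimal sketch of such a fallback (the helper name is made up; this is not the actual CUDA.jl implementation), just a broadcast plus a reduction so everything stays on the device:

using CUDA, LinearAlgebra

# hypothetical helper, not a CUDA.jl API: broadcast then reduce,
# i.e. two kernels and one temporary array, but no scalar indexing
mixed_dot(x::CuVector, y::CuVector) = sum(conj.(x) .* y)

CUDA.allowscalar(false)
mixed_dot(CuVector{Float32}(undef,10), CuVector{ComplexF32}(undef,10))  # runs entirely on the GPU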

Maybe worth mentioning: I didn't really care about these until upgrading Zygote 0.6.11 -> 0.6.12, since a recent change (bisected down to https://github.com/FluxML/Zygote.jl/pull/973) seems to make Zygote emit such dot products where previously it didn't. Here's an example which triggers scalar indexing after that commit but not before:

using CUDA, Zygote, LinearAlgebra
CUDA.allowscalar(false)

x = cu(rand(10))
y = complex(cu(rand(10)))

Zygote.gradient(1) do A
    norm(Diagonal(A .* x) * y)
end

Anyway, that part isn't really relevant for CUDA.jl, but I figured it might provide some context.

marius311 commented 3 years ago

(with a little guidance on the strategy, I could probably hazard a PR myself if it'd be easier)

maleadt commented 3 years ago

Couple of things:

In addition, with dot we could add a fallback method (i.e. without element-type constraints) that just does a'*b, falling back to the well-optimized GEMM implementation. On the other hand, when GEMM fails to select a fast implementation it'll use the horribly slow GPUArrays.jl one, in which case sum(a.*b) might be a better fallback (that uses two kernels and an intermediate allocation, so there's quite some overhead).
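Purely as an illustration of those two strategies (names and signatures made up, not CUDA.jl's actual code; note that for plain vectors a'*b itself lowers back to dot, so the GEMM route is spelled out with explicit reshapes here):

using CUDA, LinearAlgebra

# Route 1: express the dot product as a 1×n times n×1 matrix product so it
# goes through the GEMM dispatch; with matching supported eltypes that hits
# CUBLAS, otherwise the slow generic GPUArrays.jl kernel.
function gemm_dot(a::CuVector, b::CuVector)
    c = reshape(conj.(a), 1, :) * reshape(b, :, 1)  # 1×1 result, stays on the GPU
    sum(c)                                          # extract the scalar without scalar indexing
end

# Route 2: broadcast + reduction, two kernels plus an intermediate array,
# but it always runs on the GPU regardless of the element types.
bcast_dot(a::CuVector, b::CuVector) = sum(conj.(a) .* b)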