JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

CUDA.@atomic does not support complex variables #1994

Open albertomercurio opened 1 year ago

albertomercurio commented 1 year ago

A minimal working example showing the problem:

using CUDA
CUDA.allowscalar(false)

function sum_kernel(a, b, n)
    # grid-stride loop: each thread handles every `stride`-th element
    tid = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while tid <= n
        # accumulate atomically into the one-element output;
        # the commented CUDA.@atomic variant fails in the same way
        # CUDA.@atomic b[] += a[tid]
        CUDA.atomic_add!(CUDA.pointer(b, 1), a[tid])
        tid += stride
    end
    return
end

function sum2(a)
    n = length(a)
    b = CuArray([zero(a[1])])
    dev_a = CuArray(a)
    @cuda threads=256 blocks=(n + 255) ÷ 256 sum_kernel(dev_a, b, n)
    r = Array(b)[1]
    return r
end

a = rand(Float64, 1000)
sum2(a) ≈ sum(a) # is true

a = rand(ComplexF64, 1000)
sum2(a)

which returns the following error:

InvalidIRError: compiling MethodInstance for sum_kernel(::CuDeviceVector{ComplexF64, 1}, ::CuDeviceVector{ComplexF64, 1}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)
Stacktrace:
 [1] sum_kernel
   @ ./In[21]:6
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erroneous code with Cthulhu.jl

It fails with both CUDA.@atomic and CUDA.atomic_add!.

maleadt commented 1 year ago
CUDA.atomic_add!(CUDA.pointer(b, 1), a[tid])

The atomic_add! family of functions maps directly onto hardware features, as the docstring mentions, so it is expected that types like Complex are not supported.

@atomic could support it if the logic in https://github.com/JuliaGPU/CUDA.jl/blob/bb37b50006295833d5396d1c7b330eec55b408e4/src/device/intrinsics/atomics.jl#L204-L208 were extended (https://github.com/JuliaLang/julia/pull/47116 can probably help simplify that logic), but you would still be limited to the maximum width of atomic operations that your hardware supports. That means ComplexF64 currently cannot be supported by CUDA.@atomic.
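For this particular reduction there is a way to stay within that width limit: the real and imaginary sums are independent, so they can be accumulated with two separate Float64 atomics instead of a single 128-bit one. A minimal sketch (function names are just illustrative; it assumes the device supports Float64 atomic additions and reinterprets the one-element ComplexF64 buffer as two Float64s on the host):

using CUDA

function sum_kernel_split(a, b_parts, n)
    tid = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while tid <= n
        # two 64-bit atomics instead of one 128-bit atomic
        CUDA.@atomic b_parts[1] += real(a[tid])
        CUDA.@atomic b_parts[2] += imag(a[tid])
        tid += stride
    end
    return
end

function sum_split(a)
    n = length(a)
    dev_a = CuArray(a)
    b = CUDA.zeros(ComplexF64, 1)
    # view the single ComplexF64 as [re, im] without copying
    b_parts = reinterpret(Float64, b)
    @cuda threads=256 blocks=cld(n, 256) sum_kernel_split(dev_a, b_parts, n)
    return Array(b)[1]
end

This only works because complex addition is component-wise; it does not help for updates that need the full 128-bit value to change atomically.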

It may be possible to make our generic fallback handle data types that are wider than the atomics supported by the hardware, e.g. by taking a lock, but that would be slow, tricky to implement (i.e., avoiding deadlocks on hardware without a forward-progress guarantee), and would require a different API, as we can't easily allocate a global-memory lock automatically from kernel code.

albertomercurio commented 1 year ago

What is the workaround for now? How can I reduce a complex vector in a CUDA kernel? I need this for a future implementation in a more complex kernel.

maleadt commented 1 year ago

How can I reduce a complex vector in a CUDA kernel?

Our mapreduce kernel does not use atomic operations. You should structure your reduction similarly so that it doesn't require atomics, or pass a lock-like variable to your kernel to protect the 128-bit data.
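To illustrate the first suggestion on the sum from the original example, here is a minimal sketch of an atomics-free reduction (names are illustrative; it assumes a fixed block size of 256 and finishes the reduction over the per-block partial sums on the host):

using CUDA

function sum_kernel_noatomics(a, partial, n)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x

    # grid-stride accumulation into a thread-local value
    acc = zero(ComplexF64)
    while i <= n
        acc += a[i]
        i += stride
    end

    # block-level tree reduction in shared memory
    # (assumes blockDim().x == 256, a power of two)
    shmem = CuStaticSharedArray(ComplexF64, 256)
    shmem[tid] = acc
    sync_threads()
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            shmem[tid] += shmem[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # one ComplexF64 result per block; no atomics needed
    if tid == 1
        partial[blockIdx().x] = shmem[1]
    end
    return
end

function sum_noatomics(a)
    n = length(a)
    dev_a = CuArray(a)
    blocks = cld(n, 256)
    partial = CUDA.zeros(ComplexF64, blocks)
    @cuda threads=256 blocks=blocks sum_kernel_noatomics(dev_a, partial, n)
    return sum(Array(partial))  # finish the reduction on the host
end

Each block reduces its grid-stride partial sums in shared memory and writes a single ComplexF64 per block, so the 128-bit element type never has to be updated atomically.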