FluxML / NNlib.jl

Neural Network primitives with multiple backends

SVectors on GPU CuArray cannot use index from CuArray #507

Open AhmedSalih3d opened 1 year ago

AhmedSalih3d commented 1 year ago

Hi!

Using Flux, I want to scatter the following with NNlib:

using Flux, CUDA, StaticArrays

NNlib.scatter(+, [SVector(1,1,1),SVector(1,1,1),SVector(1,1,1)], [3,1,2])

3-element Vector{SVector{3, Int64}}:
 [1, 1, 1]
 [1, 1, 1]
 [1, 1, 1]

This works without a problem. If I change the source array (the middle argument) to a CuArray, it still works, but I found it slow for large arrays (60k elements):

NNlib.scatter(+, CuArray([SVector(1,1,1),SVector(1,1,1),SVector(1,1,1)]), [3,1,2])

3-element CuArray{SVector{3, Int64}, 1, CUDA.Mem.DeviceBuffer}:
 [1, 1, 1]
 [1, 1, 1]
 [1, 1, 1]

If I try to do everything on GPU:

NNlib.scatter(+, CuArray([SVector(1,1,1),SVector(1,1,1),SVector(1,1,1)]), CuArray([3,1,2]))

ERROR: InvalidIRError: compiling kernel #scatter_kernel!(typeof(+), CuDeviceVector{SVector{3, Int64}, 1}, CuDeviceVector{SVector{3, Int64}, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_cas!)

I think this is a bug?

More info: https://discourse.julialang.org/t/how-to-reduce-an-array/92945/14

Kind regards

AhmedSalih3d commented 1 year ago

I also checked by passing only CuArrays and launching the kernel directly:

@cuda NNlibCUDA.scatter_kernel!(+,DST,S,I)

ERROR: InvalidIRError: compiling kernel #scatter_kernel!(typeof(+), CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_cas!)

ToucheSir commented 1 year ago

A 3-element SVector of Float32 is too wide a type to be updated atomically. Notice how https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions describes atomic functions as operating "on one 32-bit or 64-bit word". An SVector{3, Float32} takes at least 3 × 32 = 96 bits in memory, so there is no way to update it atomically.
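To make the size argument concrete, here is a small check of the bit widths involved. It runs on the CPU only, and `bitwidth` is a hypothetical helper defined just for this illustration, not part of NNlib or CUDA.jl:

```julia
using StaticArrays

# Hypothetical helper: storage size of a type in bits.
bitwidth(T) = 8 * sizeof(T)

bitwidth(Float32)               # 32  -> fits one 32-bit word
bitwidth(Int64)                 # 64  -> fits one 64-bit word
bitwidth(SVector{3, Float32})   # 96  -> wider than a 64-bit word
bitwidth(SVector{3, Int64})     # 192 -> wider than a 64-bit word
```

Both element types from the reports above exceed 64 bits, which matches the compile failure at `atomic_cas!`.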

The only way short of writing your own custom locking/lock-free algorithm would be to update each component individually, as you eventually settled on in the Discourse thread. Alternatively, if the update function you pass to scatter operates element-wise, reinterpret your length-N SVector array as a 3×N 2D array, since we support multidimensional inputs.
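A minimal sketch of the reinterpret approach, run on the CPU so it can be checked without a GPU. The variable names and sample data are made up for illustration. On the GPU the same idea would presumably be `reshape(reinterpret(Float32, cu_src), 3, :)` with a CuArray index, but that variant is an untested assumption here:

```julia
using NNlib, StaticArrays

# Illustrative data: N = 4 source vectors scattered into 3 destination slots.
src = [SVector(1, 1, 1), SVector(2, 2, 2), SVector(3, 3, 3), SVector(4, 4, 4)]
idx = [3, 1, 2, 1]

# View the N SVectors as a 3×N matrix of scalars, so every update touches
# a single Int rather than a whole SVector. `collect` copies into a plain Matrix.
src_mat = collect(reshape(reinterpret(Int, src), 3, :))

# idx indexes the trailing dimension: column j of src_mat is added
# into column idx[j] of the result.
dst_mat = NNlib.scatter(+, src_mat, idx)

# Convert the 3×3 result back into a vector of SVectors.
dst = collect(reinterpret(SVector{3, Int}, vec(dst_mat)))
```

Slot 1 receives the second and fourth source vectors, so `dst[1]` is `[6, 6, 6]`, while `dst[2]` and `dst[3]` are `[3, 3, 3]` and `[1, 1, 1]`. This only works because `+` acts on each component independently; an update function that needs the whole SVector at once cannot be split this way.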