Open AhmedSalih3d opened 1 year ago
And I just checked by using all CuArray and checking the exact kernel:
@cuda NNlibCUDA.scatter_kernel!(+, DST, S, I)
ERROR: InvalidIRError: compiling kernel #scatter_kernel!(typeof(+), CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{SVector{3, Float32}, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_cas!)
A 3-element SVector of Float32 is too wide a type to be updated atomically. Notice how https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions says atomic functions operate "on one 32-bit or 64-bit word". An SVector{3, Float32} occupies at least 3 × 32 = 96 bits in memory, so there is no way to update it atomically.
Short of writing your own custom locking or lock-free algorithm, the only way around this is to update each component individually, as you eventually settled on in the Discourse thread. Alternatively, if your scatter update function operates element-wise, reinterpret your N-length SVector array as a 3×N 2D array, since we support multidimensional inputs.
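For illustration, a minimal CPU-side sketch of the reinterpret approach might look like this (the array names S, I, DST are assumptions, chosen to match the kernel call quoted above):

```julia
using NNlib, StaticArrays

# Hypothetical data: N source values and N destination indices.
N = 5
S = [SVector{3, Float32}(rand(Float32, 3)...) for _ in 1:N]
I = rand(1:3, N)

# View the Vector{SVector{3, Float32}} as a 3×N Float32 matrix (no copy).
S2 = reinterpret(reshape, Float32, S)

# scatter(+) accumulates along the last dimension; each column is summed
# element-wise, so on the GPU only 32-bit Float32 atomics are needed.
DST = NNlib.scatter(+, S2, I)   # 3×3 Float32 matrix
```

On the GPU, the same shape trick should apply after materializing the reinterpreted view as a CuArray (e.g. `CuArray(S2)`), since the element type is then plain Float32.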
Hi!
Using the package Flux, I want to scatter the following using NNlib:
This works without problem. If I change the mid array to a CuArray, it also works, but I tested that it is slow for large arrays (60k elements):
If I try to do everything on GPU:
This produces what I think is an error?
More info: https://discourse.julialang.org/t/how-to-reduce-an-array/92945/14
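For context, a minimal sketch of the all-GPU attempt that triggers the failure discussed above (the names, sizes, and index range here are assumptions, mirroring the kernel signature in the error message):

```julia
using CUDA, NNlib, NNlibCUDA, StaticArrays

N = 10
S   = CuArray([SVector{3, Float32}(rand(Float32, 3)...) for _ in 1:N])  # source values
I   = CuArray(rand(1:4, N))                                             # destination indices
DST = CUDA.zeros(SVector{3, Float32}, 4)                                # destination buffer

# Fails on the GPU: the scatter kernel needs an atomic update per
# destination slot, and a 96-bit SVector{3, Float32} exceeds the
# 32/64-bit width that CUDA atomics support.
NNlib.scatter!(+, DST, S, I)
```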
Kind regards