Open b-fg opened 12 months ago

Together with @weymouth we are trying to create a kernel that loops over an n-dimensional array and applies a function to each element. While we can certainly do so, the speedup we observe when comparing @kernel ("KA") and non-@kernel ("serial") implementations is very different depending on the array slice we want to access. This is probably related to Julia being column-major, but the difference is striking here, and KA does not perform as well as the serial version. Here is a simple MWE that demonstrates this; it has been run with julia -t 1 to force a single thread and draw comparisons between the KA and serial implementations. There is also an additional GPU test added for comparison, where the same issue is detected. The timings are:

Is there something wrong in the MWE? Could this be done differently? It would be nice to learn about this. Thanks!
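A minimal sketch of the kind of comparison described, assuming a kernel that applies a function over a CartesianIndices slice (the names _apply!, apply_ka!, apply_serial! and the slices R1/R2 are illustrative reconstructions, not the original MWE):

using KernelAbstractions

# KA version: one work-item per index of the slice, offset into the array
@kernel function _apply!(f, a, offset)
    I = @index(Global, Cartesian)
    a[I + offset] = f(a[I + offset])
end

function apply_ka!(f, a, R::CartesianIndices)
    offset = first(R) - oneunit(first(R))  # shift so the kernel covers R
    _apply!(CPU(), 64)(f, a, offset; ndrange=size(R))
end

# serial version: a plain loop over the same indices
function apply_serial!(f, a, R::CartesianIndices)
    for I in R
        a[I] = f(a[I])
    end
end

N = 128
a = rand(N, N, N)
R1 = CartesianIndices((1:1, 1:N, 1:N))  # slice with a singleton first dimension
R2 = CartesianIndices((1:N, 1:N, 1:1))  # slice with a singleton last dimension

@time apply_ka!(abs2, a, R1)      # the slow case discussed below
@time apply_serial!(abs2, a, R1)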
using KernelAbstractions

@kernel function kern()
    I = @index(Global, Cartesian)
    @show I
end
kern(CPU(), 64)(ndrange=(2, 3, 4))
I = CartesianIndex(1, 1, 1)
I = CartesianIndex(2, 1, 1)
I = CartesianIndex(1, 2, 1)
I = CartesianIndex(2, 2, 1)
I = CartesianIndex(1, 3, 1)
I = CartesianIndex(2, 3, 1)
I = CartesianIndex(1, 1, 2)
I = CartesianIndex(2, 1, 2)
I = CartesianIndex(1, 2, 2)
I = CartesianIndex(2, 2, 2)
I = CartesianIndex(1, 3, 2)
I = CartesianIndex(2, 3, 2)
I = CartesianIndex(1, 1, 3)
I = CartesianIndex(2, 1, 3)
I = CartesianIndex(1, 2, 3)
I = CartesianIndex(2, 2, 3)
I = CartesianIndex(1, 3, 3)
I = CartesianIndex(2, 3, 3)
I = CartesianIndex(1, 1, 4)
I = CartesianIndex(2, 1, 4)
I = CartesianIndex(1, 2, 4)
I = CartesianIndex(2, 2, 4)
I = CartesianIndex(1, 3, 4)
I = CartesianIndex(2, 3, 4)
Seems to follow the same iteration order as:
for i in CartesianIndices((2, 3, 4))
    @show i
end
i = CartesianIndex(1, 1, 1)
i = CartesianIndex(2, 1, 1)
i = CartesianIndex(1, 2, 1)
i = CartesianIndex(2, 2, 1)
i = CartesianIndex(1, 3, 1)
i = CartesianIndex(2, 3, 1)
i = CartesianIndex(1, 1, 2)
i = CartesianIndex(2, 1, 2)
i = CartesianIndex(1, 2, 2)
i = CartesianIndex(2, 2, 2)
i = CartesianIndex(1, 3, 2)
i = CartesianIndex(2, 3, 2)
i = CartesianIndex(1, 1, 3)
i = CartesianIndex(2, 1, 3)
i = CartesianIndex(1, 2, 3)
i = CartesianIndex(2, 2, 3)
i = CartesianIndex(1, 3, 3)
i = CartesianIndex(2, 3, 3)
i = CartesianIndex(1, 1, 4)
i = CartesianIndex(2, 1, 4)
i = CartesianIndex(1, 2, 4)
i = CartesianIndex(2, 2, 4)
i = CartesianIndex(1, 3, 4)
i = CartesianIndex(2, 3, 4)
One thing to note: your kern(CPU(), 64) is equivalent to a workgroupsize of (64, 1, 1). So I am not surprised that R1 = CartesianIndices((1:1, 1:N, 1:N)) is slow; for that I would expect you to need (1, 64, 1).
Say what now? This isn't split up automatically?
Well, I thought I had documented that clearly, but I can't seem to find it... Take a look at https://juliagpu.github.io/KernelAbstractions.jl/stable/examples/performance/. The workgroupsize is also a tuple where you provide the dimensions of the workgroup.
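Concretely, the two launch forms look something like this (a sketch; kern is the kernel from above and N is a placeholder size):

# scalar workgroupsize fills only the first dimension, i.e. (64, 1, 1)
kern(CPU(), 64)(ndrange=(1, N, N))

# tuple workgroupsize: put the 64 along the dimension where the ndrange is long
kern(CPU(), (1, 64, 1))(ndrange=(1, N, N))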
Ok. That explains everything. We'll need to redo our macro to make the workgroup size adaptive. While I'm at it, is there a way to predict the best workgroup size to use?
Understood @vchuravy! Thanks for clarifying.
A bit more on this: it looks like if we try to evaluate the required workgroupsize at runtime using

workgroupsize = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))

and then pass it to the kernel as an argument, the resulting kernels are much slower. On the other hand, hardcoding it to 64 or (64, 1, 1) results in much faster computation. Is there a specific reason why this might be happening?
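For reference, a sketch of the two launch patterns being compared (kern and R stand in for the actual kernel and index range):

R = CartesianIndices((258, 258, 1))

# workgroupsize computed at runtime: 64 along the longest axis, 1 elsewhere
ws = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))
kern(CPU(), ws)(ndrange=size(R))

# hardcoded workgroupsize
kern(CPU(), (64, 1, 1))(ndrange=size(R))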
Edit: Actually, I have seen that the macro that generates the kernel sometimes fails to produce the expected workgroupsize. For example, when size(R) = (258, 258, 1) it results in (1, 1, 1) (it should be (64, 64, 1)), and this is why it is slow. So I believe this is not a KA problem, but rather the way we are generating the kernels in the macro... cc @weymouth.
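For what it's worth, a minimal helper along those intended lines (a sketch assuming the goal is 64 along every non-singleton axis, capped at the axis length; this guesses at the macro's contract, not the actual implementation):

# 1 along singleton axes, up to 64 along the others (cap is an assumption)
workgroupsize(R::CartesianIndices) = map(n -> n == 1 ? 1 : min(n, 64), size(R))

workgroupsize(CartesianIndices((258, 258, 1)))  # (64, 64, 1)
workgroupsize(CartesianIndices((1, 258, 258)))  # (1, 64, 64)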