Open b-fg opened 12 months ago

Together with @weymouth we are trying to create a kernel that loops over an n-dimensional array and applies a function to each element. While we can certainly do so, the speedup we observe when comparing @kernel ("KA") and non-@kernel ("serial") implementations is very different depending on the array slice we want to access. This is probably related to Julia being column-major, but the difference is striking here, and KA does not perform as well as the serial version. Here is a simple MWE that demonstrates this; it has been run with julia -t 1 to force a single thread and draw comparisons between the KA and serial implementations. There is also an additional GPU test added for comparison, where the same issue is detected. The timings are:

Is there something wrong in the MWE? Could this be done differently? It would be nice to learn about this. Thanks!
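A minimal sketch of the kind of comparison described, assuming a kernel that applies a function over a CartesianIndices slice (the names _apply!, apply_ka!, apply_serial! and the slices R1/R2 are illustrative reconstructions, not the original MWE):

using KernelAbstractions

# KA version: one work-item per index of the slice, offset into the array
@kernel function _apply!(f, a, offset)
    I = @index(Global, Cartesian)
    a[I + offset] = f(a[I + offset])
end

function apply_ka!(f, a, R::CartesianIndices)
    offset = first(R) - oneunit(first(R))  # shift so the kernel covers R
    _apply!(CPU(), 64)(f, a, offset; ndrange=size(R))
end

# serial version: a plain loop over the same indices
function apply_serial!(f, a, R::CartesianIndices)
    for I in R
        a[I] = f(a[I])
    end
end

N = 128
a = rand(N, N, N)
R1 = CartesianIndices((1:1, 1:N, 1:N))  # slice with a singleton first dimension
R2 = CartesianIndices((1:N, 1:N, 1:1))  # slice with a singleton last dimension

@time apply_ka!(abs2, a, R1)      # the slow case discussed below
@time apply_serial!(abs2, a, R1)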
using KernelAbstractions

@kernel function kern()
    I = @index(Global, Cartesian)
    @show I
end
kern(CPU(), 64)(ndrange=(2, 3, 4))
I = CartesianIndex(1, 1, 1)
I = CartesianIndex(2, 1, 1)
I = CartesianIndex(1, 2, 1)
I = CartesianIndex(2, 2, 1)
I = CartesianIndex(1, 3, 1)
I = CartesianIndex(2, 3, 1)
I = CartesianIndex(1, 1, 2)
I = CartesianIndex(2, 1, 2)
I = CartesianIndex(1, 2, 2)
I = CartesianIndex(2, 2, 2)
I = CartesianIndex(1, 3, 2)
I = CartesianIndex(2, 3, 2)
I = CartesianIndex(1, 1, 3)
I = CartesianIndex(2, 1, 3)
I = CartesianIndex(1, 2, 3)
I = CartesianIndex(2, 2, 3)
I = CartesianIndex(1, 3, 3)
I = CartesianIndex(2, 3, 3)
I = CartesianIndex(1, 1, 4)
I = CartesianIndex(2, 1, 4)
I = CartesianIndex(1, 2, 4)
I = CartesianIndex(2, 2, 4)
I = CartesianIndex(1, 3, 4)
I = CartesianIndex(2, 3, 4)
Seems to follow the same iteration order as:
for i in CartesianIndices((2, 3, 4))
    @show i
end
i = CartesianIndex(1, 1, 1)
i = CartesianIndex(2, 1, 1)
i = CartesianIndex(1, 2, 1)
i = CartesianIndex(2, 2, 1)
i = CartesianIndex(1, 3, 1)
i = CartesianIndex(2, 3, 1)
i = CartesianIndex(1, 1, 2)
i = CartesianIndex(2, 1, 2)
i = CartesianIndex(1, 2, 2)
i = CartesianIndex(2, 2, 2)
i = CartesianIndex(1, 3, 2)
i = CartesianIndex(2, 3, 2)
i = CartesianIndex(1, 1, 3)
i = CartesianIndex(2, 1, 3)
i = CartesianIndex(1, 2, 3)
i = CartesianIndex(2, 2, 3)
i = CartesianIndex(1, 3, 3)
i = CartesianIndex(2, 3, 3)
i = CartesianIndex(1, 1, 4)
i = CartesianIndex(2, 1, 4)
i = CartesianIndex(1, 2, 4)
i = CartesianIndex(2, 2, 4)
i = CartesianIndex(1, 3, 4)
i = CartesianIndex(2, 3, 4)
One thing to note: your kern(CPU(), 64) is equivalent to a workgroupsize of (64, 1, 1). So I am not surprised that R1 = CartesianIndices((1:1, 1:N, 1:N)) is slow; for that I would expect you to need (1, 64, 1).
Say what now? This isn't split up automatically?
Well, I thought I had documented that clearly, but I can't seem to find it... Take a look at https://juliagpu.github.io/KernelAbstractions.jl/stable/examples/performance/. The workgroupsize is also a tuple where you provide the dimensions of the workgroup.
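Concretely, the two launch forms look something like this (a sketch; kern is the kernel from above and N is a placeholder size):

# scalar workgroupsize fills only the first dimension, i.e. (64, 1, 1)
kern(CPU(), 64)(ndrange=(1, N, N))

# tuple workgroupsize: put the 64 along the dimension where the ndrange is long
kern(CPU(), (1, 64, 1))(ndrange=(1, N, N))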
Ok. That explains everything. We'll need to redo our macro to make the workgroup size adaptive. While I'm at it, is there a way to predict the best workgroup size to use?
Understood @vchuravy! Thanks for clarifying.
A bit more on this: it looks like if we try to evaluate the required workgroupsize at runtime using

workgroupsize = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))

and then pass it to the kernel as an argument, the resulting kernels are much slower. On the other hand, hardcoding it to 64 or (64, 1, 1) results in much faster computation. Is there a specific reason why this might be happening?
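For reference, a sketch of the two launch patterns being compared (kern and R stand in for the actual kernel and index range):

R = CartesianIndices((258, 258, 1))

# workgroupsize computed at runtime: 64 along the longest axis, 1 elsewhere
ws = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))
kern(CPU(), ws)(ndrange=size(R))

# hardcoded workgroupsize
kern(CPU(), (64, 1, 1))(ndrange=size(R))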
Edit: Actually, I have seen that the macro that generates the kernel sometimes fails to produce the expected workgroupsize. For example, when size(R) = (258, 258, 1) it results in (1, 1, 1) (it should be (64, 64, 1)), and this is why it is slow. So I believe this is not a KA problem, but rather the way we are generating the kernels in the macro... cc @weymouth.
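For what it's worth, a minimal helper along those intended lines (a sketch assuming the goal is 64 along every non-singleton axis, capped at the axis length; this guesses at the macro's contract, not the actual implementation):

# 1 along singleton axes, up to 64 along the others (cap is an assumption)
workgroupsize(R::CartesianIndices) = map(n -> n == 1 ? 1 : min(n, 64), size(R))

workgroupsize(CartesianIndices((258, 258, 1)))  # (64, 64, 1)
workgroupsize(CartesianIndices((1, 258, 258)))  # (1, 64, 64)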