Closed GNiendorf closed 1 year ago
hi @GNiendorf, I'm not aware of any limitations imposed by alpaka on the grid and block size.
From a quick test on my laptop, I can run with a block size of (1, 32, 32) threads per block:
Testing VectorAddKernel3D with vector indices with a grid of (5, 5, 1) blocks x (1, 32, 32) threads x (1, 1, 1) elements...
success
Would you have a way to reproduce the issue that we could try and look into?
I think it may have something to do with register usage actually, since I noticed there are other kernels that run fine with that same thread size. I'll have to look into it more to see what's going on, since I don't have a way to reproduce this issue with standalone code.
@GNiendorf If you compile with CMake you can add the parameter -Dalpaka_CUDA_SHOW_REGISTER=ON (or pass -Xcuda-ptxas=-v to the CXX compiler) and nvcc will show the register, compile-time shared memory, and stack frame usage per kernel.
This information can be put into the CUDA occupancy calculator (xls sheet) to analyze the limiter.
Hi @GNiendorf, is this still an issue?
I didn't get a chance to look into this further, but this was also on a very old version of Alpaka (0.7 I think). This is no longer an issue for our code so I'll close it.
I run into an invalid configuration error when trying to run a kernel with a thread size of (1, 32, 32) with Alpaka using the CUDA backend, whereas that same kernel launched using just CUDA runs fine. If I reduce the thread size (I've tried (1, 16, 16) so far) it runs fine using Alpaka. Does Alpaka place stricter limits on the maximum number of threads per block?