Closed GNiendorf closed 1 year ago
hi @GNiendorf, I'm not aware of any limitations imposed by alpaka on the grid and block size.
From a quick test on my laptop, I can run with a block size of (1, 32, 32) threads per block:
Testing VectorAddKernel3D with vector indices with a grid of (5, 5, 1) blocks x (1, 32, 32) threads x (1, 1, 1) elements...
success
Would you have a way to reproduce the issue that we could try and look into?
I think it may have something to do with register usage actually, since I noticed there are other kernels that run fine with that same thread size. I'll have to look into it more to see what's going on, since I don't have a way to reproduce this issue with standalone code.
@GNiendorf If you compile with CMake you can add the parameter -Dalpaka_CUDA_SHOW_REGISTER=ON (or pass -Xcuda-ptxas=-v to the CXX compiler) and nvcc will show the register, compile-time shared memory, and stack frame usage per kernel.
This information can be put into the CUDA occupancy calculator (xls sheet) to analyze the limiter.
Hi @GNiendorf, is this still an issue?
I didn't get a chance to look into this further, but this was also on a very old version of Alpaka (0.7 I think). This is no longer an issue for our code so I'll close it.
I run into an invalid configuration error when trying to run a kernel with a thread size of (1, 32, 32) with Alpaka using the CUDA backend, whereas that same kernel launched using just CUDA runs fine. If I reduce the thread size (I've tried (1, 16, 16) so far) it runs fine using Alpaka. Does Alpaka place stricter limits on the maximum number of threads per block?