cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue #776

Open hfp opened 3 months ago

hfp commented 3 months ago
This PR seems to cause:

CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue.

(Tested on an H100 device.)

Originally posted by @hfp in https://github.com/cp2k/dbcsr/issues/767#issuecomment-2034752764

alazzaro commented 3 months ago

According to the CUDA description:

cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.

But we don't call printf in any kernel (all calls are masked out). And I don't understand why we see this problem only on H100...

hfp commented 3 months ago

I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

alazzaro commented 3 months ago

> I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

That makes sense...

hfp commented 3 months ago

Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined in order to reproduce the issue.
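To illustrate why a build with NDEBUG defined would not show the error, here is a minimal sketch in plain C of an error-checking call macro that is switched by NDEBUG. It is not DBCSR's actual ACC_API_CALL: the names CHECKED_CALL, fake_device_set_limit, fake_error_t, and run_demo are hypothetical stand-ins. The sketch also assumes that a release build still performs the call and merely skips the check; the real macro may behave differently.

```c
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not CUDA or DBCSR symbols. */
typedef enum { FAKE_SUCCESS = 0, FAKE_ERROR_INVALID_VALUE = 1 } fake_error_t;

/* Counts the failures reported by the checking macro. */
static int reported_failures = 0;

/* Stub that always fails, standing in for a runtime call such as DeviceSetLimit. */
static fake_error_t fake_device_set_limit(size_t bytes) {
  (void)bytes;
  return FAKE_ERROR_INVALID_VALUE;
}

#if !defined(NDEBUG)
/* Debug build: perform the call and report any failure. */
#define CHECKED_CALL(func, args) \
  do { \
    fake_error_t result_ = func args; \
    if (result_ != FAKE_SUCCESS) { \
      fprintf(stderr, "API error: %s failed with error %d (%s:%d)\n", \
              #func, (int)result_, __FILE__, __LINE__); \
      ++reported_failures; \
    } \
  } while (0)
#else
/* Release build (assumption): perform the call but ignore its result. */
#define CHECKED_CALL(func, args) \
  do { \
    (void)(func args); \
  } while (0)
#endif

/* Issues one checked call and returns how many failures were reported. */
int run_demo(void) {
  reported_failures = 0;
  CHECKED_CALL(fake_device_set_limit, ((size_t)1024));
  return reported_failures;
}
```

Compiled without -DNDEBUG, run_demo() prints one diagnostic to stderr and returns 1. Compiled with -DNDEBUG, under the assumption stated above, the failing call goes unnoticed and run_demo() returns 0.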

alazzaro commented 2 months ago

Let's leave this ticket open... I think the issue occurs when the RT fails to build a kernel, but I'm not sure...

alazzaro commented 2 months ago

(Taking over from https://github.com/cp2k/dbcsr/pull/777#issuecomment-2059160289)

> I think we can move the call to a more convenient place...

> What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

> I wonder if the code in question should be removed entirely?

I am starting to think this is the right solution... but I need more time to investigate it (see my previous comment).