cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0
135 stars 46 forks source link

acc: Always set active device before accessing a GPU #684

Closed oschuett closed 1 year ago

oschuett commented 1 year ago

I've had for a while the problem that some CP2K regtests would fail with cudaErrorInvalidResourceHandle when I ran on two GPUs. I believe this is because the active device gets changed between calls to DBCSR.

Presumably, most users are not affect by this because Slurm et al. set CUDA_VISIBLE_DEVICES to pin GPUs to CPU sockets.

dev-zero commented 1 year ago

Presumably, most users are not affect by this because Slurm et al. set CUDA_VISIBLE_DEVICES to pin GPUs to CPU sockets.

You would certainly wish for that, but no. The default slurm setting with gres is rather primitive and the more advanced one via nvml, rsmi or openapi which would do this affinity pinning is only available if slurm is built against the respective gpu runtime.

hfp commented 1 year ago

I think this does not happen anymore. Rank-device combination was stable in recent versions. However and as Ole said, multi-GPU did not even work in the past (let alone d2d comm), and I used this script in the past to prefix the CP2K-executable (it uses environment variables exposed by OpenMPI or MPICH/2 to round-robin assign devices). On the other hand, a change in DBCSR can be useful to affinitize devices if the job manager lacks configuration (or is not used). It can be intriguing without tools to find which socket a rank belong to, and which device belongs to a socket, etc.

oschuett commented 1 year ago

The default slurm setting with gres is rather primitive...

Interesting! Then it's indeed strange that not more users have run into this.

What do you think about this PR? Instead of setting the active device before every call to acc_interface_* we could just do it once at the beginning of dbcsr_multiply.

alazzaro commented 1 year ago

@oschuett I will take a look at the PR next week and hopefully close everything to make a new DBCSR release...

alazzaro commented 1 year ago

OK, I think this is a reasonable change and it is going in the direction of the issue https://github.com/cp2k/dbcsr/issues/205