cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0
134 stars 45 forks source link

Allow tunring off DBCSR ACC with env variable #801

Closed abussy closed 3 weeks ago

abussy commented 1 month ago

This PR allows turning off DBCSR GPU acceleration at initialization, even if the library is compiled with the -D__DBCSR_ACC flag. This can be done by setting the environment variable DBCSR_TURN_OFF_ACC=1, or by passing a logical to the dbcsr_set_config subroutine (this will be useful for a future CP2K keyword).

Most changes take place in the dbcsr_config.F file. There, a new function called use_acc() is defined. It returns a logical, and replaces the has_acc variable used in the rest of the code.

The PR was validated by successfully running the CP2K regtests on a GPU machine, with and without the DBCSR_TURN_OFF_ACC=1 environment variable.

Generally, the ability of turning off GPU acceleration at runtime should help with debugging, and with edge cases where DBCSR ACC under-performs (see #795).

jenkins-cscs commented 1 month ago

Can one of the admins verify this patch?

hfp commented 1 month ago

With the environment variable DBCSR_N_STACKS=0 existing already, does it make this PR superfluous or do you still want it?

alazzaro commented 1 month ago

We still need it since we still move data to the gpu and run transpose kernel.

Thanks @abussy , I will review the PR next week.

abussy commented 3 weeks ago

Gentle ping to review this PR :) It would be nice if it makes it to CP2K's next release

alazzaro commented 3 weeks ago

thanks for the PR. I will start the review, but first let me put here some general comments for my understanding of the proposed changes:

  1. we are replacing has_acc boolean parameter with use_acc function
  2. has_acc is used within the use_acc and when setting the default for mm_dense
  3. We use the new conf flag TURN_OFF_ACC to disable the GPU when DBCSR is compiled for GPU

OK, the first "logical" problem is TURN_OFF_ACC has to be TRUE to disable the GPU. I would prefer the opposite meaning, so we can use RUN_ON_GPU so it is TRUE by default whenever has_acc is TRUE and people can set to FALSE if you don't want to run on the GPU (we are not really "turning off" the GPU, simply we are not running on it...).

In terms of output, I will keep has_acc when we output the configuration.

The last problem is that it must forbidden to change this value multiple times within a DBCSR run, otherwise will get configuration problems (we are replacing a macro at compile time with something at runtime). This functionality is not available in the library, I need to think about (in any case this is not relevant to your PR).

abussy commented 3 weeks ago

Sure, using a positive boolean like RUN_ON_GPU makes sense, I agree.

Concerning the danger of changing this value multiple times during a run, I think it should be safe as it is. The use_acc() function is defined during the DBCSR configuration step.

alazzaro commented 3 weeks ago

Concerning the danger of changing this value multiple times during a run, I think it should be safe as it is. The use_acc() function is defined during the DBCSR configuration step.

The dbcsr_set_config is a public function, so people can call it after the init. Actually, this is what CP2K does. That means we can call it once with GPU_RUN disabled, then run a multiplication, then call set_config and set GPU_RUN to true and at this point I'm not sure what it will happen... We need to protect for this case, as said this is not part of this PR (I will open an issue as a reminder).

abussy commented 3 weeks ago

I implemented the vast majority of the suggested changes: