cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

Generic kernel #812

Closed · alazzaro closed this 3 months ago

alazzaro commented 3 months ago

This PR introduces a generic (untuned) kernel for the ACC, used when a tuned kernel is not present. That pushes the computation to the ACC (previously it fell back to the CPU, with a big performance penalty).

The output changes accordingly, e.g.:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops    13 x    13 x    13                43940       0.0%      0.0%    100.0%     
 flops     5 x    13 x    13              7351500       0.0%      0.0%    100.0% (*) 
 flops    13 x     5 x    13              7351500       0.0%      0.0%    100.0% (*) 
 flops    13 x    13 x     5              7351500       0.0%      0.0%    100.0% (*) 
 flops    18 x    13 x    13             26404560       0.0%      0.0%    100.0% (*) 
 flops    13 x    18 x    13             26404560       0.0%      0.0%    100.0% (*) 
 flops    13 x    13 x    18             26404560       0.0%      0.0%    100.0% (*) 
 flops     5 x     5 x    13           1229962500       0.0%      0.0%    100.0% (*) 
 flops     5 x    13 x     5           1229962500       0.0%      0.0%    100.0% (*) 
 flops    13 x     5 x     5           1229962500       0.0%      0.0%    100.0% (*) 
 flops    18 x     5 x    13           4417686000       0.0%      0.0%    100.0% (*) 
 flops     5 x    18 x    13           4417686000       0.0%      0.0%    100.0% (*) 
 flops     5 x    13 x    18           4417686000       0.0%      0.0%    100.0% (*) 
 flops    18 x    13 x     5           4417686000       0.0%      0.0%    100.0% (*) 
 flops    13 x     5 x    18           4417686000       0.0%      0.0%    100.0% (*) 
 flops    13 x    18 x     5           4417686000       0.0%      0.0%    100.0% (*) 
 flops    18 x    18 x    13          15867109440       0.0%      0.0%    100.0% (*) 
 flops    18 x    13 x    18          15867109440       0.0%      0.0%    100.0% (*) 
 flops    13 x    18 x    18          15867109440       0.0%      0.0%    100.0% (*) 
 flops     5 x     5 x     5         205782187500       0.0%      0.0%    100.0%     
 flops    18 x     5 x     5         739112850000       0.0%      0.0%    100.0% (*) 
 flops     5 x     5 x    18         739112850000       0.0%      0.0%    100.0% (*) 
 flops     5 x    18 x     5         739112850000       0.0%      0.0%    100.0% (*) 
 flops    18 x     5 x    18        2654689464000       0.0%      0.0%    100.0% (*) 
 flops    18 x    18 x     5        2654689464000       0.0%      0.0%    100.0% (*) 
 flops     5 x    18 x    18        2654689464000       0.0%      0.0%    100.0% (*) 
 flops    18 x    18 x    18        9534912226560       0.0%      0.0%    100.0%     

 *** WARNING in dbcsr_mm_sched.F:606 :: (*) ACC Untuned kernels, consider ***
 *** to run the tuning procedure                                          ***

*** WARNING in dbcsr_mm_sched.F:618  :: Some kernels are running on the   ***
*** CPU, consider to run the ACC tuning procedure for them                ***
alazzaro commented 3 months ago

For the record, I've disabled the Daint CI for the moment (Daint will disappear soon) and changed the jenkins-cscs role in https://github.com/cp2k/dbcsr/settings/access to "Triage" (the "Write" role is required to trigger the CI).

alazzaro commented 3 months ago

It turns out that a generic kernel is hard to get right without the proper tuning procedure (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-47248377_report.txt for some of the failing kernels). Since we do test all generated kernels, the new proposed workflow is:

  1. check if a tuned kernel exists and use it
  2. if it doesn't exist, use the generic kernel, if it works
  3. if the generic kernel doesn't work, fall back to the CPU (the previous behavior)

As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. 55x55x14, 55x55x26, 38x38x2).

In the future, I would consider switching to cuBLAS for case 3. For the moment, I think this is a step forward.

hfp commented 3 months ago

As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. 55x55x14, 55x55x26, 38x38x2).

I think this is due to an implementation that is inappropriate for larger kernels, e.g., too many registers used. I believe this goes along with the long JIT compilation of such kernels, i.e., the compiler goes crazy trying to avoid/implement spilling the excess registers. A way to circumvent this is to implement a max size inside the kernel and to branch into a different flavor for larger kernels.

In the future, I would consider switching to cuBLAS for case 3. For the moment, I think this is a step forward.

Yes, agree. Though I have not implemented calling MKL for GPUs in the case of the OpenCL backend. However, the OpenCL backend validates all kernel sizes up to the static maximum we have set for all GPUs.

hfp commented 3 months ago

That pushes the computation to the ACC (previously it fell back to the CPU, with a big performance penalty).

As a note for other readers, the big penalty is due to the data already being uploaded to the GPU, rather than per-se slow CPU performance.

alazzaro commented 3 months ago

As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. 55x55x14, 55x55x26, 38x38x2).

I think this is due to an implementation that is inappropriate for larger kernels, e.g., too many registers used. I believe this goes along with the long JIT compilation of such kernels, i.e., the compiler goes crazy trying to avoid/implement spilling the excess registers. A way to circumvent this is to implement a max size inside the kernel and to branch into a different flavor for larger kernels.

Yep, your analysis is correct. I've decided to add a limit on the kernel size (no dimension > 50). Otherwise, the compiler cannot compile (we get an error that the PTX cannot be loaded):

CUDA DRIVER API ERROR: ModuleLoadDataEx failed with error CUDA_ERROR_INVALID_PTX (/opt/cp2k/exts/dbcsr/src/acc/libsmm_acc/libsmm_acc.cpp::181)

My speculation is that the JIT will mostly fail for large kernels. The entire procedure needs more checking.

A side note: it is risky to use kernels from previous architectures unless we test them on the new one.

hfp commented 3 months ago

CUDA DRIVER API ERROR: ModuleLoadDataEx failed with error CUDA_ERROR_INVALID_PTX (/opt/cp2k/exts/dbcsr/src/acc/libsmm_acc/libsmm_acc.cpp::181)

... and you can be very happy about this error. The worst case is that it compiles for a long time (much longer than normal) and produces broken code. Hats off to Nvidia's toolchain for knowing when it failed!

My speculation is that the JIT will mostly fail for large kernels. The entire procedure needs more checking.

Yes, there is remaining risk, but it is not necessarily attributable to JIT compilation; it's generally the same toolchain as the offline compiler. However, larger kernels really need an implementation that intrinsically limits the register usage; the branch to decide about its use is cheap.

A side note: it is risky to use kernels from previous architectures unless we test them on the new one.

The CUDA/HIP backend uses several different kernel implementations. Did you choose only one of them for the generic kernel? My guess is that if all of them went into the generic kernel (along with appropriate conditions), then you would not see failing kernels. The "appropriate" conditions, however, can be tricky. It can help to "learn" from the tuned cases when to select each flavor; e.g., it might not be "the size" (like M*N*K) but rather just M or N, or a combination of them, or something else. You will end up hard-coding some basic rules, and then we can actually throw away the whole offline prediction ;-)

alazzaro commented 3 months ago

A side note: it is risky to use kernels from previous architectures unless we test them on the new one.

The CUDA/HIP backend uses several different kernel implementations. Did you choose only one of them for the generic kernel? My guess is that if all of them went into the generic kernel (along with appropriate conditions), then you would not see failing kernels. The "appropriate" conditions, however, can be tricky. It can help to "learn" from the tuned cases when to select each flavor; e.g., it might not be "the size" (like M*N*K) but rather just M or N, or a combination of them, or something else. You will end up hard-coding some basic rules, and then we can actually throw away the whole offline prediction ;-)

So, I went for the most-used kernel type ("medium") and tried to figure out the dependencies on the other parameters (tile_m, tile_n, threads, grouping, w, v). I tried several branches depending on the size, but it is hard (especially for rectangular blocks). In particular, for large kernels (55x55x55, for example), every combination fails (I ended up running the autotuning: almost all generated kernel combinations fail, so we cannot put it in production).

Now I'm testing an extreme case: always use the generic kernel. Let's see what the CP2K-CI tells us...

alazzaro commented 3 months ago

OK, the generic kernel passes all tests in the CP2K-CI (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-48ac3b6d_report.txt). The one wrong result pre-exists this PR.