Closed alazzaro closed 3 months ago
For the record, I've disabled the Daint CI for the moment (Daint will disappear soon) and changed jenkins-cscs in https://github.com/cp2k/dbcsr/settings/access to the "Triage" role ("Write" is required to trigger the CI).
It turns out a generic kernel is hard to make without the proper tuning procedure (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-47248377_report.txt for some of the failing kernels). Since we do test all generated kernels, the newly proposed workflow is:
As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. 55x55x14, 55x55x26, 38x38x2).
In the future, I would consider switching to cuBLAS for case 3. For the moment, I think this is a step forward.
> As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. 55x55x14, 55x55x26, 38x38x2).
I think this is due to an implementation ill-suited to larger kernels, e.g. too many registers used. I believe this goes along with the long JIT compilation of such kernels, i.e., the compiler goes crazy trying to avoid/implement spilling the excess registers. A way to circumvent this is to implement a max size inside the kernel and to branch into a different flavor for larger kernels.
> In the future, I would consider switching to cuBLAS for case 3. For the moment, I think this is a step forward.
Yes, agree. Though, I have not implemented calling MKL for GPUs in the case of the OpenCL backend. However, the OpenCL backend validates all kernel sizes up to the static maximum we have set for all GPUs.
> That pushes the computation to the ACC (previously it was falling back to the CPU with a big performance penalty).
As a note for other readers, the big penalty is due to the data already being uploaded to the GPU, rather than per-se slow CPU performance.
> As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. 55x55x14, 55x55x26, 38x38x2).
>
> I think this is due to an implementation ill-suited to larger kernels, e.g. too many registers used. I believe this goes along with the long JIT compilation of such kernels, i.e., the compiler goes crazy trying to avoid/implement spilling the excess registers. A way to circumvent this is to implement a max size inside the kernel and to branch into a different flavor for larger kernels.
Yep, your analysis is correct. I've decided to add a limit on the kernel size (any dimension > 50). Otherwise, the compiler cannot compile (we get an error that the PTX cannot be loaded):
```
CUDA DRIVER API ERROR: ModuleLoadDataEx failed with error CUDA_ERROR_INVALID_PTX (/opt/cp2k/exts/dbcsr/src/acc/libsmm_acc/libsmm_acc.cpp::181)
```
My speculation is that the JIT will mostly fail for large kernels. The entire procedure needs more checking.
As a side note, it is risky to use kernels tuned for previous architectures unless we test them on the new one.
> CUDA DRIVER API ERROR: ModuleLoadDataEx failed with error CUDA_ERROR_INVALID_PTX (/opt/cp2k/exts/dbcsr/src/acc/libsmm_acc/libsmm_acc.cpp::181)
... and you can be very happy about this error. The worst case is that it compiles for a long time (much longer than normal) and produces broken code. Hats off to NVIDIA's toolchain for knowing when it failed!
> My speculation is that the JIT will mostly fail for large kernels. The entire procedure needs more checking.
Yes, there is remaining risk, but it is not necessarily attributable to JIT compilation; it is generally the same toolchain as the offline compiler. However, larger kernels really need an implementation that intrinsically limits register usage; the branch to decide about its use is cheap.
> As a side note, it is risky to use kernels tuned for previous architectures unless we test them on the new one.
The CUDA/HIP backend used several different kernel implementations. Did you choose only one of them for the generic kernel? My guess is that if all of them went into the generic kernel (along with appropriate conditions), then you would not see failing kernels. The "appropriate" conditions, however, can be tricky. It can help to "learn" from the tuned cases when to select either flavor, e.g., it might not be "the size" (like M*N*K) but rather just M or N, or a combination of them, or something else. You will end up hard-coding some basic rules, and then we can actually throw away the whole offline prediction ;-)
> As a side note, it is risky to use kernels tuned for previous architectures unless we test them on the new one.
>
> The CUDA/HIP backend used several different kernel implementations. Did you choose only one of them for the generic kernel? My guess is that if all of them went into the generic kernel (along with appropriate conditions), then you would not see failing kernels. The "appropriate" conditions, however, can be tricky. It can help to "learn" from the tuned cases when to select either flavor, e.g., it might not be "the size" (like M*N*K) but rather just M or N, or a combination of them, or something else. You will end up hard-coding some basic rules, and then we can actually throw away the whole offline prediction ;-)
So, I went for the most used kernel type ("medium") and tried to figure out the dependencies on the other parameters (tile_m, tile_n, threads, grouping, w, v). I tried several branches depending on the size, but it is hard (especially for rectangular blocks). In particular, for large kernels (55x55x55, for example), any combination fails (I ended up running the autotuning; almost all generated kernel combinations are failing, so we cannot put it in production).
Now I'm testing an extreme case: always use the generic kernel. Let's see what the CP2K-CI tells us...
OK, the generic kernel passes all tests in the CP2K-CI (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-48ac3b6d_report.txt ). The one wrong result pre-exists this PR.
This PR introduces a generic (untuned) kernel for the ACC, used when a tuned kernel is not present. That pushes the computation to the ACC (previously it fell back to the CPU with a big performance penalty).
The output changes accordingly, e.g.: