Open abussy opened 6 months ago
Hi Augustin,
I am interested to see if the OpenCL-based acceleration in DBCSR can be of use. For some access/dev-time on Alps, perhaps you can help me get this permitted (private messaging/email). In the past (Daint), OpenCL was not well supported because the GPU mode was set to "exclusive" (nvidia-smi), and the ominous environment variable CRAY_CUDA_MPS did not cover toggling the mode. I think it would be good to have this set up better for the upcoming Alps. Regarding OpenCL, it's a shot, and I can basically tune kernels, although the OpenCL backend permits untuned usage (reasonable default kernel parameters). I would also try/tune the new OpenCL support in DBM and bring up the recipe in CP2K to make this more accessible.
Pretty much all keywords in the &GLOBAL%DBCSR input section of CP2K: no noticeable difference
Same experience. Bumping the number of MMs per stack can help a bit, but it can also induce imbalance due to unfavorable remainder work.
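To make the remainder-work remark concrete, here is a minimal sketch of how a fixed stack size splits a batch of multiplications. The function name and the numbers are hypothetical, and this is only an illustration of the arithmetic, not a description of how DBCSR fills or schedules its stacks internally.

```python
def stack_sizes(n_multiplications, stack_size):
    """Split a batch of multiplications into stacks of at most stack_size.

    Hypothetical illustration: the last stack holds whatever is left over.
    """
    if stack_size <= 0:
        raise ValueError("stack_size must be positive")
    full, remainder = divmod(n_multiplications, stack_size)
    return [stack_size] * full + ([remainder] if remainder else [])


# With a larger stack size there are fewer stacks, but the leftover stack
# can be much smaller than the others, which is harder to balance.
small = stack_sizes(1000, 300)
large = stack_sizes(1000, 900)
print(small)
print(large)
```

In this made-up case the larger stack size leaves one stack with 900 entries and one with 100, so two workers processing one stack each would finish at very different times.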
Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs. Additionally, it slows down the benchmarks/QS/H2O-XXX.inp tests.
Can you elaborate on how to achieve this (other than for work going through TAS/DBM directly)? Perhaps this is something to become a more regular choice rather than code changes.
Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs.
This is entirely possible with contemporary higher-end CPUs. My experience is, if the system contains multiple GPUs anyway, one can harvest them "for free" and get beyond a contemporary high-end CPU in the same system. If the CPU was chosen weaker on purpose (due to emphasis on GPU), the picture can turn in favor of the GPU(s). This is of course more pronounced if the workload has a high portion of DBT/DBM; otherwise it's an uphill battle against Amdahl's law.
Tuned new DBCSR kernels for the H100 GPU architecture. I am currently using kernels for A100. There was no noticeable difference.
ACK. You can at least compile the A100 kernels with compute capability corresponding to H100. In any case, I would not expect big impact. Also, consider contributing your tuned parameters.
One possible way to address this would be the possibility of disabling DBCSR acceleration at run time, given a keyword in the input file.
That would be welcome.
With GPU acceleration enabled, the time spent in DBCSR is increased by more than 15x. Profiling revealed that MPI communication is the main culprit.
I had this for CP2K/DBM recently as well: one of the MPI-enabled functions appeared high in the profile (it was even intra-node) in one of our labs but not in the other (same CPU kind). I blamed this on Fortran's ALLOCATE being much slower due to the compiler or, more likely, the OS flavor. One resolution was to LD_PRELOAD an alternative, more scalable malloc implementation, e.g., TBB's malloc proxy. Btw, I have not found time to fix this particular issue at code level, let alone upstream a change (my plan was to take a look at OpenMP's memory allocation, as this is an established programming model in CP2K).
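A minimal sketch of the LD_PRELOAD approach, written as a small launcher helper. The library file name libtbbmalloc_proxy.so.2 is an assumption about the local oneTBB installation (it may need a full path), and the helper name is made up for illustration.

```python
import os


def prepend_preload(env, library):
    """Return a copy of env with library placed first in LD_PRELOAD.

    Any libraries already listed in LD_PRELOAD are kept after the new one.
    """
    new_env = dict(env)
    current = new_env.get("LD_PRELOAD", "")
    new_env["LD_PRELOAD"] = f"{library}:{current}" if current else library
    return new_env


# Assumed library name; adjust to where TBB's malloc proxy is installed.
env = prepend_preload(os.environ, "libtbbmalloc_proxy.so.2")
print(env["LD_PRELOAD"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting the application, so the preload only affects that one process tree.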
Hi Hans, thanks a lot for all these insights!
I tried building DBCSR with OpenCL, but it seems that CUDA does not provide OpenCL on aarch64 at the moment (e.g. here). If you happen to know a way around it, I'd be happy to try.
I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). As things stand, it is not ideal because each dbcsr_multiply call involves a copy of the DBCSR matrix to a DBM one. From my tests, this seems to be fairly affordable, but certainly not ideal. Feel free to try it. Note also that DBCSR still has more features than DBM, so complex matrices, or multiplications involving sub-matrices, are still done in DBCSR.
I've tuned the H100 kernels based on the A100 options. However, the A100 parameters are still way more complete, as they also include predicted kernels. I have not been able to run the predicting framework, I think because of filesystem limitations. So at this point, the A100 kernels are still better.
I'll see if I can try your malloc solution, that's an interesting one!
Update: @abussy shared (in private) the CP2K logs with me and I took a quick look at them. The drop in performance is due to a corner case of the test where the stack size is too small (52 on average!) and we have large blocks (a lot of single computations). Nothing related to the GPU kernels themselves; basically the library is not meant for such cases... I suggested some options; otherwise I think the CPU switch flag can be a good idea...
BTW, @hfp any libxsmm for ARM to be included in CP2K?
I will work on it. I have a few PRs pending for LIBXSMM; ideally, this should happen asap.
I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). ... From my tests, this seems to be fairly affordable, but certainly not ideal.
That's super interesting! I didn't think an incremental migration would be feasible. I'll look into this.
Note also that DBCSR still has more features than DBM, so complex matrices, or multiplications involving sub-matrices, are still done in DBCSR.
Sub-matrix should be fairly easy to add and complex matrices are only used by CP2K in ~3 places which can be refactored.
I'll continue the discussion I had with @alazzaro here, so that everybody who is interested can follow.
I was asked to test running with export DBCSR_MULTREC_LIMIT=1048576 and/or with a single OMP thread. Here is what I get from this experiment: simply setting export DBCSR_MULTREC_LIMIT=1048576 does nothing for the timings. However, when running with a single OMP thread, the CPU and GPU versions of DBCSR yield very similar timings on 1 node:
| | Total | dbcsr_multiply_generic |
|---|---|---|
| with -D__DBCSR_ACC | 1078.298 | 77.591 |
| without -D__DBCSR_ACC | 1048.504 | 39.448 |
Going from 1 thread to 8 makes dbcsr_multiply calls ~4x more expensive. The overall timings are slower due to other parts of the code not benefiting from OMP. On multiple nodes, the CPU version scales slightly better.
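For reference, the single-thread numbers reported above can be put in relation with a few lines of arithmetic. This is only a sketch over the four values quoted in this comment; the dictionary layout and variable names are illustrative.

```python
# (total time, dbcsr_multiply_generic time) as reported for 1 OMP thread, 1 node
timings = {
    "with -D__DBCSR_ACC": (1078.298, 77.591),
    "without -D__DBCSR_ACC": (1048.504, 39.448),
}

# Fraction of the total run time spent in dbcsr_multiply_generic
shares = {label: multiply / total for label, (total, multiply) in timings.items()}
for label, share in shares.items():
    print(f"{label}: {100 * share:.1f}% of total time in dbcsr_multiply_generic")

# How much more expensive the multiplication is with acceleration enabled
ratio = timings["with -D__DBCSR_ACC"][1] / timings["without -D__DBCSR_ACC"][1]
print(f"dbcsr_multiply_generic ratio (with / without): {ratio:.2f}")

# Difference in total run time between the two builds
total_ratio = timings["with -D__DBCSR_ACC"][0] / timings["without -D__DBCSR_ACC"][0]
print(f"total time ratio (with / without): {total_ratio:.3f}")
```

The multiplication itself is still roughly twice as expensive with acceleration enabled, but since it is a small fraction of the run in both builds, the totals end up within a few percent of each other.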
I am not sure that running with many MPI ranks and a small number of OMP threads is always a good solution though. There are 72 cores per GPU on GH200, and oversubscribing the GPU too much can be detrimental too. Also, if we go to multiple nodes, we might run into scaling issues due to the large number of ranks.
@hfp I also tried TBB's malloc proxy. I only got marginal gains for this benchmark though.
This case can be solved by setting the environment variable DBCSR_N_STACKS=0. Then, the GPU accelerated version of DBCSR behaves normally again (negligible timings). Note that this issue also triggered PR #801.
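A minimal sketch of applying this setting from a launcher script rather than the shell. The helper name is made up, and the launch command in the trailing comment (MPI launcher, executable name, input file) is a placeholder assumption that depends on the local installation.

```python
import os


def env_with(base=None, **overrides):
    """Return a copy of the environment with the given variables set.

    Values are converted to strings, as environment variables require.
    """
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        env[name] = str(value)
    return env


env = env_with(DBCSR_N_STACKS=0)
print("DBCSR_N_STACKS =", env["DBCSR_N_STACKS"])

# Hypothetical launch; replace the command with whatever is used on your system:
# import subprocess
# subprocess.run(["srun", "cp2k.psmp", "-i", "H2O-32-RPA-TZ.inp"], env=env, check=True)
```

Passing the variable through env keeps the setting local to this run, which is convenient when only the affected benchmarks need it.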
I tried building DBCSR with OpenCL, but it seems that CUDA does not provide OpenCL on aarch64 at the moment (e.g. here). If you happen to know a way around it, I'd be happy to try.
On x86, NVIDIA's implementation of OpenCL is simply part of every CUDA installation (which in turn can be part of an NVHPC installation). However, I had an issue like yours on a Jetson-AGX system (aarch64 as well) quite some time ago. It's an embedded system with a customized OS. My solution at that time was upgrading it to stock Ubuntu. Of course, that's not a solution in your case. I think it can be useful to get Alps set up with OpenCL (support request). For the time being, can you check if the CUDA installation simply carries OpenCL? Perhaps something like which nvcc gets you to the point of installation, and then find /path/to/cuda -type f -name libOpenCL.so* shows whether the library is there.
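The same check can be sketched in a few lines of Python. The function names are made up, and deriving the CUDA root as two directories above nvcc is an assumption about the usual <root>/bin/nvcc layout, which may not hold for every installation.

```python
import shutil
from pathlib import Path


def cuda_root_from_nvcc():
    """Guess the CUDA root from the nvcc found on PATH (assumes <root>/bin/nvcc)."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    return Path(nvcc).resolve().parent.parent


def find_opencl_libs(root):
    """List regular files named libOpenCL.so* below root.

    Symlinks are skipped to mirror the behavior of find -type f.
    """
    return sorted(
        p for p in Path(root).rglob("libOpenCL.so*")
        if p.is_file() and not p.is_symlink()
    )


def main():
    root = cuda_root_from_nvcc()
    if root is None:
        print("nvcc not found on PATH")
        return
    libs = find_opencl_libs(root)
    print(f"CUDA root (guessed): {root}")
    print("\n".join(map(str, libs)) if libs else "no libOpenCL.so* found")
```

Calling main() prints the guessed root and any matching files; an empty result would be consistent with OpenCL not being shipped in that installation.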
I can confirm to you that OpenCL is not distributed with CUDA on Alps. I'll get the word out, and we'll see if somebody comes up with something.
PR #801 solves this issue. While this is not an automatic fix, it allows the user to run efficiently when encountering this issue (by setting an environment variable).
Let's keep it open for future improvements...
To measure the execution time of the dbcsr_multiply_generic module in CP2K, what settings do I need to configure?
Just look at the CP2K output timings and search for dbcsr_multiply_generic, e.g.:
dbcsr_multiply_generic 2286 12.5 0.133 0.133 26.843 26.896
The last two columns are the inclusive time (average across ranks, max for all ranks).
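If the timings need to be collected from many runs, a line like the one above can be picked out of the output with a short script. This is a sketch under the assumption that the row has exactly seven whitespace-separated fields; only the meaning of the last two columns is stated in this thread, and the names given to the other fields here are my own labels.

```python
def parse_timing_line(line):
    """Split one row of the CP2K timing report into named fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    name, calls, depth, self_avg, self_max, total_avg, total_max = fields
    return {
        "name": name,
        "calls": int(calls),
        "depth": float(depth),
        "self_avg": float(self_avg),
        "self_max": float(self_max),
        "total_avg": float(total_avg),  # inclusive time, average across ranks
        "total_max": float(total_max),  # inclusive time, maximum over all ranks
    }


def find_routine(report_text, routine="dbcsr_multiply_generic"):
    """Return the parsed row for routine, or None if it is not in the text."""
    for line in report_text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[0] == routine:
            return parse_timing_line(line)
    return None


row = find_routine("dbcsr_multiply_generic 2286 12.5 0.133 0.133 26.843 26.896")
print(row["total_avg"], row["total_max"])
```

Reading the whole CP2K output file into a string and passing it to find_routine gives the same result, as long as the routine appears once in the timing report.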
I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the benchmarks/QS/H2O-XXX.inp tests). However, when large block sizes are involved, DBCSR becomes extremely costly. This seems to be linked to the GPU acceleration. The following data was obtained with the benchmarks/QS_low_scaling_postHF/32-H2O/H2O-32-RPA-TZ.inp input file, on a single node (4 GPUs, 8 ranks per GPU, 8 threads per rank). In turn, CP2K was compiled with and without the -D__DBCSR_ACC flag.
With GPU acceleration enabled, the time spent in DBCSR is increased by more than 15x. Profiling revealed that MPI communication is the main culprit.
I would appreciate any suggestion on how to solve this issue. What I have tried so far:
- Pretty much all keywords in the &GLOBAL%DBCSR input section of CP2K: no noticeable difference
- Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs. Additionally, it slows down the benchmarks/QS/H2O-XXX.inp tests.
- Tuned new DBCSR kernels for the H100 GPU architecture. I am currently using kernels for A100. There was no noticeable difference.

Building DBCSR without GPU support is not a satisfactory solution, as many other use cases are indeed accelerated. One possible way to address this would be the possibility of disabling DBCSR acceleration at run time, given a keyword in the input file.