cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

DBCSR performs very poorly on GH200, when there are large blocks #795

Open abussy opened 6 months ago

abussy commented 6 months ago

I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the benchmarks/QS/H2O-XXX.inp tests). However, when large block sizes are involved, DBCSR becomes extremely costly. This seems to be linked to the GPU acceleration. The following data was obtained with the benchmarks/QS_low_scaling_postHF/32-H2O/H2O-32-RPA-TZ.inp input file, on a single node (4 GPUs, 8 ranks per GPU, 8 threads per rank). CP2K was compiled once with and once without the -D__DBCSR_ACC flag.

Timings are in seconds, as per the CP2K output file:

                         Total       dbcsr_multiply_generic
with -D__DBCSR_ACC       891.327     294.900
without -D__DBCSR_ACC    608.230      18.406

With GPU acceleration enabled, the time spent in DBCSR is increased by more than 15x. Profiling revealed that MPI communication is the main culprit.
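
As a quick check of the factor quoted above, the ratios can be recomputed from the reported timings. This is only a sketch: the dictionary keys and the `slowdown` helper are names chosen for illustration, and the numbers are the ones given in this comment.

```python
# Timings in seconds, copied from the comment above:
# (total run time, time in dbcsr_multiply_generic).
timings = {
    "with_acc": (891.327, 294.900),
    "without_acc": (608.230, 18.406),
}


def slowdown(with_acc: float, without_acc: float) -> float:
    """Return how many times slower the accelerated build is."""
    return with_acc / without_acc


total_ratio = slowdown(timings["with_acc"][0], timings["without_acc"][0])
dbcsr_ratio = slowdown(timings["with_acc"][1], timings["without_acc"][1])

print(f"total run time:         {total_ratio:.2f}x")
print(f"dbcsr_multiply_generic: {dbcsr_ratio:.2f}x")
```

The second ratio is the one behind the "more than 15x" statement; the first shows how much of that reaches the total run time.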

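I would appreciate any suggestion on how to solve this issue. What I have tried so far:

- Pretty much all keywords in the &GLOBAL%DBCSR input section of CP2K: no noticeable difference.
- Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs. Additionally, it slows down the benchmarks/QS/H2O-XXX.inp tests.
- Tuned new DBCSR kernels for the H100 GPU architecture. I am currently using kernels for A100. There was no noticeable difference.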

Building DBCSR without GPU support is not a satisfactory solution, as many other use cases are indeed accelerated. One possible way to address this would be the possibility of disabling DBCSR acceleration at run time, given a keyword in the input file.

hfp commented 5 months ago

Hi Augustin,

I am interested to see if the OpenCL-based acceleration in DBCSR can be of use. If you can help me get some access/dev-time on Alps permitted, please reach out (perhaps by private message or email). In the past (Daint), OpenCL was not well supported because the GPU mode was set to "exclusive" (nvidia-smi), and the ominous environment variable CRAY_CUDA_MPS did not cover toggling the mode. I think it would be good to have this set up better for the upcoming Alps. Regarding OpenCL, it's worth a shot: I can tune kernels, although the OpenCL backend also permits untuned usage (reasonable default kernel parameters). I would also try/tune the new OpenCL support in DBM and bring up the recipe in CP2K to make this more accessible.

> Pretty much all keywords in the &GLOBAL%DBCSR input section of CP2K: no noticeable difference

Same experience. Bumping the number of MMs per stack can help a bit, but it can also induce imbalance due to unfavorable remainder work.

> Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs. Additionally, it slows down the benchmarks/QS/H2O-XXX.inp tests.

Can you elaborate on how to achieve this (other than for work going through TAS/DBM directly)? Perhaps this could become a regular option rather than requiring code changes.

> Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs.

This is entirely possible with contemporary higher-end CPUs. My experience is that if the system contains multiple GPUs anyway, one can harvest them "for free" and get beyond a contemporary high-end CPU in the same system. If a weaker CPU was chosen on purpose (due to the emphasis on GPUs), the picture can turn in favor of the GPU(s). This is of course more pronounced if the workload has a high portion of DBT/DBM; otherwise it's an uphill battle against Amdahl's law.

> Tuned new DBCSR kernels for the H100 GPU architecture. I am currently using kernels for A100. There was no noticeable difference.

ACK. You can at least compile the A100 kernels with the compute capability corresponding to H100. In any case, I would not expect a big impact. Also, consider contributing your tuned parameters.

> One possible way to address this would be the possibility of disabling DBCSR acceleration at run time, given a keyword in the input file.

That would be welcome.

hfp commented 5 months ago

> With GPU acceleration enabled, the time spent in DBCSR is increased by more than 15x. Profiling revealed that MPI communication is the main culprit.

I recently saw something similar for CP2K/DBM: one of the MPI-enabled functions appeared high in the profile (even though it was intra-node) in one of our labs but not in the other (same kind of CPU). I attributed this to Fortran's ALLOCATE being much slower, due to the compiler or, more likely, the OS flavor. One resolution was to LD_PRELOAD an alternative, more scalable malloc implementation, e.g., TBB's malloc proxy. Btw, I have not found time to fix this particular issue at the code level, let alone upstream a change (my plan was to take a look at OpenMP's memory allocation, as OpenMP is an established programming model in CP2K).

abussy commented 5 months ago

Hi Hans, thanks a lot for all these insights!

I tried building DBCSR with OpenCL, but it seems CUDA does not provide OpenCL on aarch64 at the moment (e.g. here). If you happen to know a way around it, I'd be happy to try.

I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). As things stand, it is not ideal because each dbcsr_multiply call involves a copy of the DBCSR matrix to a DBM one. From my tests, this seems to be fairly affordable, but certainly not ideal. Feel free to try it. Note also that DBCSR still has more features than DBM, so complex matrices, or multiplications involving sub-matrices, are still done in DBCSR.

I've tuned the H100 kernels based on the A100 options. However, the A100 parameters are still way more complete, as they also include predicted kernels. I have not been able to run the predicting framework, I think because of filesystem limitations. So at this point, the A100 kernels are still better.

I'll see if I can try your malloc solution, that's an interesting one!

alazzaro commented 5 months ago

Update: @abussy shared the CP2K logs with me (in private) and I took a quick look at them. The drop in performance is due to a corner case in the test, where the stack size is too small (52 on average!) and the blocks are large (a lot of single computations). It has nothing to do with the GPU kernels themselves; basically, the library is not meant for such cases... I suggested some options; otherwise, I think the CPU switch flag can be a good idea...

BTW, @hfp any libxsmm for ARM to be included in CP2K?

hfp commented 5 months ago

> BTW, @hfp any libxsmm for ARM to be included in CP2K?

I will work on it. I have a few PRs pending for LIBXSMM; ideally, this should happen asap.

oschuett commented 5 months ago

> I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). ... From my tests, this seems to be fairly affordable, but certainly not ideal.

That's super interesting! I didn't think an incremental migration would be feasible. I'll look into this.

> Note also that DBCSR still has more features than DBM, so complex matrices, or multiplications involving sub-matrices, are still done in DBCSR.

Sub-matrix support should be fairly easy to add, and complex matrices are only used by CP2K in ~3 places, which can be refactored.

abussy commented 5 months ago

I'll continue the discussion I had with @alazzaro here, so that everybody who is interested can follow.

I was asked to test running with export DBCSR_MULTREC_LIMIT=1048576 and/or with a single OMP thread. Here is what I get from this experiment: simply setting export DBCSR_MULTREC_LIMIT=1048576 does nothing for the timings. However, when running with a single OMP thread, the CPU and GPU versions of DBCSR yield very similar timings on 1 node:

                         Total       dbcsr_multiply_generic
with -D__DBCSR_ACC       1078.298    77.591
without -D__DBCSR_ACC    1048.504    39.448

Going from 1 thread to 8 makes dbcsr_multiply calls ~4x more expensive. The overall timings of the single-thread runs are slower because the other parts of the code no longer benefit from OMP. On multiple nodes, the CPU version scales slightly better.

I am not sure that running with many MPI ranks and a small number of OMP threads is always a good solution though. There are 72 cores per GPU on GH200, and oversubscribing the GPU too much can be detrimental too. Also, if we go to multiple nodes, we might run into scaling issues due to the large number of ranks.
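
To make the trade-off concrete, the rank/thread layouts that fit on the cores can be enumerated. This is a rough sketch that assumes one OMP thread per core and that all 72 cores per GPU are available to CP2K; the function name and the set of thread counts are illustrative only.

```python
CORES_PER_GPU = 72  # from the comment above (GH200)
GPUS_PER_NODE = 4   # node layout used for the original timings


def layouts(cores_per_gpu, gpus_per_node, thread_counts=(1, 2, 4, 8)):
    """For each OMP thread count, compute the largest rank count that fits."""
    rows = []
    for threads in thread_counts:
        ranks_per_gpu = cores_per_gpu // threads  # ranks sharing one GPU
        rows.append({
            "threads_per_rank": threads,
            "ranks_per_gpu": ranks_per_gpu,
            "ranks_per_node": ranks_per_gpu * gpus_per_node,
            "idle_cores_per_gpu": cores_per_gpu - ranks_per_gpu * threads,
        })
    return rows


for row in layouts(CORES_PER_GPU, GPUS_PER_NODE):
    print(row)
```

Fewer threads per rank means more ranks sharing each GPU and more ranks in total, which is exactly the oversubscription and scaling concern raised above.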

@hfp I also tried TBB's malloc proxy. I only got marginal gains for this benchmark though.

abussy commented 5 months ago

This case can be solved by setting the environment variable DBCSR_N_STACKS=0. Then, the GPU accelerated version of DBCSR behaves normally again (negligible timings). Note that this issue also triggered PR #801.
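
For scripted runs, the variables mentioned in this thread can be set on a copy of the environment before launching CP2K. The sketch below is an assumption-laden illustration: `dbcsr_env` is a hypothetical helper, and the commented launch line (binary name, input file) is only an example of how the result might be used. DBCSR_N_STACKS and DBCSR_MULTREC_LIMIT are the variables named in this thread; OMP_NUM_THREADS is the standard OpenMP variable.

```python
import os


def dbcsr_env(n_stacks=None, multrec_limit=None, omp_threads=None, base=None):
    """Return a copy of the environment with the requested overrides applied."""
    env = dict(os.environ if base is None else base)
    if n_stacks is not None:
        env["DBCSR_N_STACKS"] = str(n_stacks)
    if multrec_limit is not None:
        env["DBCSR_MULTREC_LIMIT"] = str(multrec_limit)
    if omp_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_threads)
    return env


# Workaround reported above: DBCSR_N_STACKS=0.
env = dbcsr_env(n_stacks=0)
print(env["DBCSR_N_STACKS"])

# Hypothetical launch (binary and input names are assumptions):
# import subprocess
# subprocess.run(["cp2k.psmp", "-i", "H2O-32-RPA-TZ.inp"], env=env, check=True)
```

Passing the dictionary via `env=` keeps the override local to the launched process instead of changing the calling shell.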

hfp commented 5 months ago

> I tried building DBCSR with OpenCL, but it seems CUDA does not provide OpenCL on aarch64 at the moment (e.g. here). If you happen to know a way around it, I'd be happy to try.

On x86, NVIDIA's implementation of OpenCL is simply part of every CUDA installation (which in turn can be part of an NVHPC installation). However, I had an issue like yours on a Jetson-AGX system (aarch64 as well) quite some time ago. It's an embedded system with a customized OS. My solution at that time was upgrading it to stock Ubuntu. Of course, that's not a solution in your case. I think it can be useful to get Alps set up with OpenCL (support request). For the time being, can you check if the CUDA installation simply carries OpenCL? Perhaps something like which nvcc gets you to the installation directory, and then find /path/to/cuda -type f -name "libOpenCL.so*".
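
The same check can be scripted. The following is a minimal sketch, assuming nvcc lives in a bin/ directory directly under the CUDA root (which may not hold for every installation); both function names are illustrative.

```python
import shutil
from pathlib import Path


def guess_cuda_root():
    """Guess the CUDA root as the parent of the directory containing nvcc."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    return Path(nvcc).resolve().parent.parent


def find_opencl_libs(cuda_root):
    """Return files below cuda_root whose name starts with libOpenCL.so.

    Note: is_file() also accepts symlinks that point to files, which is
    slightly broader than find's "-type f".
    """
    return sorted(p for p in Path(cuda_root).rglob("libOpenCL.so*") if p.is_file())


def main():
    root = guess_cuda_root()
    if root is None:
        print("nvcc not found on PATH")
        return
    libs = find_opencl_libs(root)
    if not libs:
        print(f"no libOpenCL.so* found below {root}")
    for lib in libs:
        print(lib)


if __name__ == "__main__":
    main()
```

An empty result would match the situation described above, where the CUDA installation does not ship OpenCL.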

abussy commented 5 months ago

I can confirm that OpenCL is not distributed with CUDA on Alps. I'll get the word out, and we'll see if somebody comes up with something.

abussy commented 5 months ago

PR #801 solves this issue. While this is not an automatic fix, it allows the user to run efficiently when encountering this issue (by setting an environment variable).

alazzaro commented 5 months ago

Let's keep it open for future improvements...

Schroedingers0216 commented 4 months ago

To measure the execution time of the dbcsr_multiply_generic module in CP2K, what settings do I need to configure?

alazzaro commented 4 months ago

> To measure the execution time of the dbcsr_multiply_generic module in CP2K, what settings do I need to configure?

Just look at the CP2K output timings and search for dbcsr_multiply_generic, e.g.:

 dbcsr_multiply_generic            2286 12.5    0.133    0.133   26.843   26.896

The last two columns are the inclusive time (average across ranks, max for all ranks).
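
If the timings need to be collected from many output files, a line like the one above can be parsed with a few lines of code. This is a sketch only: the field names for the middle columns (call count, average stack depth, self time) reflect my reading of the CP2K timing report and should be treated as assumptions, as should the helper names.

```python
from typing import NamedTuple, Optional


class TimingRow(NamedTuple):
    name: str
    calls: int
    asd: float        # assumed: average stack depth
    self_avg: float   # assumed: self time, average across ranks
    self_max: float   # assumed: self time, maximum across ranks
    total_avg: float  # inclusive time, average across ranks
    total_max: float  # inclusive time, maximum across ranks


def parse_timing_line(line: str) -> Optional[TimingRow]:
    """Parse one timing line; return None if it does not have the expected shape."""
    fields = line.split()
    if len(fields) != 7:
        return None
    try:
        return TimingRow(fields[0], int(fields[1]), *map(float, fields[2:]))
    except ValueError:
        return None


def find_timing(text: str, routine: str) -> Optional[TimingRow]:
    """Return the first timing row for the given routine, if any."""
    for line in text.splitlines():
        row = parse_timing_line(line)
        if row is not None and row.name == routine:
            return row
    return None


sample = " dbcsr_multiply_generic            2286 12.5    0.133    0.133   26.843   26.896"
row = find_timing(sample, "dbcsr_multiply_generic")
print(row.total_avg, row.total_max)
```

Lines that do not consist of a name followed by six numbers are skipped, so the full CP2K output can be passed to `find_timing` unchanged.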