cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

CP2K performs poorly on AMD platforms when using the DBCSR HIP backend. #815

Open zhl201226 opened 5 days ago

zhl201226 commented 5 days ago

I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than when using the CPU. Examining the API calls with HIPprof, I noticed a large number of H2D (Host to Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device to Host). I would like to understand why there are so many H2D transfers and where in the code this occurs. Below, I have attached the JSON file, which can be opened in chrome://tracing.

Thank you.

zhl201226 commented 5 days ago

dbcsr_json.zip

hfp commented 5 days ago

If possible, can you share the input file and perhaps the profile output from running the workload? The profile output contains the timings printed by CP2K at the end. What is already clear is that this is not only about DBCSR but also about CP2K's GRID components (collocate/integrate), and perhaps some PW as well.

Regarding, "H2D -> LaunchKernel -> D2H" - this is idealized assuming only a single transfer/array is the input of such kernel and in turn for the output/result as well.

zhl201226 commented 4 days ago

I tried setting the DBCSR backend to other options and did not find a large number of H2D transfers in HIPprof. Therefore, I believe DBCSR is causing the issue; it might be due to the transpose_d kernel. I could not locate the specific code responsible for the numerous H2D transfers. Below, I have attached the test file and output file. Thank you. @hfp test.tar.gz

hfp commented 4 days ago

For the record, if there are "unnecessary" data transfers, i.e., transfers that could be combined or avoided, this issue applies to all backends as well as all GPUs/vendors. The hint about transposes might be a first step.

@zhl201226 you may try the DBCSR_RUN_ON_GPU=0 environment variable and recapture the GPU profile. This environment variable disables DBCSR on GPUs even if GPU support is compiled into the application (and leaves CP2K's other uses of the GPU intact).
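(For reference, such a switch is typically read via getenv at runtime; below is a minimal sketch of how a gate like this is usually consumed, assuming nothing about DBCSR's actual implementation. In practice, you would simply prefix the usual launch command with DBCSR_RUN_ON_GPU=0.)

```cpp
// Hypothetical sketch of how an environment-variable gate such as
// DBCSR_RUN_ON_GPU is typically consumed; NOT DBCSR's actual code.
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool run_on_gpu() {
  const char* env = std::getenv("DBCSR_RUN_ON_GPU");
  if (env == nullptr) return true;    // unset: keep the compiled-in GPU path
  return std::strcmp(env, "0") != 0;  // "0": fall back to the CPU path at runtime
}

int main() {
  std::printf("GPU path enabled: %s\n", run_on_gpu() ? "yes" : "no");
  return 0;
}
```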

hfp commented 4 days ago

Looking at CP2K's profile, local GEMMs (cp_fm_gemm) consume ~25% of the TTS (time to solution) on this system (just as a note). However, multiply_cannon* and dbcsr_mm_hostdrv_process are interesting. Given that dbcsr_mm_hostdrv_process is relatively high, it seems a reasonable portion of fallbacks is happening. Given the previous implementation, the fallbacks may be accompanied by transfers without actually launching a kernel.

zhl201226 commented 4 days ago

I have identified that the H2D transfers occur in the dbcsr_mm_accdrv_process module. Is this module dividing the data into small chunks for transfer? Can they be merged into larger chunks? Additionally, I previously did not use ACC to accelerate DBCSR, but the run seems to take longer now, so I am not sure whether DBCSR_RUN_ON_GPU=0 is effective. Could you please provide more optimization suggestions?
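(To illustrate the kind of batching being asked about, here is a generic sketch, not DBCSR's actual transfer code: many small host-to-device copies can be packed into one pinned staging buffer and sent with a single larger hipMemcpy.)

```cpp
// Generic sketch of merging many small H2D copies into one staged copy;
// block sizes and buffers are hypothetical, this is not DBCSR code.
#include <hip/hip_runtime.h>
#include <cstring>
#include <vector>

int main() {
  // Small host-side blocks (stand-ins for small matrix blocks).
  std::vector<std::vector<double>> blocks = {
      std::vector<double>(64, 1.0),
      std::vector<double>(128, 2.0),
      std::vector<double>(32, 3.0)};

  size_t total = 0;
  for (const auto& b : blocks) total += b.size();

  // Pack the blocks into one pinned host staging buffer ...
  double* staging = nullptr;
  hipHostMalloc(reinterpret_cast<void**>(&staging), total * sizeof(double),
                hipHostMallocDefault);
  size_t offset = 0;
  for (const auto& b : blocks) {
    std::memcpy(staging + offset, b.data(), b.size() * sizeof(double));
    offset += b.size();
  }

  // ... and issue a single larger H2D transfer instead of one copy per block.
  double* dbuf = nullptr;
  hipMalloc(reinterpret_cast<void**>(&dbuf), total * sizeof(double));
  hipMemcpy(dbuf, staging, total * sizeof(double), hipMemcpyHostToDevice);

  hipFree(dbuf);
  hipHostFree(staging);
  return 0;
}
```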

hfp commented 4 days ago

Sorry, I guess DBCSR_RUN_ON_GPU is only supported in the most recent, if not unreleased, version. It was not meant as an optimization suggestion but rather as a way to systematically rule out or implicate DBCSR. Your example input is worth looking at for contributors.

zhl201226 commented 4 days ago

How do I contact contributors? @hfp

hfp commented 4 days ago

Just give it some time; they will see this open issue ;-)

zhl201226 commented 4 days ago

> Just give it some time; they will see this open issue ;-)

Thank you :-)

hfp commented 4 days ago

( Side note, GLOBAL| CPU model name does not show up in the log ;-)

hfp commented 4 days ago

Regarding the test input, it is missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and fails in the Cholesky decomposition.

zhl201226 commented 4 days ago

> ( Side note, GLOBAL| CPU model name does not show up in the log ;-)

"By the way, using DBCSR_RUN_ON_GPU=0 did not significantly improve performance. The CPU model name has been hidden for other reasons, but I can provide it if needed." image

zhl201226 commented 4 days ago

> Regarding the test input, it is missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and fails in the Cholesky decomposition.

This restart file is too large to upload. Is there another way to send it to you?

hfp commented 4 days ago

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

zhl201226 commented 4 days ago

> Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

I have already sent it to you via e-mail. Thank you.

hfp commented 3 days ago

> Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

> I have already sent it to you via e-mail. Thank you.

( Let's see, the e-mail has not arrived yet; perhaps size restrictions )

zhl201226 commented 3 days ago

> Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

> I have already sent it to you via e-mail. Thank you.

> ( Let's see, the e-mail has not arrived yet; perhaps size restrictions )

I have resent it to my.name@intel.com. Please check it. Best regards

hfp commented 1 hour ago

> I have resent it to my.name@intel.com. Please check it. Best regards

Literally? I envisioned my.name would be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.