Closed by therault 1 year ago
Done: all PTG tests that are CUDA-enabled
To do:
As discussed on 03/31/23, we need to:
[x] cover the CPU only tests (in addition to the device tests)
[ ] touch the entire data at least once on the GPU (at the PaRSEC data level, during the allocation stage).
- GPU load statistics should be reset after the completion of each warmup test. Shouldn't that happen automatically via the completion of tasks? parsec#01592dc6 adds a function to do that via an explicit call.
This is done in #89
This is a second attempt at solving the warmup issue in DPLASMA (especially in CUDA codes).
Here are some performance measurements of the approach proposed in this PR, on Leconte (8x V100):
'gflops/avg' is the ratio of a run's gflops to the appropriate average: for runs without warmup, the average excludes the outlier; for runs with warmup, the average is over all measured points.
There is still an unidentified warmup problem at large tile sizes (512, 1024): for 1 to 4 GPUs, the first actual run is still slower than the others at small problem sizes. The source of this issue is unclear at this point, but the warmup patch fixes most of the CUDA/cuBLAS warmup issues.
The goal of the current code is to include changes for all tests that feature both a CUDA implementation and timing:
TRSM is the last remaining kernel with a CUDA implementation; its tests do not include timing, so it is not covered.
During the discussion, Aurelien pointed out an issue with HIP: memory allocation on the HIP device was lazy at some point, and allocation at first touch is a significant part of the warmup overhead of HIP runs. We decided that this should be solved at the PaRSEC level, during memory allocation, and not at the DPLASMA warmup level.