ICLDisco / dplasma

DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.

DPLASMA Warmup -- 2nd try #69

Closed therault closed 1 year ago

therault commented 1 year ago

This is a second try for solving the warmup issue in DPLASMA (especially in CUDA codes).

Here are some performance measurements of the approach proposed in this PR, on Leconte (8x V100):

[Figure: dpotrf on Leconte, warmup vs. no warmup, averages]

[Figure: dpotrf on Leconte, warmup vs. no warmup, details]

'gflops/avg' is the ratio of the gflops of a given run to the appropriate average: for runs without warmup, the average excluding the outlier; for runs with warmup, the average of all measured points.
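For readers who want to reproduce the normalization, here is a minimal sketch of the 'gflops/avg' ratio described above. The function name `gflops_over_avg`, the sample numbers, and the choice of treating the first run as the outlier in the no-warmup case are all assumptions made for illustration; the scripts used to produce the plots are not shown in this thread.

```python
from statistics import mean


def gflops_over_avg(gflops, warmup):
    """Divide each run's gflops by the 'appropriate average'.

    gflops: per-run measurements, in execution order.
    warmup: True if the series was measured after a warmup run.
    """
    if not gflops:
        raise ValueError("need at least one measurement")
    # Assumption: without warmup, the outlier is the first (cold) run,
    # so it is left out of the reference average. With warmup, every
    # measured point contributes to the average.
    if warmup or len(gflops) == 1:
        reference = gflops
    else:
        reference = gflops[1:]
    avg = mean(reference)
    return [g / avg for g in gflops]


# Made-up numbers, for illustration only.
no_warmup = [400.0, 1000.0, 1000.0, 1000.0]
with_warmup = [980.0, 1000.0, 1020.0]

ratios_cold = gflops_over_avg(no_warmup, warmup=False)
ratios_warm = gflops_over_avg(with_warmup, warmup=True)
```

With this normalization, a slow first run shows up as a ratio well below 1, while the remaining runs stay close to 1.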

One warmup problem remains unidentified: at large tile sizes (512, 1024), with 1 to 4 GPUs, the first actual run is still slower than the others for small problem sizes. The source of the issue is unclear at this point, but the warmup patch fixes most of the CUDA/CUBLAS warmup issues.
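As a generic illustration of why a warmup pass matters for timing, the sketch below runs a kernel once untimed before the measured runs. It is not the DPLASMA implementation: `time_runs` and `FakeKernel` are hypothetical names, and the sleep-based one-time cost merely stands in for library or device initialization on first use.

```python
import time


def time_runs(kernel, nruns, warmup=True):
    """Time `nruns` calls of `kernel`, optionally after one untimed call."""
    if warmup:
        kernel()  # untimed: absorbs one-time initialization costs
    timings = []
    for _ in range(nruns):
        start = time.perf_counter()
        kernel()
        timings.append(time.perf_counter() - start)
    return timings


class FakeKernel:
    """Hypothetical stand-in for a GPU kernel with a one-time setup cost."""

    def __init__(self):
        self.calls = 0
        self.initialized = False

    def __call__(self):
        self.calls += 1
        if not self.initialized:
            time.sleep(0.05)  # simulated first-use initialization
            self.initialized = True
        time.sleep(0.001)  # simulated steady-state work


warm_kernel = FakeKernel()
warm_timings = time_runs(warm_kernel, nruns=3, warmup=True)

cold_kernel = FakeKernel()
cold_timings = time_runs(cold_kernel, nruns=3, warmup=False)
```

In the cold series the first timing includes the simulated setup cost, while in the warm series that cost is paid by the untimed call and does not appear in any measurement.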

The goal of the current code is to include changes for all tests that feature a CUDA implementation and timing:

TRSM is the last kernel that features a CUDA implementation, and it does not have timing in its testings.

During the discussion, Aurelien pointed out an issue with HIP: memory allocation on the HIP device was lazy at some point, and allocation at first touch accounts for a significant part of the warmup overhead of the HIP runs. We decided that this should be solved at the PaRSEC level, during memory allocation, and not at the DPLASMA warmup level.

therault commented 1 year ago

Done: all PTG tests that are CUDA-enabled

To do:

bosilca commented 1 year ago

As discussed on 03/31/23 we need to

therault commented 1 year ago
abouteiller commented 1 year ago
  • GPU load statistics should be reset after the completion of each warmup test. Shouldn't that happen automatically via the completion of tasks? parsec#01592dc6 adds a function to do that (explicit call)

This is done in #89