Closed by therault 1 year ago
Done: all PTG tests that are CUDA-enabled
To do:
As discussed on 03/31/23, we need to:
[x] cover the CPU only tests (in addition to the device tests)
[ ] touch the entire data at least once on the GPU (at the PaRSEC data level, during the allocation stage).
- GPU load statistics should be reset after the completion of each warmup test. Shouldn't that happen automatically via the completion of tasks? parsec#01592dc6 adds a function to do that via an explicit call.
This is done in #89
This is a second attempt at solving the warmup issue in DPLASMA (especially in CUDA codes).
Here are some performance measurements of the approach proposed in this PR, on Leconte (8x V100):
'gflops/avg' is the ratio of a run's gflops to the appropriate average: for runs without warmup, the average excludes the outlier; for runs with warmup, the average is over all measured points.
There is still an unidentified warmup problem at large tile sizes (512, 1024): for 1 to 4 GPUs, the first actual run is still slower than the others at small problem sizes. The source of this issue is unclear at this point, but the warmup patch fixes most of the CUDA/cuBLAS warmup issues.
The goal of the current code is to include changes for all tests that feature both a CUDA implementation and timing:
TRSM is the last remaining kernel with a CUDA implementation; its tests do not include timing, so it is not covered.
During the discussion, Aurelien pointed out an issue with HIP: memory allocation on the HIP device was lazy at some point, and allocation at first touch is a significant part of the warmup overhead of HIP runs. We decided that this should be solved at the PaRSEC level, during memory allocation, and not at the DPLASMA warmup level.