ICLDisco / dplasma

DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.

[BBT#89] Dplasma warming run #48

Closed: therault closed this 1 year ago

therault commented 2 years ago

PR based on https://bitbucket.org/icldistcomp/dplasma/pull-requests/88 and https://bitbucket.org/icldistcomp/dplasma/pull-requests/89

For all DPLASMA testing drivers that support timing, add support for the --nruns option (defaults to 3 timed runs per execution) plus a warmup iteration.

-x forces --nruns to 0, meaning only the warmup run is executed (with no timing information displayed), to prepare the matrices for the correctness check.

Each DPLASMA tester does nruns+1 iterations of the main operation (sometimes hard to define for operations that involve multiple DAGs; in that case each DAG is run nruns+1 times), and only the last nruns timings are displayed, to remove artefacts like the cost of initializing the mathematical library.
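
For illustration, here is a minimal, self-contained sketch of the warmup-plus-timing pattern described above. The operation body and the timing calls are simplified stand-ins for this sketch only, not the actual DPLASMA tester code (the real drivers time distributed DAGs with SYNC_TIME_START/SYNC_TIME_PRINT):

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for the benchmarked operation (a factorization DAG
     * in the real testers); hypothetical, for illustration only. */
    static void run_main_operation(void)
    {
        volatile double x = 0.0;
        for (long i = 0; i < 10000000L; i++) x += 1.0;
    }

    int main(void)
    {
        int nruns = 3;  /* --nruns default; -x forces nruns = 0 */

        /* nruns+1 iterations: iteration 0 is the untimed warmup that
         * absorbs one-time costs (math-library init, GPU context, ...). */
        for (int t = 0; t <= nruns; t++) {
            clock_t start = clock();
            run_main_operation();
            clock_t stop = clock();
            if (t > 0)  /* only the last nruns timings are reported */
                printf("run %d: %g s\n", t,
                       (double)(stop - start) / CLOCKS_PER_SEC);
        }
        return 0;
    }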

This patch also introduces fixes in a few benchmarks (zheev; the CUDA-enabled DTD tester, which did not handle the case where CUDA is compiled in but no CUDA device is available; and a few other issues).

bosilca commented 2 years ago

Why does checking correctness disable the warmup? It could be a good stress test to check correctness while running the entire workflow.

bosilca commented 2 years ago

I'd like to be able to disable the warmup (sometimes I just want a quick run, or I'm launching a problem size that takes 2 hours on a machine where hours are counted).

I'd be against having no warmup while generating performance numbers. If you don't care about performance, just run the warmup anyway; you should not be allowed to do the opposite.

bosilca commented 2 years ago

After talking about this on 06/10/22, we decided to run a smaller, single-node benchmark on each node as a warmup.

bosilca commented 2 years ago

Any progress on this?

abouteiller commented 2 years ago

I promised I'd share the code I used to evaluate on HIP. In this version I do two steps:

  1. I go over the full matrix with a new JDF 'zwarmup' that essentially reads all tiles of A from a GPU (an n^2 memory-traffic cost; the GPU is selected by the scheduler, so effectively at random).
  2. I use the submatrix concept to perform a smaller POTRF before going for the full version.

I note that the submatrix approach is simpler than what is in this PR.

diff --git a/tests/testing_zpotrf.c b/tests/testing_zpotrf.c
index 5245adb9..48048b5b 100644
--- a/tests/testing_zpotrf.c
+++ b/tests/testing_zpotrf.c
@@ -39,6 +39,23 @@ int main(int argc, char ** argv)
         parsec_matrix_sym_block_cyclic, (&dcA, PARSEC_MATRIX_COMPLEX_DOUBLE,
                                    rank, MB, NB, LDA, N, 0, 0,
                                    N, N, P, nodes/P, uplo));
+
+
+    printf("+++ Warming up matrix ... \n");
+    SYNC_TIME_START();
+    dplasma_zplghe(parsec, (double)(N), uplo, &dcA, random_seed);
+    dplasma_zwarmup(parsec, &dcA);
+    SYNC_TIME_PRINT(rank, ("WARMUP_blocks\tPxQpg %3d %-3d %d NB= % 4d N=%7d\n", \
+                P, Q, gpus, NB, N));
+    if(loud > 3) printf("+++ Warming up cublas/rocm ... \n");
+    SYNC_TIME_START();
+    parsec_matrix_sym_block_cyclic_t *dcW;
+    int Nw = dplasma_imin(N, dplasma_imax(4*MB*P, 4*NB*Q));
+    dcW = parsec_tiled_matrix_submatrix(&dcA, 0, 0, Nw, Nw);
+    dplasma_zpotrf(parsec, uplo, dcW);
+    //TODO: parsec_matrix_destroy(dcW)
+    SYNC_TIME_PRINT(rank, ("WARMUP_potrf\tPxQxg %3d %-3d %d NB= %4d N= %7d\n", \
+                P, Q, gpus, NB, Nw));
     int t;
     for(t = 0; t < nruns; t++) {
         /* matrix (re)generation */
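
A note on the warmup size in the hunk above: Nw = dplasma_imin(N, dplasma_imax(4*MB*P, 4*NB*Q)) picks a submatrix spanning at least 4 tiles per process-grid dimension, capped by the full problem size, presumably so each rank owns several tiles of it (and exercises its GPU) before the timed run. A small self-contained sketch of that heuristic, with purely illustrative values that do not come from the PR:

    #include <stdio.h>

    /* Stand-ins for dplasma_imin/dplasma_imax. */
    static int imin(int a, int b) { return a < b ? a : b; }
    static int imax(int a, int b) { return a > b ? a : b; }

    int main(void)
    {
        /* Illustrative values only. */
        int MB = 512, NB = 512;  /* tile dimensions */
        int P = 4, Q = 4;        /* process grid */
        int N = 65536;           /* full matrix order */

        /* At least 4 tiles per grid dimension, never more than N. */
        int Nw = imin(N, imax(4 * MB * P, 4 * NB * Q));
        printf("warmup POTRF size Nw = %d\n", Nw);  /* 8192 here */
        return 0;
    }
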
bosilca commented 1 year ago

As discussed on 03/31/23, some parts of this PR should be salvaged and integrated into #69. Meanwhile, this PR will be closed.