therault closed this 1 year ago
Why does checking correctness disable the warmup? Checking correctness while running the entire workflow can be a good stress case.
I'd like to be able to disable the warmup (sometimes I just want a quick run, or I'm launching a problem size that takes 2 hours on a machine where hours are counted).
I'm against generating performance numbers without a warmup. If you don't care about performance, you can still just run the warmup, but the opposite (reporting timings with no warmup) should not be allowed.
After talking about this on 06/10/22, we decided to run a smaller, single-node benchmark on each node as a warmup.
Any progress on this?
I promised I'd share the code I used to evaluate on HIP. In this version I do two steps:
I note that the submatrix approach is simpler than what is in this PR.
diff --git a/tests/testing_zpotrf.c b/tests/testing_zpotrf.c
index 5245adb9..48048b5b 100644
--- a/tests/testing_zpotrf.c
+++ b/tests/testing_zpotrf.c
@@ -39,6 +39,23 @@ int main(int argc, char ** argv)
parsec_matrix_sym_block_cyclic, (&dcA, PARSEC_MATRIX_COMPLEX_DOUBLE,
rank, MB, NB, LDA, N, 0, 0,
N, N, P, nodes/P, uplo));
+
+
+ printf("+++ Warming up matrix ... \n");
+ SYNC_TIME_START();
+ dplasma_zplghe(parsec, (double)(N), uplo, &dcA, random_seed);
+ dplasma_zwarmup(parsec, &dcA);
+ SYNC_TIME_PRINT(rank, ("WARMUP_blocks\tPxQxg %3d %-3d %d NB= %4d N= %7d\n", \
+ P, Q, gpus, NB, N));
+ if(loud > 3) printf("+++ Warming up cublas/rocm ... \n");
+ SYNC_TIME_START();
+ parsec_matrix_sym_block_cyclic_t *dcW;
+ int Nw = dplasma_imin(N, dplasma_imax(4*MB*P, 4*NB*Q));
+ dcW = parsec_tiled_matrix_submatrix(&dcA, 0, 0, Nw, Nw);
+ dplasma_zpotrf(parsec, uplo, dcW);
+ //TODO: parsec_matrix_destroy(dcW)
+ SYNC_TIME_PRINT(rank, ("WARMUP_potrf\tPxQxg %3d %-3d %d NB= %4d N= %7d\n", \
+ P, Q, gpus, NB, Nw));
int t;
for(t = 0; t < nruns; t++) {
/* matrix (re)generation */
As discussed on 03/31/23, some parts of this PR should be salvaged and integrated into #69. Meanwhile, this PR will be closed.
PR based on https://bitbucket.org/icldistcomp/dplasma/pull-requests/88 and https://bitbucket.org/icldistcomp/dplasma/pull-requests/89
For all DPLASMA testing drivers that support timing, add support for the --nruns option (defaults to 3 timed runs per execution), preceded by one warmup iteration of the loop.
-x forces --nruns to 0, meaning that only the warmup run (with no timing information displayed) is executed, to prepare the matrices to check.
Each DPLASMA tester does nruns+1 iterations of the main operation. The main operation is sometimes hard to define for operations that involve multiple DAGs; in that case, each DAG is run nruns+1 times. Only the last nruns timings are displayed, which removes artefacts such as the cost of initializing the mathematical library.
This patch also introduces some fixes in a few benchmarks (zheev, the CUDA-enabled DTD that did not manage the case where CUDA is compiled-in but there is no CUDA device available, and a few other issues).