DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.
For all DPLASMA testing drivers that support timing, add the --nruns option (default: 3 timed runs per execution), together with one warmup iteration whose timing is not displayed.
-x forces --nruns to be 0, meaning only the warmup run (without timing information displayed) is executed to prepare the matrices to check.
Each DPLASMA tester performs nruns+1 iterations of the main operation and displays only the last nruns timings, which removes artefacts such as the cost of initializing the mathematical library. The main operation is sometimes hard to define for testers that involve multiple DAGs; in that case, each DAG is executed nruns+1 times.
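The warmup-plus-timed-runs scheme described above can be sketched as follows. This is a minimal illustration and not the code from this patch: `bench_op_t`, `effective_nruns`, `run_with_warmup`, and `count_call` are hypothetical names, the operation is a stand-in callback rather than a PaRSEC taskpool, and timing uses the C11 `timespec_get` wall clock instead of whatever timer the testers actually use.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical signature for the operation being benchmarked. */
typedef void (*bench_op_t)(void *arg);

/* Check mode (-x) overrides the requested --nruns value with 0,
 * so that only the warmup iteration is executed. */
static int effective_nruns(int requested_nruns, int check_mode)
{
    return check_mode ? 0 : requested_nruns;
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Run `op` nruns+1 times. Iteration 0 is the warmup: it is executed
 * but its timing is neither stored nor printed. The timings of the
 * last nruns iterations are stored in timings[0..nruns-1].
 * Returns the total number of iterations executed. */
static int run_with_warmup(bench_op_t op, void *arg, int nruns, double *timings)
{
    int iterations = 0;
    assert(nruns >= 0);
    for (int run = 0; run <= nruns; run++) {
        double start = wall_seconds();
        op(arg);
        double elapsed = wall_seconds() - start;
        iterations++;
        if (run == 0) {
            continue; /* warmup: absorbs one-time costs such as library init */
        }
        timings[run - 1] = elapsed;
        printf("run %d: %.6f s\n", run, elapsed);
    }
    return iterations;
}

/* Stand-in operation for illustration: counts how often it is called. */
static void count_call(void *arg)
{
    (*(int *)arg)++;
}
```

Calling `run_with_warmup(count_call, &calls, effective_nruns(3, 0), timings)` executes the operation four times and prints three timings; with check mode enabled, `effective_nruns` returns 0 and only the silent warmup iteration runs.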
This patch also fixes a few benchmarks: zheev; the CUDA-enabled DTD, which did not handle the case where CUDA is compiled in but no CUDA device is available; and a few other issues.
PR based on https://bitbucket.org/icldistcomp/dplasma/pull-requests/88