I'm trying to use cblas_domatcopy to transpose large row-major matrices.
I'm finding that the function is slower than a simple loop of cblas_dcopy calls parallelized with OpenMP (with the number of threads set to the number of logical cores; otherwise the OMP loop is much slower).
cblas_domatcopy appears to be especially slow when the input has more columns than rows. In the dcopy loop there's also a large timing difference depending on whether the copies go by rows of the input or by rows of the output, so I'm guessing that omatcopy always traverses in the same order regardless of shape.
(code is provided at the end of this post)
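The omatcopy call I'm comparing against looks roughly like this (a minimal sketch of a row-major out-of-place transpose with alpha = 1; the wrapper name transpose_with_omatcopy is just for illustration, not part of the benchmark code):

#include <cblas.h>   /* OpenBLAS header declaring cblas_domatcopy */

/* A is nrows x ncols row-major; B receives the ncols x nrows transpose. */
void transpose_with_omatcopy(const double *A, int nrows, int ncols, double *B)
{
    cblas_domatcopy(CblasRowMajor, CblasTrans,
                    nrows, ncols,   /* dimensions of the source A      */
                    1.0,            /* alpha = 1, i.e. a plain copy    */
                    A, ncols,       /* lda = ncols for row-major A     */
                    B, nrows);      /* ldb = nrows for row-major B^T   */
}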
Timings in seconds on an Intel 12700H, average of 7 runs:
Input size: 100,000 x 5,000
OpenBLAS cblas_domatcopy: 3.12
OpenMP dcopy loop: 2.38
MKL MKL_Domatcopy: 1.26
Input size: 5,000 x 100,000
OpenBLAS cblas_domatcopy: 3.74
OpenMP dcopy loop: 1.23
MKL MKL_Domatcopy: 1.27
Timings in seconds on an AMD Ryzen 7840HS, average of 7 runs:
Input size: 100,000 x 5,000
OpenBLAS cblas_domatcopy: 0.922
OpenMP dcopy loop: 0.586
MKL MKL_Domatcopy: 0.560
Input size: 5,000 x 100,000
OpenBLAS cblas_domatcopy: 1.12
OpenMP dcopy loop: 0.402
MKL MKL_Domatcopy: 0.516
OpenBLAS version: 0.3.26, OpenMP variant.
Code that I'm using for the OMP dcopy loop:
#include <stddef.h>   /* size_t */
#include <cblas.h>    /* cblas_dcopy */

/* Out-of-place transpose: A is nrows x ncols row-major,
   B receives the ncols x nrows transpose. */
void transpose_mat(const double *A, const int nrows, const int ncols, double *B, int nthreads)
{
    if (nrows >= ncols)
    {
        /* Copy each row of A into a column of B: contiguous reads, strided writes. */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int row = 0; row < nrows; row++)
            cblas_dcopy(ncols, A + (size_t)row*(size_t)ncols, 1, B + row, nrows);
    }
    else
    {
        /* Copy each column of A into a row of B: strided reads, contiguous writes. */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int col = 0; col < ncols; col++)
            cblas_dcopy(nrows, A + col, ncols, B + (size_t)col*(size_t)nrows, 1);
    }
}
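For completeness, a driver of the kind I'm timing looks roughly like this (a sketch only, not the actual benchmark harness; the size and thread count below are just one of the cases, and the real measurements average 7 runs):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int nrows = 100000, ncols = 5000;
    double *A = malloc((size_t)nrows * (size_t)ncols * sizeof *A);
    double *B = malloc((size_t)nrows * (size_t)ncols * sizeof *B);
    if (!A || !B) return 1;

    for (size_t i = 0; i < (size_t)nrows * (size_t)ncols; i++)
        A[i] = (double)i;   /* arbitrary fill */

    double t0 = omp_get_wtime();
    transpose_mat(A, nrows, ncols, B, omp_get_max_threads());
    double t1 = omp_get_wtime();
    printf("transpose_mat: %.3f s\n", t1 - t0);

    free(A);
    free(B);
    return 0;
}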