OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

`omatcopy` much slower than `copy` in OMP loop #4902

Open david-cortes opened 1 month ago

david-cortes commented 1 month ago

I'm trying to use cblas_domatcopy to transpose large row-major matrices.

I'm finding that it is slower than a simple loop of cblas_dcopy calls parallelized with OpenMP (with the number of threads set to the number of logical cores; otherwise the OMP loop is much slower).

cblas_domatcopy appears to be especially slow when the input has more columns than rows. A dcopy loop likewise shows a large timing difference depending on whether the copies go by rows of the input or by rows of the output, so I'm guessing that omatcopy always traverses the matrix in the same order.
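
To make the two traversal orders concrete, here is a minimal plain-C sketch, not taken from OpenBLAS and not using BLAS at all. The function names `transpose_by_input_rows` and `transpose_by_output_rows` are hypothetical. Both produce the same transposed matrix; they differ only in whether the reads or the writes are contiguous, which is the difference the timing observation above is about.

```c
#include <assert.h>
#include <stddef.h>

/* Both functions compute B = A^T, where A is row-major (nrows x ncols)
   and B is row-major (ncols x nrows). Names are illustrative only. */

/* Walk the input row by row: contiguous reads, writes strided by nrows. */
void transpose_by_input_rows(const double *A, int nrows, int ncols, double *B)
{
    assert(nrows >= 0 && ncols >= 0);
    for (int row = 0; row < nrows; row++)
        for (int col = 0; col < ncols; col++)
            B[(size_t)col * (size_t)nrows + (size_t)row] =
                A[(size_t)row * (size_t)ncols + (size_t)col];
}

/* Walk the output row by row: reads strided by ncols, contiguous writes. */
void transpose_by_output_rows(const double *A, int nrows, int ncols, double *B)
{
    assert(nrows >= 0 && ncols >= 0);
    for (int col = 0; col < ncols; col++)
        for (int row = 0; row < nrows; row++)
            B[(size_t)col * (size_t)nrows + (size_t)row] =
                A[(size_t)row * (size_t)ncols + (size_t)col];
}
```

Which order is faster for a given shape depends on the hardware and on the matrix dimensions, so this only illustrates the access patterns and does not predict timings.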

(code is provided at the end of this post)

OpenBLAS version: 0.3.26, OpenMP variant.

Code that I'm using for the OMP dcopy loop:

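#include <stddef.h>  /* size_t */
#include <cblas.h>   /* cblas_dcopy */

/* Transpose row-major A (nrows x ncols) into row-major B (ncols x nrows). */
void transpose_mat(const double *A, const int nrows, const int ncols, double *B, int nthreads)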
{
    if (nrows >= ncols)
    {
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int row = 0; row < nrows; row++)
            cblas_dcopy(ncols, A + (size_t)row*(size_t)ncols, 1, B + row, nrows);
    }

    else
    {
        /* More columns than rows: one dcopy per output row (strided reads, contiguous writes). */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int col = 0; col < ncols; col++)
            cblas_dcopy(nrows, A + col, ncols, B + (size_t)col*(size_t)nrows, 1);
    }
}

martin-frbg commented 1 month ago

related to #1243 - the current ?matcopy code is indeed just a fairly poorly optimized stopgap implementation
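
For context on what a less order-sensitive transpose can look like, below is a minimal sketch of a cache-blocked (tiled) transpose in plain C. This is a generic, well-known technique, not the OpenBLAS implementation and not a proposed patch. The name `transpose_tiled` and the tile edge `TILE` are hypothetical, and the value 32 is an untuned assumption.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32 };  /* hypothetical tile edge; not tuned for any CPU */

/* B = A^T for row-major A (nrows x ncols); B is row-major (ncols x nrows).
   The matrix is processed in TILE x TILE blocks so that both the reads
   from A and the writes to B stay within a small working set. */
void transpose_tiled(const double *A, int nrows, int ncols, double *B)
{
    assert(nrows >= 0 && ncols >= 0);
    const size_t nr = (size_t)nrows, nc = (size_t)ncols;

    for (size_t r0 = 0; r0 < nr; r0 += TILE) {
        const size_t r1 = (r0 + TILE < nr) ? r0 + TILE : nr;  /* clip last tile */
        for (size_t c0 = 0; c0 < nc; c0 += TILE) {
            const size_t c1 = (c0 + TILE < nc) ? c0 + TILE : nc;
            for (size_t row = r0; row < r1; row++)
                for (size_t col = c0; col < c1; col++)
                    B[col * nr + row] = A[row * nc + col];
        }
    }
}
```

Each tile writes a disjoint region of B, so the outer tile loop could in principle be parallelized with the same kind of OpenMP pragma used in the post. Whether that would outperform the dcopy loop is untested here.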