Reference-LAPACK / lapack

LAPACK development repository
Other
1.49k stars 436 forks source link

cblas_dgbmv: Result of row-major is not consistent to column-major #1032

Closed ajz34 closed 2 months ago

ajz34 commented 2 months ago

Description

cblas_dgbmv function (banded y = alpha op(A) x + beta y) seems not showing consistent results between column-major and row-major.

I guess that simply changing transposition flag in the following code (for rea/double cases) may work for general multiplication, but not suitable for banded multiplication: https://github.com/Reference-LAPACK/lapack/blob/694e3375b0e61c70c98ef9cb46013170cd99f2a5/CBLAS/src/cblas_dgbmv.c#L62-L64

I feel that, given that CBLAS has been proposed as very basic linear library since last century, possibility of a bug is quite small (or either cblas_?gbmv functions are seldomly used by other projects). So I'm not sure whether this is a bug, or I thought in wrong way. But I can't convince myself on this problem.

Minimal working example

In the following code, I can confirm that by changing matrix A from column-major to row-major, cblas_dgemv gives the correct result. While for cblas_dgbmv, results of column-major and row-major does not seems to be the same, which is probably not desired by the design of CBLAS.

#include <stdio.h>
#include "cblas.h"

void print_array(double *arr, int n) {
    for (int i = 0; i < n; i++) {
        printf("%5.0f", arr[i]);
    }
    printf("\n");
}

int main() {
    const int m = 10;
    const int n = 8;
    const int k = 7;
    const int kl = 2;
    const int ku = 4;
    double A[k * n];
    const int incx = 1;
    const int incy = 1;
    double alpha = 1.0;
    double beta = 1.0;
    double x[n]; for (int i = 0; i < n; ++i) x[i] = i;
    double y[m];
    double z[k];
    int lda;

    /*  Testing two problems:
        1. y = band(A) x + y
            A: k * n (before band); band(A): m * n (after band); x: n; y: m
        2. z = A x + z
            A: k * n; x: n; z: k
     */

    // ColMajor Indexing
    for (int i = 0; i < k; i++)
        for (int j = 0; j < n; j++)
            A[j * k + i] = 10 * i + j;
    lda = k;
    for (int i = 0; i < m; ++i) y[i] = i;
    for (int i = 0; i < k; ++i) z[i] = i;
    cblas_dgbmv(CblasColMajor, CblasNoTrans, m, n, kl, ku, alpha, A, lda, x, incx, beta, y, incy);
    cblas_dgemv(CblasColMajor, CblasNoTrans, k, n, alpha, A, lda, x, incx, beta, z, incy);
    printf("y = band(A) x  "); print_array(y, m);
    printf("z =      A  x  "); print_array(z, k);
    /* Output
        y = band(A) x    130  256  443  703  913 1040 1072  997  803  478
        z =      A  x    140  421  702  983 1264 1545 1826
     */

    // RowMajor Indexing
    for (int i = 0; i < k; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] = 10 * i + j;
    lda = n;
    for (int i = 0; i < m; ++i) y[i] = i;
    for (int i = 0; i < k; ++i) z[i] = i;
    cblas_dgbmv(CblasRowMajor, CblasNoTrans, m, n, kl, ku, alpha, A, lda, x, incx, beta, y, incy);
    cblas_dgemv(CblasRowMajor, CblasNoTrans, k, n, alpha, A, lda, x, incx, beta, z, incy);
    printf("y = band(A) x  "); print_array(y, m);
    printf("z =      A  x  "); print_array(z, k);
    /* Output
        y = band(A) x     50  221  513  955 1169 1315 1364    7    8    9
        z =      A  x    140  421  702  983 1264 1545 1826
     */

    return 0;
}

Checklist

ajz34 commented 2 months ago

Additional note This program is linked with OpenBLAS v0.3.27, instead of reference CBLAS. However, CBLAS implementation of cblas_dgbmv should be virtually the same.

martin-frbg commented 2 months ago

The CBLAS implementation (wrapper) may be functionally the same (I have not checked it yet), but the underlying implementation of GBMV is certainly different

vladimir-ch commented 2 months ago

You're using the LAPACKE way of storing a banded matrix - take a rectangle of matrix entries within the band, columns in columns, diagonals in rows, and store that rectangle either in column- or row- major order.

Unfortunately and very surprisingly, CBLAS uses a different storage:

I don't know how this historically happened (I'd be curious to know) but that's how it is.

ajz34 commented 2 months ago

@vladimir-ch Seems like that. In other words, dimension that represents (kl+ku+1) is always contiguous in memory, in current implementation of CBLAS convention.

Also, it seems that current implementation of CBLAS convention is different to LAPACKE convention. For example of lapacke_dgbsv_work.c, when input matrix is row-major, it simply make explicit memory clone to col-major, then export output result by memory to row-major again. In this way, results of LAPACK_ROW_MAJOR can be the same to LAPACK_COL_MAJOR, which seems to be more appearant to me.

In most cases, LAPACK involves BLAS-3 to perform $O(N^3)$ computations; so (perhaps) explicit matrix transposition of $O(N^2)$ memory-bounded is acceptable. However, for BLAS-2 (such as DGBMV), the function itself is $O(N^2)$ memory-bouonded computation; explicit matrix transposition not only increases memory cost, but also expected to double computation time. So it is also reasonable for CBLAS not performing explicit memory transposition.

So after your reply, I thought I'm not sure whether this should be considered as bug, or a CBLAS-specific convention 😂

disclamer: I'm working on some codes that wrap BLAS functions, so I found this convention inconsistency (or perhaps bug). For the tasks I'm actually dealing with, I only need dense/symm/triangular BLAS-3 (or its extensions); so this inconsistency actually does not affect my work. But it's still surprising to find that some CBLAS functions does not work as the same way as I used to believe.

vladimir-ch commented 2 months ago

The CBLAS convention for band matrices is well documented, see

so it's definitely not a CBLAS bug.

The only LAPACKE documentation I've found doesn't explicitly mention band matrices, so the reference implementation is practically the documentation. And LAPACKE_dgb_trans clearly assumes the input array to be (ku+kl+1)*n for both column- and row-major layouts and it does a plain transpose of the input rectangular array, copying only the referenced elements. All callers of this routine check that ldab < n which further supports the expected shape of the input array and that LAPACKE band matrix layout differs from CBLAS. Even if it's unintended, it cannot be changed without breaking all code that has adapted to this layout.

ajz34 commented 2 months ago

@vladimir-ch Oh this is very clear! I used to learn BLAS functions only from doxygen html, but haven't noticed documents from BLAS technical forum. This issue discussion and BLAS forum documents really helped a lot, and made clarify to understanding to the design conceptions of APIs.

To my understanding to these documents, CBLAS convention prefers layout that contiguous data in full matrix is also contiguous in banded (or triangular packed) matrix, which is more cache friendly to computation. As a side effect, parameter descriptions (in function signature) in legacy BLAS (col-major) functions may not be directly applicable to CBLAS (row-major allowed) in banded or triangular packed cases. This design has already been addressed by discussions of BLAS forum; it is reasonable and not an unintentional bug.