OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.35k stars 1.49k forks source link

DGELSD gives incorrect results on v0.3.6 with SKYLAKEX #2187

Closed apataki closed 4 years ago

apataki commented 5 years ago

It seems like the DGELSD least squares solver via SVD produces incorrect results when built with TARGET=SKYLAKEX or DYNAMIC_ARCH=1 and run on a SKYLAKEX node. The processor we use is the Xeon 6148.

Here is a sample code trying to recover an exact result via fitting a small polynomial:

#include <stdio.h>

#define N 512
#define M 8

double X[N];
double Y[N];
double A[M][N];
double S[M];
double work[N*100];
int lwork = N*100;
int iwork[N*100];

void dgelsd_(int *MM, int *NN, int *NRHS, double *A, int *LDA, double *B, int *LDB,
             double *S, double *RCOND, int *RANK, double *WORK, int *LWORK, int *IWORK, int *INFO);

int main()
{
    int i, j;
    double x, dx, xn;

    dx = 1.0 / (N+1);
    for (i = 0; i < N; i++) {
        x = i*dx;
        Y[i] = 2*x + 3*x*x*x;

        // Generate the Vandermonde matrix
        xn = 1.0;
        for (j = 0; j < M; j++) {
            A[j][i] = xn;
            xn *= x;
        }
    }

    int NN = N;
    int MM = M;
    int NRHS = 1;
    int LDB = N;
    double RCOND = 1e-10;
    int r;
    int info;
    dgelsd_(&NN, &MM, &NRHS, A, &NN, Y, &LDB, S, &RCOND, &r, work, &lwork, iwork, &info);

    printf("info: %d\n", info);

    for (int i = 0; i < M; i++) {
        printf("    %9.6f\n", Y[i]);
    }

    return 0;
}

The correct output (on Broadwell for example):

info: 0
    -0.000000
     2.000000
    -0.000000
     3.000000
    -0.000000
     0.000000
    -0.000000
     0.000000

On the 6148 with TARGET=SKYLAKEX I get:

info: 0
    -1.842238
    -1.542870
     1.389274
    -1.695241
     0.884713
    -0.855344
    -0.331920
    -0.538012

If I reduce the polynomial so that M < 8, the relevant matrix dimension becomes < 8 and the code works correctly. 8 being the magic number for AVX-512 and doubles, I wonder if the AVX-512 implementation of some subfunction is incorrect. Interestingly enough, the singular values computed are correct (not printed here).

martin-frbg commented 5 years ago

This is almost certainly another manifestation of the DGEMM bug (#2029), hopefully this new testcase will help to track it down. (Current develop branch simply removes the AVX512 dgemm and helper functions so should be correct again).

apataki commented 5 years ago

Indeed - confirmed that the development branch doesn't show this problem.

apataki commented 5 years ago

I'm a bit confused about how the DGEMM+Skylake issue has been outstanding for almost half a year, and how the current stable release version has such a huge unpatched bug in it for the processor that is currently actively sold for new servers (and the current stable release has a release date that is months newer than the discovery date of this bug). While it is nice that the AVX-512 DGEMM is disabled on the development branch, I doubt that most people would build the development branch for general use.

Almost the most trivial DGEMM example reproduces the problem for me on a Skylake CPU. Note: all the indexing is done in a FORTRAN manner in this test:

#include <stdio.h>
#include <math.h>

#define L  8
#define M  8
#define N  8

double A[M][L];
double B[N][M];

double C1[N][L];
double C2[N][L];

void dgemm_(char *TRANSA, char *TRANSB,
            int *MM, int *NN, int *KK,
            double *ALPHA, double *A, int *LDA, double *B, int *LDB,
            double *BETA, double *C, int *LDC);

int main()
{
    int i, j, k;
    double tmp, err;

    // Fill the inputs with something
    for (i = 0; i < M; i++) {
        for (j = 0; j < L; j++) {
            A[i][j] = i+2*j;
        }
    }
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            B[i][j] = i+2*j;
        }
    }

    // Manual matrix multiplication (FORTRAN indexing order!)
    for (j = 0; j < N; j++) {
        for (i = 0; i < L; i++) {
            tmp = 0;
            for (k = 0; k < M; k++) {
                tmp += A[k][i] * B[j][k];
            }
            C2[j][i] = tmp;
        }
    }

    // DGEMM multiplication
    double alpha = 1.0;
    double beta = 0.0;
    int LL = L;
    int MM = M;
    int NN = N;
    dgemm_("N", "N", &LL, &NN, &MM, &alpha, A, &LL, B, &MM, &beta, C1, &LL);

    // Comparison
    err = 0.0;
    for (i = 0; i < N; i++) {
        for (j = 0; j < L; j++) {
            tmp = C1[i][j] - C2[i][j];
            err += tmp*tmp;
        }
    }
    printf("Error: %10.2e\n", sqrt(err));

    return 0;
}

The output on Broadwell:

Error:   0.00e+00

on Skylake (Xeon Gold 6148):

Error:   1.38e+03
martin-frbg commented 5 years ago

Simply put, there is currently no sizeable team or organization behind OpenBLAS (and never has been), despite the importance that the project appears to have gained over the years. The AVX512 code was contributed by someone (working for Intel no less) who has not commented here since april, the bug reports available at that time were from big packages (R, computational chemistry and the like) with no simple reproducer and based on them I had mistakenly assumed that disabling only parts of the AVX512 code was sufficient to work around the problem in 0.3.6. I had meant to release 0.3.7 about two months ago, but as I am doing this in my spare time that date has unfortunately slipped.

aaichsmn commented 5 years ago

I'm happy to wait and want to commend you for the quality of the distributions I've installed. The fact that they nearly drop in and pass their self tests up to this point has saved me inestimable time that I can use on other projects.

Again, thanks for your effort and good luck getting this one figured out. My assembly coding days ended with the 808x or I'd try to spot the problem myself :-).

martin-frbg commented 5 years ago

@apataki the pure DGEMM bug demonstrated by your latest reproducer may actually be caused by my misguided attempt to fix things (at least I cannot reproduce it in SDE with all parts of the original contribution active). One confusing datapoint from this whole mess is that the AVX512 code appeared to fail when called as DSYMM although it seemed to work fine in the general (DGEMM) case.

martin-frbg commented 4 years ago

Fixed by wjc404's new AVX512 DGEMM kernel from #2286