flame / blis

BLAS-like Library Instantiation Software Framework

sgemm() fails on skx #182

Closed fgvanzee closed 6 years ago

fgvanzee commented 6 years ago

Clement Pernet reports that the following test driver file, sgemm_bug.c

#include <stdlib.h>
#include <stdio.h>
#include "cblas.h"

int main()
{
    /* C (2x1) = A (2x1) * B (1x1), row-major */
    float* a = malloc(sizeof(float)*2);
    float* b = malloc(sizeof(float));
    float* c = malloc(sizeof(float)*2);
    a[0] = a[1] = 1;
    b[0] = 2;
    c[0] = c[1] = 0;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 1, 1, 1.0, a, 1, b, 1, 0.0, c, 1);
    printf("C = [%f %f]\n", c[0], c[1]);
    free(a); free(b); free(c);
    return 0;
}

fails when compiled and linked against an skx-configured BLIS and run via:

$ gcc sgemm_bug.c -L/local/cpernet/builds/lib -lblis -lpthread -lm
$ ./a.out 
C = [2.000000 0.000000]

The correct answer should be [ 2.0 2.0 ].

devinamatthews commented 6 years ago

Hi Clement (I will cross-post to blis-devel): can you tell us precisely which git commit you are working from?

Also, @dnparikh: I can't remember if the code you were porting SKX from had explicit sgemm support? I wrote it at some point but definitely didn't test it as well as dgemm.

fgvanzee commented 6 years ago

I can reproduce this bug on stampede2's skylake nodes, though I had to add the include directory:

$ gcc sgemm_bug.c -I../blis/include/blis ../blis/lib/libblis.a -lpthread -lm
$ ./a.out 
C = [2.000000 0.000000]

EDIT: Can also confirm that disabling the optimized sgemm microkernel for skx (which implicitly enables the reference microkernel) results in the expected answer:

C = [2.000000 2.000000]

ClementPernet commented 6 years ago

@devinamatthews : I tested it on master at commit e2192a8fd58ec3657434ddd407033e097edad8f4

fgvanzee commented 6 years ago

@ClementPernet This bug should be fixed now, as of ae9a5be. (I can no longer reproduce any of the errors observed previously.) Please try it out and let us know if you encounter any other issues; if you do, just open another issue here on GitHub. Thanks for your feedback (and also for going to the trouble of providing us a simple test driver), and please stay involved to the extent that you feel inclined!

ClementPernet commented 6 years ago

Great. I confirm that it fixes the issue on the simple test file I sent you, and also on my application which triggered the bug in the first place. Thanks for the quick fix! I'm willing to start playing a bit more with BLIS in the near future from the viewpoint of linear algebra over finite fields, so I will certainly stay involved.

fgvanzee commented 6 years ago

@ClementPernet Glad we could fix your issue.

I don't know much about finite fields. Out of curiosity, can you give me a sense of the type of problems you usually do? Is it typically gemm? What matrix dimensions do you typically work with? And are your matrices usually all square-ish, or do the matrix dimensions (m, n, k) differ significantly? Also, how did you learn about BLIS?

ClementPernet commented 6 years ago

Well, we are developing the fflas-ffpack library https://github.com/linbox-team/fflas-ffpack , providing a set of routines for linear algebra over a finite field. The idea is to stick as close as possible to the standard BLAS and LAPACK APIs whenever it makes sense (some notions carry over from R or C to a finite field, some don't), so all use cases are possible in terms of problem dimensions. Our approach is to build code as much as possible on gemm, in the same way as modern BLAS and LAPACK and for the same reason (best arithmetic intensity), but also because we use Strassen's subcubic algorithm, which improves even further on the BLAS's speed. I have been following the FLAME project, and more recently the BLIS development, for a while, as I found the approach of automating linear algebra code generation very interesting.

jianyuh commented 6 years ago

@ClementPernet Thanks for sharing your application!

You mentioned Strassen's subcubic algorithm. Actually, we also did some recent work on practical implementations of Strassen and other fast matrix multiplication algorithms based on BLIS, which show better performance than BLAS gemm on matrices of moderate dimensions and require no additional workspace. In case you might be interested, the related papers and code repositories are:

Strassen's Algorithm Reloaded, SC16.
Generating Families of Practical Fast Matrix Multiplication Algorithms, IPDPS17.
Strassen's Algorithm for Tensor Contraction, accepted by SISC.
https://github.com/flame/fmm-gen
https://github.com/flame/tblis-strassen

fgvanzee commented 6 years ago

@ClementPernet @SudoNohup I'm going to close this issue, but please feel free to continue your discussion here if you like.