romanis opened this issue 3 years ago
@romanis I prefer to use index notation, and I think (correct me if I am wrong) that this leads to the following expression for your operation: $P_i = \sum_{jk} W_{ij} X_{ik} V_{jk}$. I would then rearrange it as $P_i = \sum_{jk} W_{ij} V_{jk} X_{ik}$, or $P = \mathrm{diag}(WVX^T)$. I also use this operation sometimes (I call it `gemm3diag`), currently implemented by splitting it into two steps: `Q = VX^T; P = diag(WQ)` or `Q = WV; P = diag(QX^T)`. I have so far hand-coded the second operation since it is pretty simple and also memory-bound; `gemm` in any form is definitely not the right operation for that second step.

Ideally one would fuse the two steps together to save on the extra I/O in the second step. In fact, in a forthcoming version of BLIS you will be able to do exactly that! We've got a few other operations in mind which would similarly benefit from fusion, so we'll make sure this one is on the list.
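For concreteness, here is a minimal sketch of the second decomposition, `Q = WV; P = diag(QX^T)`, assuming row-major `double` arrays and a standard CBLAS interface; the function name and layout are illustrative, not an existing library routine:

```c++
#include <cblas.h>
#include <vector>

// P_i = sum_{j,k} W[i][j] * V[j][k] * X[i][k], computed in two steps.
// W: nrow x k1, V: k1 x k2, X: nrow x k2, P: length nrow (all row-major).
void gemm3diag_twostep(int nrow, int k1, int k2,
                       const double* W, const double* V,
                       const double* X, double* P)
{
    // Step 1: Q = W * V -- a regular, compute-bound gemm.
    std::vector<double> Q((size_t)nrow * k2);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nrow, k2, k1,
                1.0, W, k1, V, k2,
                0.0, Q.data(), k2);

    // Step 2: P_i = dot(Q_i, X_i) -- memory-bound, just a loop over rows.
    for (int i = 0; i < nrow; i++)
    {
        double sum = 0.0;
        for (int j = 0; j < k2; j++)
            sum += Q[(size_t)i*k2 + j] * X[(size_t)i*k2 + j];
        P[i] = sum;
    }
}
```

The second step touches each element of `Q` and `X` exactly once and does one multiply-add per load, which is why it is memory-bound and why wrapping it in `gemm` only adds overhead.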
@devinamatthews thank you for the very quick reply. You are almost right, just a minor change: I believe the correct formula is $P = \mathrm{diag}(W(XV)^T) = \mathrm{diag}(WV^TX^T)$.

But there is no operation that would skip the computation of the off-diagonal elements when it knows that it will need to discard them later.
To overcome this, I was thinking about casting the second operation as a sparse BLAS call where the memory containing the rows of W and of the product XV is spread out such that there are no off-diagonal elements in the sparse product. Have a sparse matrix $S_W$ like this:

$$
S_W = \begin{bmatrix}
W_{0,0} & \dots & W_{0,K} & 0 & \dots & 0 & \dots & 0 & \dots & 0 \\
0 & \dots & 0 & W_{1,0} & \dots & W_{1,K} & \dots & 0 & \dots & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \dots & 0 & 0 & \dots & 0 & \dots & W_{nrow,0} & \dots & W_{nrow,K}
\end{bmatrix}
$$

and similarly spread out the elements of XV into $S_{XV}$. But that still carries the overhead of telling the sparse BLAS where each row begins, how many nonzero elements there are, and so on, so basically it would be the same thing as using `gemm_batch`.
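Spelling the idea out (just to check the reasoning, using the $S_W$ and $S_{XV}$ defined above): because row $i$ of $S_W$ and row $j$ of $S_{XV}$ occupy disjoint column ranges whenever $i \neq j$,

$$
\left[ S_W S_{XV}^T \right]_{ij} = \sum_{l} (S_W)_{il} (S_{XV})_{jl} =
\begin{cases}
\sum_{k=0}^{K} W_{i,k}\,(XV)_{i,k} = P_i, & i = j, \\
0, & i \neq j,
\end{cases}
$$

so the sparse product is exactly $\mathrm{diag}(P)$, with the off-diagonal work eliminated by the sparsity structure rather than skipped by the kernel.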
I am not super familiar with the memory-bound and CPU-bound concepts, so I don't really understand why we can't just use `gemm` in the second stage if it gives us the right result.

And more importantly, the question is this: with MKL, `gemm_batch` gives us a 50-70% speedup depending on the size of the first matrix, whereas with BLIS, `gemm_batch` gives us roughly a 10x slowdown. How come? The code is the same and it even gives the same result eventually.

I would be glad if you could share the efficient implementation of `P = diag(QX^T)`.
The second operation, generically, is something like $C_i = \sum_j A_{ij} B_{ij}$. You might want to think of this either (a) as a sequence of `axpy`s, or (b) as a sequence of `dot`s. In fact, you can implement it as a loop around either of those primitives. I think you are effectively doing (b), but using `gemm` for the dot product. Maybe MKL is recognizing this and actually doing a dot product instead? BLIS isn't quite that smart and will do an actual `gemm`. Because of how `gemm` works, this introduces huge overheads due to unnecessary packing and operating on lots of zeros which are then thrown away.

A simple rule of thumb is that when you are calling `gemm` and one or more of m, n, k are 1, then you should instead be calling another BLAS operation.
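As a rough illustration of that rule of thumb (hypothetical helper names, assuming a standard CBLAS interface):

```c++
#include <cblas.h>

// m = n = 1: a (1 x k) * (k x 1) "gemm" is really just a dot product.
double degenerate_gemm_as_dot(int k, const double* a, const double* b)
{
    return cblas_ddot(k, a, 1, b, 1);
}

// n = 1: an (m x k) * (k x 1) "gemm" is really a matrix-vector product.
void degenerate_gemm_as_gemv(int m, int k, const double* A /* row-major */,
                             const double* x, double* y)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, k,
                1.0, A, k, x, 1, 0.0, y, 1);
}
```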
This code depends on an Eigen-like library for handling matrices, but the basic idea should be pretty self-explanatory:
```c++
/**
 * Form the diagonal elements of the matrix multiplication \f$ C = \alpha AB + \beta C \f$.
 *
 * @param alpha Scalar factor for the product `AB`.
 *
 * @param A     A `m`x`k` matrix or matrix view. Must have either a row or
 *              column stride of one.
 *
 * @param B     A `k`x`m` matrix or matrix view. Must have either a row or
 *              column stride of one.
 *
 * @param beta  Scalar factor for the original vector `C`.
 *
 * @param C     A vector or vector view of length `m`.
 */
inline void gemmdiag(double alpha, const cview<2>& A, const cview<2>& B,
                     double beta, const view<1>& C)
{
    auto m = A.length(0);
    auto n = A.length(1);

    assert(B.length(0) == n);
    assert(B.length(1) == m);
    assert(C.length(0) == m);

    if (A.stride(0) == 1 && B.stride(1) == 1 && C.stride(0) == 1)
    {
        // Columns of A and rows of B are contiguous: scale C once, then
        // accumulate one column/row pair at a time with unit-stride inner loops.
        double* __restrict c = &C[0];

        if (beta == 0)
        {
            for (int i = 0;i < m;i++)
                c[i] = 0;
        }
        else if (beta != 1.0)
        {
            for (int i = 0;i < m;i++)
                c[i] *= beta;
        }

        for (int j = 0;j < n;j++)
        {
            const double* __restrict a = &A[0][j];
            const double* __restrict b = &B[j][0];

            for (int i = 0;i < m;i++)
                c[i] += alpha*a[i]*b[i];
        }
    }
    else if (A.stride(1) == 1 && B.stride(0) == 1)
    {
        // Rows of A and columns of B are contiguous: each C[i] is a
        // unit-stride dot product.
        for (int i = 0;i < m;i++)
        {
            const double* __restrict a = &A[i][0];
            const double* __restrict b = &B[0][i];

            double sum = 0;
            for (int j = 0;j < n;j++)
                sum += a[j]*b[j];

            if (beta == 0)
                C[i] = alpha*sum;
            else
                C[i] = alpha*sum + beta*C[i];
        }
    }
    else
    {
        // General strides: fall back to indexed accesses.
        for (int i = 0;i < m;i++)
        {
            double sum = 0;
            for (int j = 0;j < n;j++)
                sum += A[i][j]*B[j][i];

            if (beta == 0)
                C[i] = alpha*sum;
            else
                C[i] = alpha*sum + beta*C[i];
        }
    }

    do_flops(2*m*n);
}
```
The reason I used `gemm_batch` is that there is no `dot_batch` routine. That's why I am using `gemm` as `dot`; but you are right, what I am effectively doing is just computing dot products in a loop (or in a batched manner).
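For reference, a sketch of what this batched formulation boils down to, assuming MKL's `cblas_dgemm_batch` interface and row-major storage (illustrative only, not the exact code from the repository): every problem in the batch is a degenerate m = n = 1 `gemm`, i.e. a single dot product.

```c++
#include <mkl.h>
#include <vector>

// Second step as a batch of 1 x 1 "gemm"s: P[i] = dot(W_i, Q_i),
// where W and Q are nrow x ncol, row-major, and Q = X*V from step 1.
void diag_step_batched(int nrow, int ncol,
                       const double* W, const double* Q, double* P)
{
    CBLAS_TRANSPOSE transa = CblasNoTrans, transb = CblasTrans;
    MKL_INT m = 1, n = 1, k = ncol;
    MKL_INT lda = ncol, ldb = ncol, ldc = 1;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_size = nrow;   // one group of nrow identical problems

    std::vector<const double*> a(nrow), b(nrow);
    std::vector<double*> c(nrow);
    for (int i = 0; i < nrow; i++)
    {
        a[i] = W + (size_t)i*ncol;   // 1 x k: row i of W
        b[i] = Q + (size_t)i*ncol;   // 1 x k, transposed to k x 1: row i of Q
        c[i] = P + i;                // 1 x 1 result
    }

    cblas_dgemm_batch(CblasRowMajor, &transa, &transb, &m, &n, &k,
                      &alpha, a.data(), &lda, b.data(), &ldb,
                      &beta, c.data(), &ldc, 1, &group_size);
}
```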
I configured BLIS with the command `./configure -t openmp -p /home/xx/lib/blis -d DEBUG auto`, but I cannot find the symbol `cblas_dgemm_batch` in the generated `libblis.a`. Is there anything wrong with my command?

```
[xx@cn7 skx]$ ls
libblis.a  libblis.so  libblis.so.4
[xx@cn7 skx]$ nm libblis.a | grep cblas_dgemm_batch
[xx@cn7 skx]$
```
@FuncJ The CBLAS compatibility layer is disabled by default. Try adding `--enable-cblas` to your `configure` command.
If it behaves like BLAS1, e.g. AXPY or DOT, just write loop code and the compiler will do fine. There is no value at this point in history to using BLAS1 for anything. All it does is get in the way of compiler transformations.
Hi, I have an exotic but really simple linear algebra package that performs the so-called face-splitting product of two matrices and a vector. The matrices A and B have the same number of rows but potentially different numbers of columns (n1 and n2); the vector v has n1*n2 elements. The resulting product has the same number of rows as the two matrices. Element i of the result equals the Kronecker product of row A_i with row B_i, dot-multiplied by the vector v.

I have implemented a small C library that does the operation in 2 steps:
1. Compute the matrix product of matrix B [nrow x n2] with the reshaped vector v (the reshape turns v into an [n2 x n1] matrix).
2. Iterate i from 0 to nrow and compute the dot product of row A_i with row i of the result of the first step.

Mathematically this is equivalent to the initial problem, but this formulation lets me use standard BLAS routines.
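In symbols (with one consistent reshape convention; the actual code may lay out v differently):

$$
P_i = \left( A_{i,:} \otimes B_{i,:} \right) \cdot v
= \sum_{j=1}^{n_1} \sum_{k=1}^{n_2} A_{ij}\, B_{ik}\, V_{kj}
= \sum_{j=1}^{n_1} A_{ij} \left( B V \right)_{ij},
\qquad V_{kj} = v_{(j-1) n_2 + k},
$$

which is exactly the two-step scheme: first $Q = BV$, then one dot product per row.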
I have coded up 3 different ways to compute the product, and I verified that all 3 of them give the same result on a large variety of matrices:
1. `face_split_product_naive` - just naively runs for loops to compute the result in the most stupid but sure way.
2. `face_split_product_dgemm` - implements the two-step procedure, with hand-written iteration in the second step.
3. `face_split_product_dgemm_batched` - the same as 2 in the first step; in the second step, instead of iterating in a for loop, it uses the `cblas_dgemm_batch` routine provided by MKL and BLIS.

Here is the code of the C library:
I got 2 similar AWS instances to test the code, with top-of-the-line Intel and AMD processors. I compile it against either MKL or BLIS: MKL I install with the Intel-provided scripts, and the AMD BLIS I compile manually, following the manual, with:
I set OMP_NUM_THREADS to 16 since the nodes that I booked have 16 vCPUs. I compile my library and main with `-O3`, `-pthread`, and `-fopenmp`.

I run main like this: `./main nrow ncol_W ncol_X`, where nrow is the number of rows in matrices A and B, ncol_W is the number of columns in matrix A, and ncol_X is the number of columns in matrix B. Main generates random matrices (as std::vector containers) of the appropriate sizes, generates a vector v of the appropriate size, and allocates enough memory for the output (the last two parameters of the functions). Then it times the execution of the two ways to compute the product on the Intel and AMD instances.
Here are the results of the runs, with processor specs, on Intel:

and on AMD:
One observation is that the naive approach of iterating over rows in the second step is not that different between Intel and AMD, although I expected AMD to be faster here because of the enormously large L3 cache.

The batched approach, however, is completely backward on AMD: it is not only slower than on Intel, it is slower than the naive approach (even though, again, they give the same result).

If I run top while main is running, I see CPU usage on the Intel instance at ~1000%, which is far less than the 1600% it would show if it used all available threads, but the AMD instance shows 1300-1400% usage, so it genuinely tries to compute the result but is somehow super inefficient.
Any reasons for this behavior?