romanis opened this issue 3 years ago
@romanis I prefer to use index notation, and I think (correct me if I am wrong) that this leads to the following expression for your operation: $P_i = \sum_{jk} W_{ij} X_{ik} V_{jk}$. I would then rearrange it as $P_i = \sum_{jk} W_{ij} V_{jk} X_{ik}$, or $P = \mathrm{diag}(WVX^T)$. I also use this operation sometimes (I call it `gemm3diag`), currently implemented by splitting it into two steps: `Q = VX^T; P = diag(WQ)` or `Q = WV; P = diag(QX^T)`. I have so far hand-coded the second operation since it is pretty simple and also memory-bound; `gemm` in any form is definitely not the right operation for that second step.

Ideally one would fuse the two steps together to save on the extra I/O in the second step. In fact, in a forthcoming version of BLIS you will be able to do exactly that! We've got a few other operations in mind which would similarly benefit from fusion, so we'll make sure this one is on the list.
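For concreteness, here is a minimal sketch of the second decomposition, `Q = WV; P = diag(QX^T)`, assuming row-major `double` arrays and a standard CBLAS interface; the function name and layout are illustrative, not an existing library routine:

```c++
#include <cblas.h>
#include <vector>

// P_i = sum_{j,k} W[i][j] * V[j][k] * X[i][k], computed in two steps.
// W: nrow x k1, V: k1 x k2, X: nrow x k2, P: length nrow (all row-major).
void gemm3diag_twostep(int nrow, int k1, int k2,
                       const double* W, const double* V,
                       const double* X, double* P)
{
    // Step 1: Q = W * V -- a regular, compute-bound gemm.
    std::vector<double> Q((size_t)nrow * k2);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nrow, k2, k1,
                1.0, W, k1, V, k2,
                0.0, Q.data(), k2);

    // Step 2: P_i = dot(Q_i, X_i) -- memory-bound, just a loop over rows.
    for (int i = 0; i < nrow; i++)
    {
        double sum = 0.0;
        for (int j = 0; j < k2; j++)
            sum += Q[(size_t)i*k2 + j] * X[(size_t)i*k2 + j];
        P[i] = sum;
    }
}
```

The second step touches each element of `Q` and `X` exactly once and does one multiply-add per load, which is why it is memory-bound and why wrapping it in `gemm` only adds overhead.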
@devinamatthews thank you for the very quick reply. You are almost right, just a minor change: I believe the correct formula is $P = \mathrm{diag}(W(XV)^T) = \mathrm{diag}(WV^TX^T)$.

But there is no operation that would skip the computation of the off-diagonal elements when it knows that it will need to discard them later.
To overcome this, I was thinking about casting the second operation as a sparse BLAS call where the memory containing the rows of W and of the product XV is spread out such that there are no off-diagonal elements in the sparse product. Have a sparse matrix $S_W$ like this:

$$
S_W = \begin{bmatrix}
W_{0,0} & \dots & W_{0,K} & 0 & \dots & 0 & \dots & 0 & \dots & 0 \\
0 & \dots & 0 & W_{1,0} & \dots & W_{1,K} & \dots & 0 & \dots & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \dots & 0 & 0 & \dots & 0 & \dots & W_{nrow,0} & \dots & W_{nrow,K}
\end{bmatrix}
$$

and similarly spread out the elements of XV into $S_{XV}$. But that still carries the overhead of telling the sparse BLAS where each row begins, how many nonzero elements there are, and so on, so basically it would be the same thing as using `gemm_batch`.
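Spelling the idea out (just to check the reasoning, using the $S_W$ and $S_{XV}$ defined above): because row $i$ of $S_W$ and row $j$ of $S_{XV}$ occupy disjoint column ranges whenever $i \neq j$,

$$
\left[ S_W S_{XV}^T \right]_{ij} = \sum_{l} (S_W)_{il} (S_{XV})_{jl} =
\begin{cases}
\sum_{k=0}^{K} W_{i,k}\,(XV)_{i,k} = P_i, & i = j, \\
0, & i \neq j,
\end{cases}
$$

so the sparse product is exactly $\mathrm{diag}(P)$, with the off-diagonal work eliminated by the sparsity structure rather than skipped by the kernel.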
I am not super familiar with the memory-bound and CPU-bound concepts, so I don't really understand why we can't just use `gemm` in the second stage if it gives us the right result.

And more importantly, the question is this: with MKL, `gemm_batch` gives us a 50-70% speedup depending on the size of the first matrix, whereas with BLIS, `gemm_batch` gives us roughly a 10x slowdown. How come? The code is the same and it even gives the same result eventually.

I would be glad if you could share the efficient implementation of `P = diag(QX^T)`.
The second operation, generically, is something like $C_i = \sum_j A_{ij} B_{ij}$. You might want to think of this either (a) as a sequence of `axpy`s, or (b) as a sequence of `dot`s. In fact, you can implement it as a loop around either of those primitives. I think you are effectively doing (b), but using `gemm` for the dot product. Maybe MKL is recognizing this and actually doing a dot product instead? BLIS isn't quite that smart and will do an actual `gemm`. Because of how `gemm` works, this introduces huge overheads due to unnecessary packing and operating on lots of zeros which are then thrown away.

A simple rule of thumb is that when you are calling `gemm` and one or more of m, n, k are 1, then you should instead be calling another BLAS operation.
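As a rough illustration of that rule of thumb (hypothetical helper names, assuming a standard CBLAS interface):

```c++
#include <cblas.h>

// m = n = 1: a (1 x k) * (k x 1) "gemm" is really just a dot product.
double degenerate_gemm_as_dot(int k, const double* a, const double* b)
{
    return cblas_ddot(k, a, 1, b, 1);
}

// n = 1: an (m x k) * (k x 1) "gemm" is really a matrix-vector product.
void degenerate_gemm_as_gemv(int m, int k, const double* A /* row-major */,
                             const double* x, double* y)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, k,
                1.0, A, k, x, 1, 0.0, y, 1);
}
```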
This code depends on an Eigen-like library for handling matrices, but the basic idea should be pretty self-explanatory:
```c++
/**
 * Form the diagonal elements of the matrix multiplication \f$ C = \alpha AB + \beta C \f$.
 *
 * @param alpha Scalar factor for the product `AB`.
 *
 * @param A     A `m`x`k` matrix or matrix view. Must have either a row or
 *              column stride of one.
 *
 * @param B     A `k`x`m` matrix or matrix view. Must have either a row or
 *              column stride of one.
 *
 * @param beta  Scalar factor for the original vector `C`.
 *
 * @param C     A vector or vector view of length `m`.
 */
inline void gemmdiag(double alpha, const cview<2>& A, const cview<2>& B,
                     double beta, const view<1>& C)
{
    auto m = A.length(0);
    auto n = A.length(1);

    assert(B.length(0) == n);
    assert(B.length(1) == m);
    assert(C.length(0) == m);

    if (A.stride(0) == 1 && B.stride(1) == 1 && C.stride(0) == 1)
    {
        // Columns of A and rows of B are contiguous: scale C once, then
        // accumulate one column/row pair at a time with unit-stride inner loops.
        double* __restrict c = &C[0];

        if (beta == 0)
        {
            for (int i = 0;i < m;i++)
                c[i] = 0;
        }
        else if (beta != 1.0)
        {
            for (int i = 0;i < m;i++)
                c[i] *= beta;
        }

        for (int j = 0;j < n;j++)
        {
            const double* __restrict a = &A[0][j];
            const double* __restrict b = &B[j][0];

            for (int i = 0;i < m;i++)
                c[i] += alpha*a[i]*b[i];
        }
    }
    else if (A.stride(1) == 1 && B.stride(0) == 1)
    {
        // Rows of A and columns of B are contiguous: each C[i] is a
        // unit-stride dot product.
        for (int i = 0;i < m;i++)
        {
            const double* __restrict a = &A[i][0];
            const double* __restrict b = &B[0][i];

            double sum = 0;
            for (int j = 0;j < n;j++)
                sum += a[j]*b[j];

            if (beta == 0)
                C[i] = alpha*sum;
            else
                C[i] = alpha*sum + beta*C[i];
        }
    }
    else
    {
        // General strides: fall back to indexed accesses.
        for (int i = 0;i < m;i++)
        {
            double sum = 0;
            for (int j = 0;j < n;j++)
                sum += A[i][j]*B[j][i];

            if (beta == 0)
                C[i] = alpha*sum;
            else
                C[i] = alpha*sum + beta*C[i];
        }
    }

    do_flops(2*m*n);
}
```
The reason I used `gemm_batch` is that there is no `dot_batch` routine. That's why I am using `gemm` as `dot`; but you are right, what I am effectively doing is just computing dot products in a loop (or in a batched manner).
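For reference, a sketch of what this batched formulation boils down to, assuming MKL's `cblas_dgemm_batch` interface and row-major storage (illustrative only, not the exact code from the repository): every problem in the batch is a degenerate m = n = 1 `gemm`, i.e. a single dot product.

```c++
#include <mkl.h>
#include <vector>

// Second step as a batch of 1 x 1 "gemm"s: P[i] = dot(W_i, Q_i),
// where W and Q are nrow x ncol, row-major, and Q = X*V from step 1.
void diag_step_batched(int nrow, int ncol,
                       const double* W, const double* Q, double* P)
{
    CBLAS_TRANSPOSE transa = CblasNoTrans, transb = CblasTrans;
    MKL_INT m = 1, n = 1, k = ncol;
    MKL_INT lda = ncol, ldb = ncol, ldc = 1;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_size = nrow;   // one group of nrow identical problems

    std::vector<const double*> a(nrow), b(nrow);
    std::vector<double*> c(nrow);
    for (int i = 0; i < nrow; i++)
    {
        a[i] = W + (size_t)i*ncol;   // 1 x k: row i of W
        b[i] = Q + (size_t)i*ncol;   // 1 x k, transposed to k x 1: row i of Q
        c[i] = P + i;                // 1 x 1 result
    }

    cblas_dgemm_batch(CblasRowMajor, &transa, &transb, &m, &n, &k,
                      &alpha, a.data(), &lda, b.data(), &ldb,
                      &beta, c.data(), &ldc, 1, &group_size);
}
```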
I configured BLIS with the command `./configure -t openmp -p /home/xx/lib/blis -d DEBUG auto`, but I cannot find the symbol `cblas_dgemm_batch` in the generated `libblis.a`. Is there anything wrong with my command?

```
[xx@cn7 skx]$ ls
libblis.a  libblis.so  libblis.so.4
[xx@cn7 skx]$ nm libblis.a | grep cblas_dgemm_batch
[xx@cn7 skx]$
```
@FuncJ The CBLAS compatibility layer is disabled by default. Try adding `--enable-cblas` to your `configure` command.
If it behaves like BLAS1, e.g. AXPY or DOT, just write loop code and the compiler will do fine. There is no value at this point in history to using BLAS1 for anything. All it does is get in the way of compiler transformations.
Hi, I have an exotic but really simple linear algebra package that performs the so-called face-splitting product of two matrices and a vector. The matrices A and B have the same number of rows but potentially different numbers of columns (n1 and n2); the vector v has n1*n2 elements. The resulting product has the same number of rows as the two matrices. Element i of the result equals the Kronecker product of row A_i with row B_i, dot-multiplied by the vector v.

I have implemented a small C library that does the operation in 2 steps:
1. Compute the matrix product of matrix B [nrow x n2] with the reshaped vector v (the reshape turns v into an [n2 x n1] matrix).
2. Iterate i from 0 to nrow and compute the dot product of row A_i with row i of the result of the first step.

Mathematically this is equivalent to the initial problem, but this formulation lets me use standard BLAS routines.
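In symbols (with one consistent reshape convention; the actual code may lay out v differently):

$$
P_i = \left( A_{i,:} \otimes B_{i,:} \right) \cdot v
= \sum_{j=1}^{n_1} \sum_{k=1}^{n_2} A_{ij}\, B_{ik}\, V_{kj}
= \sum_{j=1}^{n_1} A_{ij} \left( B V \right)_{ij},
\qquad V_{kj} = v_{(j-1) n_2 + k},
$$

which is exactly the two-step scheme: first $Q = BV$, then one dot product per row.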
I have coded up 3 different ways to compute the product, and I verified that all 3 of them give the same result on a large variety of matrices:
1. `face_split_product_naive` - just naively runs for loops to compute the result in the most stupid but sure way.
2. `face_split_product_dgemm` - implements the two-step procedure, with hand-written iteration in the second step.
3. `face_split_product_dgemm_batched` - the same as 2 in the first step; in the second step, instead of iterating in a for loop, it uses the `cblas_dgemm_batch` routine provided by MKL and BLIS.

Here is the code of the C library:
I got 2 similar AWS instances to test the code, with top-of-the-line Intel and AMD processors. I compile it against either MKL or BLIS: MKL I install with the Intel-provided scripts, and the AMD BLIS I compile manually, following the manual, with:
I set OMP_NUM_THREADS to 16 since the nodes that I booked have 16 vCPUs. I compile my library and main with `-O3`, `-pthread`, and `-fopenmp`.

I run main like this: `./main nrow ncol_W ncol_X`, where nrow is the number of rows in matrices A and B, ncol_W is the number of columns in matrix A, and ncol_X is the number of columns in matrix B. Main generates random matrices (as std::vector containers) of the appropriate sizes, generates a vector v of the appropriate size, and allocates enough memory for the output (the last two parameters of the functions). Then it times the execution of the two ways to compute the product on the Intel and AMD instances.
Here are the results of the runs, with processor specs, on Intel:

and on AMD:
One observation is that the naive approach of iterating over rows in the second step is not that different between Intel and AMD, although I expected AMD to be faster here because of the enormously large L3 cache.

The batched approach, however, is completely backward on AMD: it is not only slower than on Intel, it is slower than the naive approach (even though, again, they give the same result).

If I run top while main is running, I see CPU usage on the Intel instance at ~1000%, which is far less than the 1600% it would show if it used all available threads, but the AMD instance shows 1300-1400% usage, so it genuinely tries to compute the result but is somehow super inefficient.
Any reasons for this behavior?