Hi Devin, Thanks for this nice library!

I'm currently testing the library, and I found the following: For real tensors, tblis is substantially faster than a contraction using dgemm for the cases I tested. For double matrices, dgemm and tblis are about equal. For a test case of 1000 matrix multiplications of two 200 x 200 matrices, a timing gives (using OMP_NUM_THREADS=4 on a 4 core machine)

TBLIS: fastest of 1000 contractions: 281 microseconds TBLIS: average time per contraction: 390.587 microseconds DGEMM: fastest of 1000 contractions: 311 microseconds DGEMM: average time per contraction: 332.135 microseconds

For complex types however, the contractions are actually much slower than when using zgemm. When I compile and run the code below (1000 matrix multiplications of two 200 x 200 matrices), I get the following output:

TBLIS: fastest of 1000 contractions: 14917 microseconds TBLIS: average time per contraction: 15622.9 microseconds ZGEMM: fastest of 1000 contractions: 932 microseconds ZGEMM: average time per contraction: 1053.49 microseconds

Is this to be expected, or am I doing something wrong? I was building with g++ 5 o and comparing to zgemm implementation from openblas.

thanks!

Below, mat_mat_prod is a wrapper around zgemm;

include

include "lib/linalg/blasroutines.hpp"

typedef double Real; typedef complex Complex;

using std::cout; using std::endl; using std::vector;

int main(int argc, char** argv){

int d1=200,d2=200,d3=200; vector < tblis::label_type > la={1,2}; vector < tblis::label_type > lb={2,3}; vector < tblis::label_type > lc={1,3}; tblis::tensor < Complex> A({d1,d2}); tblis::tensor < Complex> B({d2,d3}); tblis::tensor < Complex > C({d1,d3}); Complex beta=0.0; Complex alpha=1.0; auto mA=new Complex [d1d2]; auto mB=new Complex [d2d3]; auto mC=new Complex [d1d3]; int Nmax=1000; vector time(Nmax); vector time2(Nmax); cout<<"doing TBLIS contraction"<<endl; for(uint n=0;n<Nmax;n++){ auto time0 = std::chrono::high_resolution_clock::now(); tblis::mult(alpha,A,la.data(),B,lb.data(),beta,C,lc.data()); auto time1 = std::chrono::high_resolution_clock::now(); time[n]=std::chrono::duration_cast(time1-time0).count(); } cout<<"doing ZGEMM contraction"<<endl; for(uint n=0;n<Nmax;n++){ auto time0 = std::chrono::high_resolution_clock::now(); blasroutines::mat_mat_prod(alpha,mA ,d1,d2,mB,d2,d3,mC,'N','N'); auto time1 = std::chrono::high_resolution_clock::now(); time2[n]=std::chrono::duration_cast(time1-time0).count(); } cout<< "TBLIS: fastest of "<<Nmax<<" contractions: "<<std::min_element(time.begin(),time.end()) << " microseconds\n"; cout<< "TBLIS: average time per contraction: "<<std::accumulate(time.begin(),time.end(),0.0)1.0/Nmax << " microseconds\n"; cout<< "ZGEMM: fastest of "<<Nmax<<" contractions: "<<std::min_element(time2.begin(),time2.end()) << " microseconds\n"; cout<< "ZGEMM: average time per contraction: "<<std::accumulate(time2.begin(),time2.end(),0.0)*1.0/Nmax << " microseconds\n";

delete[] mA; delete[] mB; delete[] mC; return 0; }

devinamatthews / tblis

Complex Tensor Contractions are unusually slow #18

include

include

include "lib/linalg/blasroutines.hpp"