cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

Performance of Simple Program #32

Closed 0tt3r closed 6 years ago

0tt3r commented 7 years ago

In the past, I've used MATLAB to do some tensor calculations. My tensors are starting to get too big for single-node memory, so I heard about CTF and decided to try it. I wrote a very simple piece of code that performs one of my tensor operations, starting with random data just to gauge the performance and memory usage. The full code is below. The important part is the operation line:

A["ijkl"] = L["ik"]*I["ik"]*O["ikl"]*M["lj"];

My MATLAB program does this in ~1s (using bsxfun). During that short time, it uses multiple threads (maybe 2-4). The CTF code below takes ~45s using mpiexec -np 4 ./simple_test with gcc, and about the same with the Intel compiler and MKL. I used the same flags that were auto-generated by the ./configure script. I suspect that I haven't written optimal CTF code. Do you have any suggestions about how to optimize it?

#include <ctf.hpp>
using namespace CTF;

int main(int argc, char ** argv){
  int np, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  World dw(argc, argv);

  int n_theta = 968, n_w = 7740, n_e = 3, n_E = 16;

  Matrix<double> L(n_theta,n_w,dw);
  Matrix<double> I(n_theta,n_w,dw);
  Matrix<double> M(n_e,n_E,dw);
  int shapeLIO[] = {NS, NS, NS};
  int sizeLIO[] = {n_theta, n_w, n_e};
  Tensor<double> O(3, sizeLIO, shapeLIO, dw);

  int shapeA2[] = {NS, NS, NS, NS};
  int sizeA2[] = {n_theta, n_E, n_w, n_e};
  Tensor<double> A(4, sizeA2, shapeA2, dw);

  srand48(dw.rank);
  O.fill_random(0.0,1.0);
  I.fill_random(0.0,1.0);
  M.fill_random(0.0,1.0);
  L.fill_random(0.0,1.0);

  A["ijkl"] = L["ik"]*I["ik"]*O["ikl"]*M["lj"];

  MPI_Finalize();
}
solomonik commented 7 years ago

Hi 0tt3r,

Indeed, the operations you are performing have poor absolute performance in CTF. When an index appears in the operands as well as the output, as here, CTF falls back to a naive, unoptimized sequential kernel (whose generality actually makes it slower than even a simple unoptimized loop nest). This is a known limitation that we hope to improve upon in the near future.

Another problem is that CTF is currently not good at factorizing expressions like A*B*C*D. Most codes that achieve high performance using CTF are written as sequences of binary operations, in the style

C += A*B
C = D

ideally dominated by contractions that can be mapped to matrix multiplication after transposition. When the operations are pure tensor contractions (each index appears in exactly two tensors), much higher sequential performance is achieved by leveraging BLAS.

Modifying your code in the below fashion gives 4x better performance on my machine:

L["ik"] *= I["ik"];
O["ikl"] *= L["ik"];
A["ijkl"] = O["ikl"]*M["lj"];

Of course, writing code in this way is cumbersome and generally requires defining auxiliary intermediates.

The only positive news I have is that CTF is at least expected to scale well for such code on a distributed memory system. On my desktop, execution time goes from 12.8 to 3.2 sec going from 1 process to 4. After that there is much less return for increasing the number of processes (despite the presence of 12 cores), likely due to memory bandwidth saturation.

I'll leave this issue open as it highlights two of the main performance deficiencies of CTF currently.

0tt3r commented 7 years ago

Hi solomonik,

Thank you for the quick reply! Rewriting the kernel that way greatly improves performance, which is great to hear. I was also seeing poor parallel performance with the original code, and now I know why.

The other thing I'd like to do is leverage sparsity. L is ~99% sparse (only ~1% of its entries are nonzero), but O, I, and M are all dense. In some slides you presented, you showed decent speedup at this level of sparsity. I know there is an open issue involving sparsity, and I think my case falls into the affected area. I'd like to write:

L.sparsify(0.99);
L["ik"] *= I["ik"];
O["ikl"] *= L["ik"];
O.sparsify();
A["ijkl"] = O["ikl"]*M["lj"];

Is this possible using custom functions, as you mention elsewhere? Would I see any speedup if I did that?

solomonik commented 7 years ago

Unfortunately, this won't work until issue #23 is resolved.

0tt3r commented 7 years ago

That's what I expected. Thank you for your time!

solomonik commented 6 years ago

Cyclops v1.5.0 will have integration with MKL batched GEMM, as well as better naive kernels for plain Hadamard products. The contractions you mentioned should actually work just fine. I think I previously assumed one of A, O, or M was sparse, so that the last contraction involved sparse tensors. I hope to add naive support for that soon as well, by converting to a higher-order sparse tensor; e.g., a pointwise product of a sparse vector and a dense vector can be written as a product of a sparse matrix and a dense vector.