cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

understanding performance overheads in CTF #134

Closed rohany closed 2 years ago

rohany commented 2 years ago

I'm looking at running some higher-order tensor contraction experiments in CTF, and want to know whether the performance I'm seeing is reasonable or whether I'm using the library incorrectly. In particular, I'm running a modified SpTTV kernel (a["i"] = B["ijk"] * c["k"]), where B is a sparse tensor. On the nell-2 tensor, CTF takes around 2 seconds to complete the operation when running with one rank per core on a 40-core IBM POWER9. In comparison, a multi-threaded implementation generated by the TACO compiler takes around 10-15 milliseconds. I understand this isn't the fairest comparison, since CTF supports scaling to multiple nodes, and I expect that CTF's approach of transforming the data into CSR matrices to use pre-written kernels will incur some slowdown compared to a compilation-based approach; still, I don't know whether a performance difference of this size is expected relative to a fast single-node implementation.

The code I'm using to generate the CTF numbers is below:

#include <ctf.hpp>
#include <chrono>
#include <cstdlib>  // atoi
#include <cstring>  // strcmp
#include <iostream>
#include <string>
using namespace CTF;

void spttv(int nIter, int warmup, std::string filename, World& dw) {
  int x = 12092;
  int y = 9184;
  int z = 28818;
  int lens[] = {x, y, z};
  Tensor<double> B(3, true /* is_sparse */, lens, dw);
  Vector<double> a(x, dw);
  Vector<double> c(z, dw);
  c.fill_random(1.0, 1.0); // fill with all 1s
  a.fill_random(0.0, 0.0); // zero-initialize the output

  auto compute = [&]() {
    a["i"] = B["ijk"] * c["k"];
  };

  B.read_sparse_from_file(filename.c_str());
  if (dw.rank == 0) {
    std::cout << "Read " << B.nnz_tot << " non-zero entries from the file." << std::endl;
  }

  for (int i = 0; i < warmup; i++) { compute(); }
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; i++) { compute(); }
  auto end = std::chrono::high_resolution_clock::now();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  if (dw.rank == 0) {
    std::cout << "Average execution time: " << (double(ms) / double(nIter)) << " ms." << std::endl;
  }
}

int main(int argc, char** argv) {
  int nIter = -1, warmup = -1;
  std::string filename;
  for (int i = 1; i < argc; i++) {
    if (strcmp(argv[i], "-n") == 0) {
      nIter = atoi(argv[++i]);
      continue;
    }
    if (strcmp(argv[i], "-warmup") == 0) {
      warmup = atoi(argv[++i]);
      continue;
    }
    if (strcmp(argv[i], "-tensor") == 0) {
      filename = std::string(argv[++i]);
      continue;
    }
  }

  if (nIter == -1 || warmup == -1 || filename.empty()) {
    std::cout << "provide all inputs." << std::endl;
    return -1;
  }

  MPI_Init(&argc, &argv);
  int np;
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  {
    World dw;
    spttv(nIter, warmup, filename, dw);
  }
  MPI_Finalize();
  return 0;
}

solomonik commented 2 years ago

Thanks for asking about this. The slow performance is not too surprising. When the contraction is specified this way, CTF will first compute a partial sum of the tensor over the j index, which yields a sparse output matrix and is slow due to the associated sparse data management. Instead, I would recommend using the CTF MTTKRP kernel, multiplying over the j index with a vector full of 1s. The general sparse contraction/summation kernels are orders of magnitude slower than the multi-tensor MTTKRP kernel (see https://arxiv.org/pdf/2106.07135.pdf for details). There are also other multi-tensor contraction kernels in CTF, and a general one is currently in development. These currently need to be invoked manually; when contractions are specified with einsum syntax, pairwise contractions are performed, and performance can suffer, e.g., if some intermediate quantities are sparse.
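
Schematically, the pairwise decomposition looks like this (a sketch of what happens internally, not code you need to write; tmp and tmp_lens are hypothetical names, reusing the dimensions from your program):

  int tmp_lens[] = {x, z};
  Tensor<double> tmp(2, true /* sparse intermediate */, tmp_lens, dw);
  tmp["ik"] = B["ijk"];         // partial sum over j yields a sparse matrix (the slow part)
  a["i"] = tmp["ik"] * c["k"];  // then a sparse matrix-vector product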

rohany commented 2 years ago

Thanks, Edgar! Can you point me to the docs or the header files where these kernels are declared? I'm not sure how to go about finding them (a search on the doc site came up blank).

solomonik commented 2 years ago

Sorry, yes, the documentation of this may be incomplete. There is an introduction to this (albeit Python-oriented) in the tutorial:

https://github.com/cyclops-community/ctf/wiki/Tensor-introduction#multi-tensor-contraction

The currently available multi-tensor routines are mainly just MTTKRP and TTTP (a generalization of SDDMM). I think @raghavendrak may also have an implementation of TTMc for third-order tensors on another fork. This is the C++ interface file for them:

https://github.com/cyclops-community/ctf/blob/master/src/interface/multilinear.h

rohany commented 2 years ago

I'm sorry, Edgar, can you give me a little more guidance on how to use the MTTKRP kernel in multilinear.h to compute the kernel I'm looking at (a["i"] = B["ijk"] * c["k"])? I can't tell from the comments how this works (it seems that if B is 3-dimensional, then the other operands need to be two-dimensional).

solomonik commented 2 years ago

MTTKRP performs A["ir"] = B["ijk"]*C["jr"]*D["kr"], so if A, C, D each have a single column (r of extent 1) and C["jr"] = 1 (that should also be interpretable as a command by CTF in C++), you would recover the operation you are interested in as a special case. I think the mat_list interface will accept vectors too and automatically do this (if not, you can call the reshape function).
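
In formulas, mode-0 MTTKRP computes A[i,r] = sum_{jk} B[i,j,k] * C[j,r] * D[k,r]; with r of extent 1 and C[j,1] = 1 for all j, this reduces to a[i] = sum_{jk} B[i,j,k] * c[k], which is exactly the SpTTV above.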

rohany commented 2 years ago

Thanks for your help, Edgar! I think I've implemented this as you described:

  int x = 12092;
  int y = 9184;
  int z = 28818;
  int lens[] = {x, y, z};
  Tensor<double> B(3, true /* is_sparse */, lens, dw);
  Vector<double> a(x, dw); // output, mode i
  Vector<double> c(y, dw); // all-ones factor for the summed-out mode j
  Vector<double> d(z, dw); // plays the role of c["k"] in the original kernel
  c.fill_random(1.0, 1.0);
  d.fill_random(1.0, 1.0);
  a.fill_random(0.0, 0.0);
  Tensor<double>* mats[] = {&a, &c, &d};
  MTTKRP(&B, mats, 0, false);

I was unsure about the last argument, and about where to put the output tensor, but it seemed to work out?

Using this specialized kernel removed a good portion of the overhead, bringing the time from 2 seconds down to about 90 ms (vs. 10 ms hand-coded).

solomonik commented 2 years ago

The output tensor is the 0th element of mat_list since you passed mode 0 to the kernel, so that looks fine (I assume B is filled with nonzeros elsewhere). We recorded 2-4X overhead relative to SPLATT's MTTKRP, which uses CSF. The current implementation uses a less compact default sparse tensor format than CSF (we are working on integrating CSF into this and other kernels) and is minimally optimized.

rohany commented 2 years ago

One last question. If I understand correctly, the above approach isn't going to work out if I want to run a slightly different SpTTV kernel: a["ij"] = B["ijk"] * c["k"]?

solomonik commented 2 years ago

Yes, for that, just performing the CTF contraction as usual would be advisable. Choosing the output A to be sparse or dense will lead to different performance, depending on the sparsity of B. The multi-tensor kernels we've introduced so far are driven by common applications in tensor decomposition methods.
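
For example (a sketch; A is a new output matrix and xy_lens is a hypothetical name):

  Matrix<double> A(x, y, dw); // dense output
  // or, for a sparse output:
  // int xy_lens[] = {x, y};
  // Tensor<double> A(2, true /* is_sparse */, xy_lens, dw);
  A["ij"] = B["ijk"] * c["k"];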

rohany commented 2 years ago

Hi again, Edgar. I'm trying to use the MTTKRP function to actually compute an MTTKRP. Here's my code, which should evaluate the contraction A[i, l] = B[i, j, k] * C[j, l] * D[k, l].

void mttkrp(int nIter, int warmup, std::string filename, std::vector<int> dims, World& dw, int ldim) {
  Tensor<double> B(3, true /* is_sparse */, dims.data(), dw);
  Matrix<double> A(dims[0], ldim, dw);
  Matrix<double> C(dims[1], ldim, dw);
  Matrix<double> D(dims[2], ldim, dw);
  C.fill_random(1.0, 1.0);
  D.fill_random(1.0, 1.0);

  B.read_sparse_from_file(filename.c_str());
  if (dw.rank == 0) {
    std::cout << "Read " << B.nnz_tot << " non-zero entries from the file." << std::endl;
  }

  Tensor<double>* mats[] = {&A, &C, &D};
  auto avgMs = benchmarkWithWarmup(warmup, nIter, [&]() {
    MTTKRP(&B, mats, 0, false);
  });

  if (dw.rank == 0) {
    std::cout << "Average execution time: " << avgMs << " ms." << std::endl;
  }
}
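
(benchmarkWithWarmup is a small timing helper not shown here; a minimal sketch consistent with the timing loop from my earlier program, using <chrono> and <functional>:)

double benchmarkWithWarmup(int warmup, int nIter, std::function<void()> f) {
  // Run untimed warmup iterations, then time nIter iterations and
  // return the average in milliseconds.
  for (int i = 0; i < warmup; i++) { f(); }
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; i++) { f(); }
  auto end = std::chrono::high_resolution_clock::now();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  return double(ms) / double(nIter);
}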

This code segfaults with the following backtrace:

(gdb) bt
#0  0x00000000100302a4 in CTF_int::handler() ()
#1  0x00000000100190b4 in CTF::Semiring<double, true>::MTTKRP(int, long*, int*, long, long, int, bool, CTF::Pair<double> const*, double const* const*, double*) ()
#2  0x0000000010019a1c in void CTF::MTTKRP<double>(CTF::Tensor<double>*, CTF::Tensor<double>**, int, bool) ()
#3  0x000000001000ebf8 in std::_Function_handler<void (), mttkrp(int, int, std::string, std::vector<int, std::allocator<int> >, CTF::World&, int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#4  0x000000001000e740 in mttkrp(int, int, std::string, std::vector<int, std::allocator<int> >, CTF::World&, int) ()
#5  0x0000000010008a98 in main ()

Is there anything here that looks incorrect to you? This code was adapted from the working version you helped me with above.

rohany commented 2 years ago

Somewhat related: I was looking at the TTTP implementation, trying to implement SDDMM. Based on the comment, at least, I'm not sure how to use the handwritten function to implement the contraction A[i, k] = B[i, k] * C[i, j] * D[j, k]. I don't understand how to map the indices in the comment, "For ndim=3 and mat_list=[X,Y,Z], this operation is equivalent to einsum("ijk,ia,ja,ka->ijk",A,X,Y,Z)", to something like this.

raghavendrak commented 2 years ago

I think the segfault is because you are using false for aux_mode_first (currently, there is no implementation available for aux_mode_first = false). Sample code for MTTKRP to achieve ijk,ja,ka->ia:

    // U and V are the existing factor matrices; UT and VT hold their transposes,
    // since aux_mode_first = true expects the rank (auxiliary) index first.
    Tensor<dtype> UT(2, false, lens, dw);
    UT["ji"] = U["ij"];
    Tensor<dtype> VT(2, false, lens, dw);
    VT["ji"] = V["ij"];
    Tensor<dtype> UC(2, false, lens, dw); // output factor
    Tensor<dtype> * mlist[3] = {&UC, &UT, &VT};
    MTTKRP<dtype>(&T, mlist, 0, true);
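
Adapted to the failing MTTKRP code above (a sketch; AT, CT, DT are hypothetical rank-index-first copies of A, C, D):

    // aux_mode_first = true wants the rank index first, so transpose all factors.
    Matrix<double> AT(ldim, dims[0], dw); // output, also rank-index-first
    Matrix<double> CT(ldim, dims[1], dw);
    CT["ai"] = C["ia"];
    Matrix<double> DT(ldim, dims[2], dw);
    DT["ai"] = D["ia"];
    Tensor<double>* matsT[] = {&AT, &CT, &DT};
    MTTKRP(&B, matsT, 0, true);
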
rohany commented 2 years ago

In my comment above (https://github.com/cyclops-community/ctf/issues/134#issuecomment-993948156), I used aux_mode_first = false, and that seemed to work in a different situation.

That seems to be the einsum expression I want, but I'm unsure why you are transposing the matrices.

raghavendrak commented 2 years ago

In the comment you mentioned, the factors are vectors; since each has only one index, aux_mode_first = false and true behave the same. I think you are seeing the segfault from this: https://github.com/cyclops-community/ctf/blob/c4f89dcfc6fce48b6d8da45204c9c309f3900a8f/src/interface/semiring.h#L1259. aux_mode_first = true expects the auxiliary index (in the contraction) to show up first, i.e., the factors are read with the rank index first (U[a,i], V[a,j]), which is why the sample transposes them.

rohany commented 2 years ago

Thanks @raghavendrak! Following up, do you have an explanation why you transposed the matrices in your example?

Also, I'm wondering if you have an answer to my question about TTTP/SDDMM.

solomonik commented 2 years ago

@rohany I think Raghavendra used the transposes to compute the MTTKRP in the form you are looking for, i.e., A[i,a] = sum_{jk} T[i,j,k] U[j,a] V[k,a], since our routine evidently can currently only handle the factors with the auxiliary index first, i.e., U and V transposed. For SDDMM, you should use TTTP with an order-2 sparse tensor, performing einsum("ij,ia,ja->ij",A,X,Y) to compute the SDDMM Hadamard_product(A, XY^T). Let me know if all this makes sense.

rohany commented 2 years ago

Thanks, the MTTKRP transpose portion makes sense to me.

I'm still confused about SDDMM -- the einsum you wrote there is not isomorphic to what I want to compute: A[i, k] = B[i, k] * C[i, j] * D[j, k].

solomonik commented 2 years ago

@rohany I think it is: you have "ik,ij,jk->ik", which is my einsum modulo transposition and the substitutions j->k, a->j.

rohany commented 2 years ago

If I'm reading this correctly, you wrote einsum("ij,ia,ja->ij",A,X,Y), which translated to my notation is A[i, j] = A[i, j] * X[i, a] * Y[i, a], while my desired computation is A[i, k] = B[i, k] * C[i, j] * D[j, k]. This is not isomorphic, as the operation X[i, a] * Y[i, a] does not form a matrix multiplication between X and Y.

solomonik commented 2 years ago

einsum("ij,ia,ja->ij",A,X,Y) is A[i, j] = A[i, j] * X[i, a] * Y[j, a] not A[i, j] = A[i, j] * X[i, a] * Y[i, a]

rohany commented 2 years ago

Yes, sorry. But A[i, j] = A[i, j] * X[i, a] * Y[j, a] is still not correct, right? X[i, a] * Y[j, a] is not a matrix multiplication; it should be X[i, a] * Y[a, j], right?

solomonik commented 2 years ago

I think X[i, a] * Y[j, a] is a matrix multiplication (at least modulo transposition, as I mentioned in my post), i.e., it is XY^T.
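
Index-wise: (XY^T)[i, j] = sum_a X[i, a] * (Y^T)[a, j] = sum_a X[i, a] * Y[j, a].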

solomonik commented 2 years ago

The CTF interface assumes all matrices (X, Y) are oriented in the same way, which makes much more sense when generalizing to tensors.
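
Putting it together, a sketch of the SDDMM call (assuming the TTTP declaration in multilinear.h, TTTP(T, num_ops, modes, mat_list, aux_mode_first); m, n, r, and the fill code are placeholders):

  int lens[] = {m, n};
  Tensor<double> A(2, true /* is_sparse */, lens, dw); // sparse sampling pattern
  Matrix<double> X(m, r, dw);                          // both factors oriented rows-by-rank,
  Matrix<double> Y(n, r, dw);                          // matching the "ia,ja" einsum
  // ... fill A, X, Y ...
  int modes[] = {0, 1};              // attach X to mode i of A, Y to mode j
  Tensor<double>* mats[] = {&X, &Y};
  // aux_mode_first = false matches the "ia,ja" orientation; if that path is
  // unimplemented (as for MTTKRP), pass rank-first transposes with true instead.
  TTTP(&A, 2, modes, mats, false);   // updates A's nonzero entries in place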

rohany commented 2 years ago

That makes sense to me now. I will transpose my Y matrix to fit the interface.

rohany commented 2 years ago

Thanks for all the help! Code seems to be up and running now, so I'll close the issue.

rohany commented 2 years ago

Hi @solomonik, I was looking at this again:

The general sparse contraction/summation kernels are orders of magnitude slower than the multi-tensor MTTKRP kernel (see https://arxiv.org/pdf/2106.07135.pdf for details on this).

I was scanning the paper you linked, but I can't find a discussion of distributed sparse MTTKRP in it. Is it the right link?

solomonik commented 2 years ago

It's not, sorry about that. Here is the right link: https://arxiv.org/pdf/1910.02371.pdf