Thanks for asking about this. The slow performance is not too surprising. When the contraction is specified this way, CTF will first compute a partial sum of the tensor over the j index, which yields a sparse output matrix and is slow due to the associated sparse data management. Instead, I would recommend using the CTF MTTKRP kernel, multiplying over the j index with a vector full of 1s. The general sparse contraction/summation kernels are orders of magnitude slower than the multi-tensor MTTKRP kernel (see https://arxiv.org/pdf/2106.07135.pdf for details on this). There are also other multi-tensor contraction kernels present in CTF, and a general one is currently in development. These currently need to be invoked manually; when contractions are specified with Einsum syntax, pairwise contractions will be performed, and performance can suffer, e.g., if some intermediate computed quantities are sparse.
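In einsum-style CTF code, that pairwise evaluation corresponds roughly to the following sketch (`tmp` is a hypothetical intermediate; `ni` and `nk` stand for the i and k dimensions; `B`, `c`, `a`, and `dw` follow the snippets later in the thread):

```cpp
// Rough sketch of how a["i"] = B["ijk"] * c["k"] is evaluated pairwise.
int tlens[] = {ni, nk};
Tensor<double> tmp(2, true /* is_sparse */, tlens, dw);
tmp["ik"] = B["ijk"];        // partial sum over j, producing a sparse matrix
a["i"] = tmp["ik"] * c["k"]; // then a sparse matrix-vector product
```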
Thanks, Edgar! Can you point me to the docs or the header files that have these kernels declared? I'm not sure how to go about finding them (a search on the doc site came up blank).
Sorry, yeah, documentation of this may be incomplete. There is some, albeit Python-oriented, introduction to this in the tutorial:
https://github.com/cyclops-community/ctf/wiki/Tensor-introduction#multi-tensor-contraction
The currently available multi-tensor routines are mainly just MTTKRP and TTTP (a generalization of SDDMM). I think @raghavendrak may also have an implementation of TTMc for third-order tensors on another fork. This is the C++ interface file for them:
https://github.com/cyclops-community/ctf/blob/master/src/interface/multilinear.h
I'm sorry Edgar, can you give me a little more guidance about how to use the MTTKRP kernel in multilinear.h to compute the kernel I'm looking at (`a["i"] = B["ijk"] * c["k"]`)? I can't understand from the comments how this works (it seems like if B is 3-dimensional, then the other operands need to be two-dimensional).
MTTKRP performs `A["ir"] = B["ijk"]*C["jr"]*D["kr"]` (that expression should also be interpretable as a command by CTF in C++), so if A, C, and D each have a single column (an r dimension of size 1) and `C["jr"]=1`, you would recover the operation you are interested in as a special case. I think the mat_list interface will accept vectors too and automatically do this (if not, you can call the reshape function).
Thanks for your help, Edgar! I think that I've implemented this as you described:
```cpp
int x = 12092;
int y = 9184;
int z = 28818;
int lens[] = {x, y, z};
Tensor<double> B(3, true /* is_sparse */, lens, dw);
Vector<double> a(x, dw); // output, over the i index
Vector<double> c(y, dw); // mode-1 (j) factor
Vector<double> d(z, dw); // mode-2 (k) factor
// Fill the factors with ones so the MTTKRP reduces to the desired SpTTV.
c.fill_random(1.0, 1.0);
d.fill_random(1.0, 1.0);
a.fill_random(0.0, 0.0);
Tensor<double>* mats[] = {&a, &c, &d};
MTTKRP(&B, mats, 0, false);
```
I was unsure about the last argument, or where to put the output tensor, but it seemed to work out? Using this specialized kernel removed a good portion of the overhead, bringing it from 2 seconds to 90 ms (vs. 10 ms hand-coded).
The output tensor is the 0th argument since you passed 0 to the kernel, so that looks fine (I assume B is filled with nonzeros elsewhere). We recorded 2-4X overhead relative to SPLATT's MTTKRP, which uses CSF. The current implementation uses a less compact default sparse tensor format than CSF (we are working on integrating CSF into this and other kernels) and is minimally optimized.
One last question. If I understand correctly, the above approach isn't going to work out if I want to run a slightly different SpTTV kernel: `a["ij"] = B["ijk"] * c["k"]`?
Yes, for that, just performing the CTF contraction as usual would be advisable. Choosing A to be sparse or dense will lead to different performance, depending on the sparsity of B. The multi-tensor kernels we've introduced so far are driven by common applications in tensor decomposition methods.
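A sketch of both variants through the general interface (reusing x, y, z, B, and dw from the snippet above):

```cpp
// a["ij"] = B["ijk"] * c["k"] via the general CTF contraction interface.
Vector<double> ck(z, dw); // factor over the k index
ck.fill_random(1.0, 1.0);
Matrix<double> A(x, y, dw); // dense output
A["ij"] = B["ijk"] * ck["k"];

// Sparse-output variant; which is faster depends on the sparsity of B.
int alens[] = {x, y};
Tensor<double> A_sp(2, true /* is_sparse */, alens, dw);
A_sp["ij"] = B["ijk"] * ck["k"];
```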
Hi again, Edgar. I'm trying to use the MTTKRP function to actually compute an MTTKRP. Here's my code, which should be evaluating the contraction A[i, l] = B[i, j, k] * C[j, l] * D[k, l].
```cpp
void mttkrp(int nIter, int warmup, std::string filename, std::vector<int> dims, World& dw, int ldim) {
  Tensor<double> B(3, true /* is_sparse */, dims.data(), dw);
  Matrix<double> A(dims[0], ldim, dw);
  Matrix<double> C(dims[1], ldim, dw);
  Matrix<double> D(dims[2], ldim, dw);
  C.fill_random(1.0, 1.0);
  D.fill_random(1.0, 1.0);
  B.read_sparse_from_file(filename.c_str());
  if (dw.rank == 0) {
    std::cout << "Read " << B.nnz_tot << " non-zero entries from the file." << std::endl;
  }
  Tensor<double>* mats[] = {&A, &C, &D};
  // benchmarkWithWarmup is a local timing helper (not part of CTF).
  auto avgMs = benchmarkWithWarmup(warmup, nIter, [&]() {
    MTTKRP(&B, mats, 0, false);
  });
  if (dw.rank == 0) {
    std::cout << "Average execution time: " << avgMs << " ms." << std::endl;
  }
}
```
This code segfaults with the following backtrace:
```
(gdb) bt
#0  0x00000000100302a4 in CTF_int::handler() ()
#1  0x00000000100190b4 in CTF::Semiring<double, true>::MTTKRP(int, long*, int*, long, long, int, bool, CTF::Pair<double> const*, double const* const*, double*) ()
#2  0x0000000010019a1c in void CTF::MTTKRP<double>(CTF::Tensor<double>*, CTF::Tensor<double>**, int, bool) ()
#3  0x000000001000ebf8 in std::_Function_handler<void (), mttkrp(int, int, std::string, std::vector<int, std::allocator<int> >, CTF::World&, int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#4  0x000000001000e740 in mttkrp(int, int, std::string, std::vector<int, std::allocator<int> >, CTF::World&, int) ()
#5  0x0000000010008a98 in main ()
```
Is there anything here that looks incorrect to you? This code was adapted from the working version you helped me with above.
Somewhat related, I was looking at the TTTP implementation, trying to implement SDDMM. Based on the comment at least, I'm not sure how to use the handwritten function to implement a contraction A[i, k] = B[i, k] * C[i, j] * D[j, k]. I don't understand how to map the indices in the comment `For ndim=3 and mat_list=[X,Y,Z], this operation is equivalent to einsum("ijk,ia,ja,ka->ijk",A,X,Y,Z)` to something like this.
I think the segfault is because you are using false for `aux_mode_first` (currently, there is no implementation available for `aux_mode_first = false`).
Sample code for MTTKRP to achieve `ijk,ja,ka->ia`:
```cpp
// Store the factor matrices with the auxiliary (rank) index first.
Tensor<dtype> UT(2, false, lens, dw);
UT["ji"] = U["ij"];
Tensor<dtype> VT(2, false, lens, dw);
VT["ji"] = V["ij"];
// Output factor, declared in the same transposed layout.
Tensor<dtype> UC(2, false, lens, dw);
Tensor<dtype> * mlist[3] = {&UC, &UT, &VT};
MTTKRP<dtype>(&T, mlist, 0, true); // aux_mode_first = true
```
In my comment above (https://github.com/cyclops-community/ctf/issues/134#issuecomment-993948156), I used aux_mode_first = false, and that seemed to work in a different situation.
That seems to be the einsum notation expression I want, but I'm unsure why you are transposing the matrices.
In the comment you mentioned, the factors are vectors. Since there is only one index, `aux_mode_first` false or true will be the same.
I think you are seeing the segfault from this:
https://github.com/cyclops-community/ctf/blob/c4f89dcfc6fce48b6d8da45204c9c309f3900a8f/src/interface/semiring.h#L1259
`aux_mode_first = true` expects the auxiliary index (in the contraction) to show up first.
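Applied to the segfaulting mttkrp function above, a sketch of a fix along these lines (assuming, as the sample suggests, that the output factor also uses the aux-first layout):

```cpp
// Store the factors with the auxiliary index l first, then pass aux_mode_first = true.
Matrix<double> AT(ldim, dims[0], dw); // output, in aux-first layout
Matrix<double> CT(ldim, dims[1], dw);
Matrix<double> DT(ldim, dims[2], dw);
CT["li"] = C["il"];
DT["li"] = D["il"];
Tensor<double>* tmats[] = {&AT, &CT, &DT};
MTTKRP(&B, tmats, 0, true);
A["il"] = AT["li"]; // transpose the result back if the standard layout is needed
```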
Thanks, @raghavendrak! Following up, do you have an explanation of why you transposed the matrices in your example?
Also, wondering if you have an answer to my question about TTTP/SDDMM.
@rohany I think Raghavendra had the transposes to compute the MTTKRP in the form you are looking for, i.e., sum_{ij} T_{ijk} U_{ia} V_{ja}, since our routine evidently can currently only do sum_{ij} T_{ijk} U_{ai} V_{aj} (U and V transposed). For SDDMM, you should use TTTP with an order-2 sparse tensor, performing the einsum("ij,ia,ja->ij",A,X,Y) to do the SDDMM Hadamard_product(A, XY^T). Let me know if all this makes sense.
Thanks, the MTTKRP transpose portion makes sense to me.
I'm still confused about SDDMM -- the einsum you wrote there is not isomorphic to what I want to compute: A[i, k] = B[i, k] * C[i, j] * D[j, k].
@rohany I think it is: you have "ik,ij,jk->ik", which is my einsum modulo transposition and the substitutions j->k, a->j.
If I'm reading this correctly, you wrote einsum("ij,ia,ja->ij",A,X,Y), which translated to my notation is A[i, j] = A[i, j] * X[i, a] * Y[i, a], while my desired computation is A[i, k] = B[i, k] * C[i, j] * D[j, k]. This is not isomorphic, as the operation X[i, a] * Y[i, a] does not form a matrix multiplication between X and Y.
einsum("ij,ia,ja->ij",A,X,Y)
is A[i, j] = A[i, j] * X[i, a] * Y[j, a]
not A[i, j] = A[i, j] * X[i, a] * Y[i, a]
Yes, sorry. A[i, j] = A[i, j] * X[i, a] * Y[j, a] is still not correct, right? X[i, a] * Y[j, a] is not a matrix multiplication -- it should be X[i, a] * Y[a, j], right?
I think X[i, a] * Y[j, a] is a matrix multiplication (at least modulo transposition, as I mentioned in my post), i.e., it is XY^T. The CTF interface assumes all matrices (X, Y) are oriented in the same way, which makes much more sense when generalizing to tensors.
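Spelled out with the general einsum-style interface, the target SDDMM is then the following sketch (m, n, and r are assumed dimensions; the TTTP kernel should compute the same result while exploiting the sparsity of A directly):

```cpp
// SDDMM sketch: S["ij"] = A["ij"] * X["ia"] * Y["ja"],
// i.e., the Hadamard product of A with X*Y^T on A's sparsity pattern.
int slens[] = {m, n};
Tensor<double> A(2, true /* is_sparse */, slens, dw);
Matrix<double> X(m, r, dw);
Matrix<double> Y(n, r, dw);
Tensor<double> S(2, true /* is_sparse */, slens, dw);
S["ij"] = A["ij"] * X["ia"] * Y["ja"]; // evaluated pairwise by CTF
```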
That makes sense to me now. I will transpose my Y matrix to fit the interface.
Thanks for all the help! Code seems to be up and running now, so I'll close the issue.
Hi @solomonik, I was looking at this again:
> The general sparse contraction/summation kernels are orders of magnitude slower than the multi-tensor MTTKRP kernel (see https://arxiv.org/pdf/2106.07135.pdf for details on this).
I was scanning the paper you linked, but I'm not finding discussion of distributed sparse MTTKRP in that paper -- is it the right link?
It's not, sorry about that; here is the right link: https://arxiv.org/pdf/1910.02371.pdf
I'm looking at running some higher-order tensor contraction experiments in CTF, and want to know whether the performance I'm seeing is reasonable, or whether I'm using the library incorrectly. In particular, I'm running a modified SpTTV kernel (`a["i"] = B["ijk"] * c["k"]`), where B is a sparse tensor. On the nell-2 tensor, CTF takes around 2 seconds to complete the operation when running with one rank per core on a 40-core IBM POWER9. In comparison, a multi-threaded implementation generated by the TACO compiler takes around 10-15 milliseconds to run. I understand that this isn't the fairest comparison, as CTF has support for scaling to multiple nodes, but I don't know if such a performance difference is expected from a fast single-node implementation. However, I expect that CTF's approach of transforming the data into CSR matrices to use pre-written kernels will lead to some slowdown compared to a compilation-based approach. The code I'm using to generate the CTF numbers is below:
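A minimal sketch of the contraction being benchmarked (the nell-2 dimensions, file loading, and world dw are assumed as in the snippets above, not the exact original listing):

```cpp
// Sketch of the benchmarked SpTTV, a["i"] = B["ijk"] * c["k"], via the
// general CTF contraction interface (dims and filename are assumptions).
int lens[] = {12092, 9184, 28818};
Tensor<double> B(3, true /* is_sparse */, lens, dw);
B.read_sparse_from_file(filename.c_str());
Vector<double> a(lens[0], dw);
Vector<double> c(lens[2], dw);
c.fill_random(1.0, 1.0);
a["i"] = B["ijk"] * c["k"];
```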