Jutho / TensorOperations.jl

Julia package for tensor contractions and related operations
https://jutho.github.io/TensorOperations.jl/stable/

Is TensorOperations able to take advantage of symmetry in the output? #142

Closed · ejmeitz closed 11 months ago

ejmeitz commented 11 months ago

I have a rather large tensor contraction that I'm running on the GPU, and I know that the output should be a symmetric tensor (permuting the indices does not affect the value). I'd like to save GPU RAM by only calculating the unique values, but I don't think that is possible in the current interface. Is that something that could be added, or that I could hack my way around in the current version?

For the record, this library is a life saver in terms of speeding up my code.

Jutho commented 11 months ago

Taking index permutation symmetries into account is a hard challenge. Already for symmetric matrices, BLAS support is very specific; for example, I don't think there is a gain for multiplying a symmetric matrix with a vector. Contracting tensors with particular index symmetries in a way that maximally exploits those symmetries (both computation-wise and memory-wise) is a very challenging problem, and not one that comes up very often in my work.

However, it might be nice to contemplate a dedicated type for symmetric tensors, and to specialise the operations that preserve them and can benefit from the structure.

ejmeitz commented 11 months ago

Yeah, I have not seen many specific methods for symmetric tensors. My main issue is the RAM I am going to end up using, and the symmetry was the most obvious way to maybe lower that usage.

I don't have great intuition for tensor contractions: is there some way to break up something like the contraction below so that I do not have to fully allocate K3 (or Psi) on the GPU? I think I always need all of Psi on the GPU, but I could chunk K3 and phi.

K3[n,m,l] = Ψ[i,j,k] * phi[i,n] * phi[j,m] * phi[k,l]
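(For reference, a minimal sketch of that contraction written with the `@tensor` macro; the sizes below are made up, just to keep the snippet self-contained.)

```julia
using TensorOperations

# Hypothetical sizes, just to make the sketch runnable.
N = 64
Ψ = rand(N, N, N)    # symmetric under index permutations in the real problem
phi = rand(N, N)

# K3[n,m,l] = Ψ[i,j,k] * phi[i,n] * phi[j,m] * phi[k,l]
@tensor K3[n, m, l] := Ψ[i, j, k] * phi[i, n] * phi[j, m] * phi[k, l]
```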

Edit: I'm testing this now, but does your SparseArrayKit library let me run on the GPU? Because Psi is technically sparse.

Jutho commented 11 months ago

Even the simpler rank-two case, K2[n,m] = Ψ[i,j] * phi[i,n] * phi[j,m], which amounts to K2 = transpose(phi) * psi * phi as a matrix multiplication, with psi (and thus K2) symmetric, is not trivial, and there is no go-to BLAS approach for it (see e.g. this Stack Exchange question: https://scicomp.stackexchange.com/questions/36105/efficient-change-of-basis-real-positive-definite-symmetric-matrix).
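(For concreteness, a small sketch of that equivalence, with made-up dimensions:)

```julia
using TensorOperations, LinearAlgebra

N = 32
A = rand(N, N)
Ψ = A + transpose(A)   # a symmetric Ψ for illustration
phi = rand(N, N)

# K2[n,m] = Ψ[i,j] * phi[i,n] * phi[j,m]
@tensor K2[n, m] := Ψ[i, j] * phi[i, n] * phi[j, m]

# The same contraction as a change of basis via plain matrix multiplication.
K2_mat = transpose(phi) * Ψ * phi

K2 ≈ K2_mat   # true
```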

One thing you could do is slice K3, e.g. divide every dimension in two, and figure out which blocks you do not need to calculate; i.e., you would only need to calculate the blocks (1,1,1), (1,1,2), (1,2,2) and (2,2,2) instead of all 8 possible blocks, so that's already a saving of a factor of two. You could then also slice Psi in a similar way and work out by hand which steps in your calculation can be recycled.
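(A rough sketch of that blocking idea, assuming Ψ, and hence K3, is fully symmetric and each dimension splits evenly in two; it uses plain CPU arrays and made-up names and sizes, just to illustrate the bookkeeping.)

```julia
using TensorOperations

N = 64                    # hypothetical even dimension
h = N ÷ 2
ranges = (1:h, h+1:N)     # the two slices per index

# Build a fully symmetric Ψ for illustration.
A = rand(N, N, N)
Ψ = sum(permutedims(A, p) for p in
        ((1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)))
phi = rand(N, N)

K3 = zeros(N, N, N)

# Only compute the blocks with non-decreasing block indices:
# (1,1,1), (1,1,2), (1,2,2), (2,2,2) — 4 instead of all 8.
for a in 1:2, b in a:2, c in b:2
    rn, rm, rl = ranges[a], ranges[b], ranges[c]
    phin, phim, phil = phi[:, rn], phi[:, rm], phi[:, rl]
    @tensor block[n, m, l] := Ψ[i, j, k] * phin[i, n] * phim[j, m] * phil[k, l]
    # Copy the block into every position related to it by the symmetry
    # (writes overlap harmlessly when block indices coincide).
    K3[rn, rm, rl] .= block
    K3[rn, rl, rm] .= permutedims(block, (1, 3, 2))
    K3[rm, rn, rl] .= permutedims(block, (2, 1, 3))
    K3[rm, rl, rn] .= permutedims(block, (2, 3, 1))
    K3[rl, rn, rm] .= permutedims(block, (3, 1, 2))
    K3[rl, rm, rn] .= permutedims(block, (3, 2, 1))
end
```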

For your final question: SparseArrayKit is meant for very sparse arrays (and indeed does not support the GPU, nor would it benefit from it). It is not meant to deal with structured sparseness of any kind.

ejmeitz commented 11 months ago

Cool, I will try slicing the arrays up.

Yeah, I learned that the hard way trying to write a CUDA kernel for this (it is 99.9% sparse), and your library was still 1000x faster.

Jutho commented 11 months ago

Oh you mean that it is actually very sparse on top of being symmetric? I misunderstood the question then.

Have you tried using SparseArrayKit then, just without GPU?

ejmeitz commented 11 months ago

You didn't misunderstand; I just didn't ask about that in the original post.

I have tried SparseArrayKit; it is much slower than just using the GPU version and ignoring the sparsity. I never really took the time to figure out why; I assumed it was because the phi matrix is still dense even if Psi is sparse. The GPU version takes ~50 ms, but I never let the CPU version finish because it was taking so much longer. I might just be doing something wrong in my invocation of the library, though.

I just started re-running the CPU sparse version, and it looks like significant parts of the computation are stuck on a single thread.

Jutho commented 11 months ago

Ok, that sounds normal. SparseArrayKit is only good if all tensors are sparse. It doesn't have any specialised methods for sparse * dense.

ejmeitz commented 11 months ago

So I got the blocked implementation working, no problem. The two tensors end up being slightly different, and all of the differences are a factor of 1/8. This is small relative to the size of the data in the test (~10^6), but it just seemed kind of weird that they would not be exactly the same given the same inputs. I'll have to test with other inputs to see if the pattern persists; I just found it super weird that the difference is always a factor of 1/8 and not just some random noise.

The blocked implementation is actually kind of easy, and I think it does end up saving time and resources. I'm not sure how this could be integrated into TensorOperations, but it might be useful.