Jutho / TensorOperations.jl

Julia package for tensor contractions and related operations
https://jutho.github.io/TensorOperations.jl/stable/

np.einsum_path vs TensorOperations #156

Open ejmeitz opened 7 months ago

ejmeitz commented 7 months ago

Hello,

I was wondering if there is much difference between the contraction order given by numpy.einsum_path and what is done inside TensorOperations. I've looked at the output of @macroexpand and it's not super clear to me what is happening. The specific contraction I'm looking at is @macroexpand @tensor K_true[n,m,l] := psi[i,j,k]*P[i,n]*P[j,m]*P[k,l], and I'm just trying to understand how things are being multiplied under the hood so I can re-implement this on the GPU in a way that takes advantage of sparsity. I've coded up the contraction path that the numpy command outputs, and TensorOperations is still much faster. I believe both have the same computational complexity (N^4) and similar memory footprints.

lkdvos commented 7 months ago

For the contraction you give, there is actually no automatic optimization of the contraction order: the standard left-to-right multiplication is carried out. Nevertheless, you can ask for an optimized order, with various cost measures to optimize over. You can find this in some more detail in this part of the documentation, but if anything is not clear, do let us know!
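To make the left-to-right behaviour concrete, here is a rough sketch of the pairwise contractions that the single @tensor call corresponds to, and of how an optimized order can be requested by annotating index dimensions with @tensoropt. The dimension values are placeholders, and the exact syntax accepted may differ between versions, so check the linked documentation.

using TensorOperations

# Left-to-right evaluation of K_true[n,m,l] := psi[i,j,k]*P[i,n]*P[j,m]*P[k,l]
# amounts to three pairwise contractions:
@tensor T1[j,k,n] := psi[i,j,k] * P[i,n]
@tensor T2[k,n,m] := T1[j,k,n] * P[j,m]
@tensor K_true[n,m,l] := T2[k,n,m] * P[k,l]

# Requesting an optimized contraction order, with (hypothetical) index dimensions
# supplied so the optimizer can weigh the alternatives:
@tensoropt (i=>100, j=>100, k=>100, n=>10, m=>10, l=>10) begin
    K_opt[n,m,l] := psi[i,j,k] * P[i,n] * P[j,m] * P[k,l]
end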

For the rest, what is going on under the hood can be summarised as taking the given contraction and rewriting it in terms of the pairwise operations that are defined in the interface. In that sense, by defining the interface functions for your sparse array type, you could use TensorOperations as is. As a small side note, GPU support is already in there, so you could also have a look at the cuTENSOR extension if you want inspiration as to what it takes to define a new backend/support a new type.
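For reference, a minimal sketch of what using the cuTENSOR extension typically looks like. The package loading details are an assumption and may differ between versions; the idea is simply that @tensor dispatches to GPU kernels once the inputs are GPU arrays.

using TensorOperations
using CUDA, cuTENSOR   # loading these makes the cuTENSOR extension available

psi_d = CuArray(psi)   # move the data to the GPU
P_d   = CuArray(P)

# The same @tensor expression, now carried out on the GPU:
@tensor K_gpu[n,m,l] := psi_d[i,j,k] * P_d[i,n] * P_d[j,m] * P_d[k,l]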

Hope this helps!

ejmeitz commented 7 months ago

I guess left-to-right multiplication is still N^4 complexity, so that's fine. @tensor and @tensoropt give the same results.

I also already use the GPU version, but I'm running into memory issues since my tensors are fairly big (it would be amazing if TensorOperations could automatically distribute an operation across GPUs). Taking advantage of sparsity isn't really faster, but it sure as hell helps with memory. I'm not sure I can overload the TensorOperations interface to use sparsity, since I want to use the GPU. Do you have any recommendations for reducing memory on the GPU (that psi is 99.9% zero)?

While I'm on the topic of memory, I noticed that TensorOperations is missing out on some optimizations to reduce memory usage. I've pasted my two functions below. One is a simple one-liner, and the other is the exact same contraction split into the parts that np.einsum_path prints out. I was able to reuse one of the intermediate arrays, which TensorOperations does not; this reduced the memory by about 1/3.

For example,

function K_tensopt(psi, P)
    @tensoropt begin
        K_true[n,m,l] := psi[i,j,k]*P[i,n]*P[j,m]*P[k,l]
    end
end

[screenshot: benchmark output for K_tensopt]

Whereas this function, which is equivalent (and is just the sequence of pairwise contractions that numpy spits out):

function K_tensopt_parts(psi, phi)

    # first pairwise contraction: contract index i
    @tensor begin
        K1[j,k,n] := psi[i,j,k]*phi[i,n]
    end

    # second pairwise contraction: contract index j
    @tensor begin
        K2[k,m,n] := K1[j,k,n]*phi[j,m]
    end

    # reuse K1 as the output buffer (possible here because all index dimensions match)
    fill!(K1, 0.0)

    @tensor begin
        K1[n,m,l] = K2[k,m,n]*phi[k,l]
    end

    return K1
end

[screenshot: benchmark output for K_tensopt_parts]

lkdvos commented 7 months ago

Thanks for showing that, that's quite interesting! We've never really considered reusing memory from the temporary intermediate objects within the same contraction. The biggest obstacle is that the @tensor macro works at compile time, which means it does not have access to any of the array sizes. As far as I understand, the only reason you can reuse that memory is that you know all of these indices have the same dimensions, which is not true for generic contractions. Nevertheless, in this case you do have the additional information that at least some of the index dimensions are equal, because your input tensors appear repeatedly, so there might be something you can do there. Sadly, this is not something we have implemented, and I am not sure there is a very quick fix.
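As a small illustration of the size issue (with purely made-up dimensions): if the contracted and open indices have different sizes, none of the intermediates can share a buffer.

# Hypothetical sizes, just to illustrate the point:
# psi has indices i, j, k of size 50, and P maps a size-50 index to a size-8 index.
psi = rand(50, 50, 50)
P   = rand(50, 8)

# Left-to-right pairwise contractions then produce intermediates of different sizes:
#   K1[j,k,n]  -> (50, 50, 8)
#   K2[k,m,n]  -> (50, 8, 8)
#   K [n,m,l]  -> (8, 8, 8)
# so K1 cannot be reused as the output buffer. Only when all index dimensions
# coincide, as in your example, does the reuse trick apply.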

Concerning the issue with memory usage: this is definitely an active and ongoing topic of interest. We rewrote part of the machinery to be able to experiment with different backends and allocation strategies, specifically to address this issue. I would recommend having a look at AllocationKit.jl, where I have put some more information as well. These strategies are primarily focused on CPUs at the moment, but it might be that one of the approaches also helps reduce GPU memory pressure.

In any case, if you do figure it out, or have more questions, definitely don't hesitate to contact me. I am also very interested, both in working with GPUs and in dealing with memory pressure.