Collect GeneralizedKroneckerProducts in-place

elisno commented 4 years ago

Using the vec-trick with higher order Kronecker products, such as

((τ1 ⊕ τ2) ⊗ (τ3 ⊕ τ4)) * vec(D)

should result in the lower-order products ( (τ1 ⊕ τ2) and (τ3 ⊕ τ4) ) being collected.

In my use case, the size of each (sparse) matrix is around N x N = 128 x 128. These matrices are stored in τ::Array{T<:Complex,4} with size(τ)= (N,N,M,L) . The sparsity pattern along M and L is assumed to be the same. This would be more efficient if (sparse) workspace matrices are used to store the intermediate results, reducing the number of allocations,

Last year, some work was started on kron! (https://github.com/JuliaLang/julia/pull/31069) that computed the Kronecker product in-place.

Should this implementation by used? If so, should we wait for the PR to go through, or write these methods in Kronecker.jl?
Would it be enough to apply this to KroneckerProducts or do we need methods that specialize on other types, such as KroneckerSums and IndexedKroneckerProducts?
Currently sparse collects KroneckerSums. Shouldn't it be lazy by default to accommodate these in-place methods? I'm not sure how to generalize over all types, but do something like:

sparse(K::KroneckerSum) = sparse(K.A) ⊕ sparse(K.B)

MichielStock commented 4 years ago

Using the vec-trick with higher order Kronecker products, such as
((τ1 ⊕ τ2) ⊗ (τ3 ⊕ τ4)) * vec(D)
should result in the lower-order products ( (τ1 ⊕ τ2) and (τ3 ⊕ τ4) ) being collected.

In my use case, the size of each (sparse) matrix is around N x N = 128 x 128. These matrices are stored in τ::Array{T<:Complex,4} with size(τ)= (N,N,M,L) . The sparsity pattern along M and L is assumed to be the same. This would be more efficient if (sparse) workspace matrices are used to store the intermediate results, reducing the number of allocations,

This is good idea. It is simple to either

update the vec trick, such that vector products with Kronecker products of two Kronecker sums (or one) immediately collects them into sparse matrices.
update Kronecker directly such that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option to be the better one. What do you think?

MichielStock commented 4 years ago

Last year, some work was started on kron! (JuliaLang/julia#31069) that computed the Kronecker product in-place.

Should this implementation by used? If so, should we wait for the PR to go through, or write these methods in Kronecker.jl?

Would it be enough to apply this to KroneckerProducts or do we need methods that specialize on other types, such as KroneckerSums and IndexedKroneckerProducts?

I like this idea. I guess we could add a function collectin!(C, K) or collect!(C, K), whatever is more idiomatic. Not sure what is the best way to use kron! as it is not shipped yet with the Base. We might copy the code for now and follow it up whenkron! becomes available in a stable version?

This trick will likely work with systems containing only a single Kronecker product, unless the code becomes more involved.

Currently sparse collects KroneckerSums. Shouldn't it be lazy by default to accommodate these in-place methods? I'm not sure how to generalize over all types, but do something like:
sparse(K::KroneckerSum) = sparse(K.A) ⊕ sparse(K.B)

Since A and B are often dense, I don't directly see how this would make sense. As I am aware, many of the shortcuts don't work with Kronecker sums as they work with Kronecker products. So, in many cases, you would want to work with the sparse version of the Kronecker sum, except for matrix exponentiation. Can you give some examples of how this would be beneficial?

elisno commented 4 years ago

This is good idea. It is simple to either

update the vec trick, such that vector products with Kronecker products of two Kronecker sums (or one) immediately collects them into sparse matrices.

update Kronecker directly such that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option to be the better one. What do you think?

Before, I've resorted to using the first option by collecting each sum (and manually writing out the vec-trick)

collect(τ3 ⊕ τ4) * lmul!(D, collect(transpose(τ1 ⊕ τ2)))

In my use case, I use lmul! because D is Diagonal. I

Each τ is essentially 4 versions of the sparse matrix t (found in https://gist.github.com/elisno/b49ad947046180917bc8da400b57ca7e).

The τs are operated on by transpose(), adjoint(), conj() (and identity(), technically). τ1 and τ2 are multiplied by a scalar.

Explicitly collecting the sums (out-of-place) becomes expensive because I have hundreds / thousands of t matrices that appear to have the same sparsity structure. I'd rather collect the non-zero values of t and update τx.nzval .= f(t.nzval) in some way.

My current version is hard to read and is not very robust.

Do I understand correctly that the second option would allow updating the non-zero values of τ1, τ2, τ3 and τ4 in-place? I.e. after updating, each term of Kronecker would still be a KroneckerSum? If so, this is a clever and clean solution. I'd most likely start using the "higher order vec-trick" with vec(D).

I guess we could add a function collectin!(C, K) or collect!(C, K), whatever is more idiomatic. Not sure what is the best way to use kron! as it is not shipped yet with the Base.

Perhaps a field for the matrix C should be added to KroneckerSum (similarily to the identity matrices discussed in #62)?

Since A and B are often dense, I don't directly see how this would make sense. As I am aware, many of the shortcuts don't work with Kronecker sums as they work with Kronecker products. So, in many cases, you would want to work with the sparse version of the Kronecker sum, except for matrix exponentiation. Can you give some examples of how this would be beneficial?

I'd say this is only important for collect/collect! when a Kronecker or KroneckerSum is composed of small dense matrices, based on the benchmarks in https://github.com/MichielStock/Kronecker.jl/pull/64#issue-414941737.

Admittedly, I don't have any specific usage for this, as I need to collect sums of larger, sparse matrices.

I'd rather consider this as a simple (optional) utility function we can use to store large (dense or sparse) matrices as sparse matrices in higher order products/sums. and allow the user to update them in-place in a manner of their choosing.

Here's a dummy example with calculating a linear combination of Kronecker products, Kronecker sums or the higher order example at the top of this issue.

# Small dense matrices

A = rand(4,4);
B = rand(3,3);
Anew = rand(4,4);
Bnew = rand(3,3);

result = kron(A,B) + kron(Anew, Bnew);
C = zeros(12,12);

K = A ⊗ B; # or ⊕
C .+= collect(K) # or an equivalent in-place method

K.A .= Anew;
K.B .= Bnew;

C .+= collect(K) # or an equivalent in-place method

C ≈ result

If the sizes of A , B, etc. increase, I'd think one would only have to change how K is constructed and initialized:

K = sparse(A ⊗ B) # or ⊕, should not automatically collect to a sparse matrix
...
K.A.nzval .= vec(Anew)
K.B.nzval .= vec(Bnew)

Roger-luo commented 4 years ago

Hi I came across from @elisno message on discourse and his comment under my PR https://github.com/JuliaLang/julia/pull/31069, I'll just reply both here so it's easier to keep track.

Generally speaking, the implementation of kron! and batched_kron! in our package Yao is mainly written for tests. https://github.com/QuantumBFS/YaoBase.jl/blob/master/src/utils/math.jl#L139

It was not well optimized for performance since in quantum circuits there are much more space to optimize the performance for quantum gates specifically. Thus sparse matrices type defined in LuxurySparse is not optimized specifically, but in principle by specializing kron! and batched_kron! on these matrice types should give a speedup. I'm glad to guide you guys if you want to work on this to make our implementation more general.

Regarding compatibility, you can depend on YaoBase for kron! interface compatibility across all Julia 1.x versions since we will need maintain that anyways.

elisno commented 4 years ago

I've looked over the kron/kron! and batched_kron/batched_kron! implementations in the Yao package suite. The performance of the dense products is quite impressive (particularly with batched_kron on the GPU).

I think it would be a good idea to use YaoBase for kron! until it's a part of Base.

In CuYao.jl, CuArray works with kron but I think it's not too difficult to add kron! as well. Kronecker.jl would finally have GPU compatibility if it depends on CuYao. https://github.com/QuantumBFS/CuYao.jl/blob/master/src/CUDApatch.jl#L75-L91

I'm not sure how it would work with higher order products / sums or how lazy batches would look like, but batched_kron! could be used when collecting a linear combination of KroneckerProducts or KroneckerSums.

@MichielStock, what are your thoughts on using using ⊗ and ⊕ for Array{T,3} (and hopefully CuArray{T,3})?

I'm still trying to figure out how to collect (sparse) KroneckerSums on GPUs.

The kernel for dense kron is straight forward: Each thread gets a specific state of the resulting product.
The kernel for batched_kron launches a thread for each batch.
I'm not sure how the kernel for kron would look for CuSparseMatrixCSC. Would the kernel replace the two outermost loops in the CPU implementation of sparse kron!?

Roger-luo commented 4 years ago

I think we could consider move all kron! and batched_kron! to LuxurySparse.jl which probably makes more sense, you could also support more matrix types from there. What do you guys think? @MichielStock @elisno

elisno commented 4 years ago

For this issue, it would make sense to use LuxurySparse.jl for sparse kron! only for the CPU. I'll have to think more about how batched_kron! could be used with the current Kronecker types.

I'll leave the GPU implementation for another issue later.

Roger-luo commented 4 years ago

Yeah, I think the best way to support GPU is to have a new package for GPUs specifically since the best way to support conditional dependency is to have a new package as I discussed with Stefan a year ago, tho there are some plans to support conditional dependency in Pkg.

So a new special array package made for GPUs specifically would make more sense.

MichielStock commented 4 years ago

This is good idea. It is simple to either

update the vec trick, such that vector products with Kronecker products of two Kronecker sums (or one) immediately collects them into sparse matrices.

update Kronecker directly such that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option to be the better one. What do you think?

Before, I've resorted to using the first option by collecting each sum (and manually writing out the vec-trick)
collect(τ3 ⊕ τ4) * lmul!(D, collect(transpose(τ1 ⊕ τ2)))
In my use case, I use lmul! because D is Diagonal. I

Each τ is essentially 4 versions of the sparse matrix t (found in https://gist.github.com/elisno/b49ad947046180917bc8da400b57ca7e).

The τs are operated on by transpose(), adjoint(), conj() (and identity(), technically). τ1 and τ2 are multiplied by a scalar.

Explicitly collecting the sums (out-of-place) becomes expensive because I have hundreds / thousands of t matrices that appear to have the same sparsity structure. I'd rather collect the non-zero values of t and update τx.nzval .= f(t.nzval) in some way.

Are you sure you need to use collect? If your expression only contains only a single Kronecker sum/product it is faster to use to vec trick, which it falls back to automatically.

MichielStock commented 4 years ago

Ok, I will summarize the discussion (as I see it)

we need a collectin!(C, K) function that collects GeneralizedKroneckerProducts in C, this should work for all kinds of Kronecker systems (kron prod, kron sum, kron exp etc).
I tend to add a new type of Kronecker for when you add multiple GeneralizedKroneckerProducts, i.e. A ⊗ B + C ⊗ D. In such cases, one can then use pre-allocated matrices to collect them.
If the above works, we might need a separate package to extend this to GPU.

Do I have the gist of it? If so, I would move these to respective issues.

Further thoughts:

not sure if the current package can support Array{T,3}. If this can be branched from the root of my type three in separate functionality, I am game.
kron! is ideally imported from another package.
also caution we don't try to emulate the functionality of TensorOperations.

MichielStock / Kronecker.jl

Collect GeneralizedKroneckerProducts in-place #61