Open elisno opened 4 years ago
Using the vec-trick with higher order Kronecker products, such as

`((τ1 ⊕ τ2) ⊗ (τ3 ⊕ τ4)) * vec(D)`

should result in the lower-order products `(τ1 ⊕ τ2)` and `(τ3 ⊕ τ4)` being collected.

In my use case, the size of each (sparse) matrix is around `N x N = 128 x 128`. These matrices are stored in `τ::Array{T<:Complex,4}` with `size(τ) = (N,N,M,L)`. The sparsity pattern along `M` and `L` is assumed to be the same. This would be more efficient if (sparse) workspace matrices were used to store the intermediate results, reducing the number of allocations.
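As a reminder of the identity being exploited, here is a small sketch of the vec trick applied to collected Kronecker sums (the sizes and matrix names are illustrative, not taken from the actual use case):

```julia
using LinearAlgebra, SparseArrays

N = 4
τ1, τ2, τ3, τ4 = sprand(N, N, 0.3), sprand(N, N, 0.3), sprand(N, N, 0.3), sprand(N, N, 0.3)
Id = sparse(1.0I, N, N)
A = kron(τ1, Id) + kron(Id, τ2)   # collect(τ1 ⊕ τ2)
B = kron(τ3, Id) + kron(Id, τ4)   # collect(τ3 ⊕ τ4)
D = rand(N^2, N^2)

# The vec trick: (A ⊗ B) * vec(D) == vec(B * D * transpose(A)),
# so the N⁴ × N⁴ product kron(A, B) never has to be formed.
kron(A, B) * vec(D) ≈ vec(B * D * transpose(A))  # true
```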
Last year, some work was started on `kron!` (JuliaLang/julia#31069) that computes the Kronecker product in-place.

- Should this implementation be used? If so, should we wait for the PR to go through, or write these methods in `Kronecker.jl`?
- Would it be enough to apply this to `KroneckerProduct`s, or do we need methods that specialize on other types, such as `KroneckerSum`s and `IndexedKroneckerProduct`s?
- Currently, `sparse` collects `KroneckerSum`s. Shouldn't it be lazy by default to accommodate these in-place methods? I'm not sure how to generalize over all types, but do something like: `sparse(K::KroneckerSum) = sparse(K.A) ⊕ sparse(K.B)`
I like this idea. I guess we could add a function `collectin!(C, K)` or `collect!(C, K)`, whatever is more idiomatic. I'm not sure what the best way is to use `kron!`, as it is not yet shipped with Base. We might copy the code for now and follow it up when `kron!` becomes available in a stable version?

This trick will likely only work with systems containing a single Kronecker product, unless the code becomes more involved.
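A minimal sketch of what such a method might look like (the name `collect!` and the hand-rolled `kron_inplace!` standing in for the not-yet-available `Base.kron!` are assumptions, not the package API):

```julia
# Hypothetical collect!(C, K) for a lazy Kronecker product with fields A and B.
# kron_inplace! writes kron(A, B) into the pre-allocated C without allocating.
function kron_inplace!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    p, q = size(B)
    for j in axes(A, 2), i in axes(A, 1)
        @views C[(i-1)*p+1:i*p, (j-1)*q+1:j*q] .= A[i, j] .* B
    end
    return C
end

collect!(C::AbstractMatrix, K) = kron_inplace!(C, K.A, K.B)
```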
> Currently, `sparse` collects `KroneckerSum`s. Shouldn't it be lazy by default to accommodate these in-place methods? I'm not sure how to generalize over all types, but do something like: `sparse(K::KroneckerSum) = sparse(K.A) ⊕ sparse(K.B)`

Since `A` and `B` are often dense, I don't directly see how this would make sense. As far as I am aware, many of the shortcuts that work with Kronecker products don't work with Kronecker sums. So, in many cases, you would want to work with the sparse version of the Kronecker sum, except for matrix exponentiation. Can you give some examples of how this would be beneficial?
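For context, matrix exponentiation is the one case where keeping the sum lazy clearly pays off, since `exp(A ⊕ B) = exp(A) ⊗ exp(B)` (the two terms of the Kronecker sum commute). A quick check, with arbitrary sizes:

```julia
using LinearAlgebra

A = rand(3, 3)
B = rand(4, 4)
# The Kronecker sum A ⊕ B = A ⊗ I + I ⊗ B, collected densely here:
K = kron(A, Matrix(I, 4, 4)) + kron(Matrix(I, 3, 3), B)
# exp of a Kronecker sum factors into a Kronecker product of exponentials:
exp(K) ≈ kron(exp(A), exp(B))  # true
```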
This is a good idea. It is simple to either

- update the vec trick, such that vector products with Kronecker products of two Kronecker sums (or one) immediately collect them into sparse matrices, or
- update `Kronecker` directly, such that when one of the matrices is a Kronecker sum, it collects the sparse forms.

I am leaning towards the second option. What do you think?
Before, I've resorted to using the first option by collecting each sum (and manually writing out the vec-trick):

```julia
collect(τ3 ⊕ τ4) * lmul!(D, collect(transpose(τ1 ⊕ τ2)))
```

In my use case, I use `lmul!` because `D` is `Diagonal`.

Each `τ` is essentially 4 versions of the sparse matrix `t` (found in https://gist.github.com/elisno/b49ad947046180917bc8da400b57ca7e). The `τ`s are operated on by `transpose()`, `adjoint()`, `conj()` (and `identity()`, technically). `τ1` and `τ2` are multiplied by a scalar.

Explicitly collecting the sums (out-of-place) becomes expensive because I have hundreds / thousands of `t` matrices that appear to have the same sparsity structure. I'd rather collect the non-zero values of `t` and update `τx.nzval .= f(t.nzval)` in some way. My current version is hard to read and is not very robust.
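A sketch of the pattern-reuse idea (`f` is a placeholder for whatever per-entry transformation is needed; the names are illustrative):

```julia
using SparseArrays

t = sprand(8, 8, 0.2)
τx = copy(t)                 # τx shares the sparsity pattern of t
f(v) = 2 .* v                # placeholder transformation of the nonzeros
τx.nzval .= f(t.nzval)       # refresh only the stored values, no reallocation
```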
Do I understand correctly that the second option would allow updating the non-zero values of `τ1`, `τ2`, `τ3` and `τ4` in-place? I.e. after updating, each term of the Kronecker product would still be a `KroneckerSum`? If so, this is a clever and clean solution. I'd most likely start using the "higher order vec-trick" with `vec(D)`.
> I guess we could add a function `collectin!(C, K)` or `collect!(C, K)`, whatever is more idiomatic. Not sure what is the best way to use `kron!` as it is not shipped yet with Base.

Perhaps a field for the matrix `C` should be added to `KroneckerSum` (similarly to the identity matrices discussed in #62)?
> Since `A` and `B` are often dense, I don't directly see how this would make sense. […] Can you give some examples of how this would be beneficial?

I'd say this is only important for `collect`/`collect!` when a Kronecker or KroneckerSum is composed of small dense matrices, based on the benchmarks in https://github.com/MichielStock/Kronecker.jl/pull/64#issue-414941737. Admittedly, I don't have any specific usage for this, as I need to collect sums of larger, sparse matrices.

I'd rather consider this as a simple (optional) utility function we can use to store large (dense or sparse) matrices as sparse matrices in higher order products/sums, and allow the user to update them in-place in a manner of their choosing.
Here's a dummy example calculating a linear combination of Kronecker products, Kronecker sums, or the higher order example at the top of this issue.
```julia
# Small dense matrices
A = rand(4,4);
B = rand(3,3);
Anew = rand(4,4);
Bnew = rand(3,3);
result = kron(A,B) + kron(Anew, Bnew);

C = zeros(12,12);
K = A ⊗ B; # or ⊕
C .+= collect(K) # or an equivalent in-place method
K.A .= Anew;
K.B .= Bnew;
C .+= collect(K) # or an equivalent in-place method
C ≈ result
```
If the sizes of `A`, `B`, etc. increase, I'd think one would only have to change how `K` is constructed and initialized:

```julia
K = sparse(A ⊗ B) # or ⊕; should not automatically collect to a sparse matrix
...
K.A.nzval .= vec(Anew)
K.B.nzval .= vec(Bnew)
```
Hi, I came across @elisno's message on Discourse and his comment under my PR https://github.com/JuliaLang/julia/pull/31069, so I'll just reply to both here to make it easier to keep track.

Generally speaking, the implementations of `kron!` and `batched_kron!` in our package Yao are mainly written for tests: https://github.com/QuantumBFS/YaoBase.jl/blob/master/src/utils/math.jl#L139

They were not well optimized for performance, since in quantum circuits there is much more room to optimize performance for quantum gates specifically. Thus the sparse matrix types defined in LuxurySparse are not specifically optimized either, but in principle, specializing `kron!` and `batched_kron!` on these matrix types should give a speedup. I'm glad to guide you guys if you want to work on this to make our implementation more general.

Regarding compatibility, you can depend on `YaoBase` for `kron!` interface compatibility across all Julia 1.x versions, since we will need to maintain that anyway.
I've looked over the `kron`/`kron!` and `batched_kron`/`batched_kron!` implementations in the Yao package suite. The performance of the dense products is quite impressive (particularly with `batched_kron` on the GPU). I think it would be a good idea to use `YaoBase` for `kron!` until it's part of `Base`.

In CuYao.jl, `CuArray` works with `kron`, but I think it's not too difficult to add `kron!` as well. Kronecker.jl would finally have GPU compatibility if it depends on CuYao: https://github.com/QuantumBFS/CuYao.jl/blob/master/src/CUDApatch.jl#L75-L91

I'm not sure how it would work with higher order products / sums or what lazy batches would look like, but `batched_kron!` could be used when collecting a linear combination of `KroneckerProduct`s or `KroneckerSum`s.

@MichielStock, what are your thoughts on using `⊗` and `⊕` for `Array{T,3}` (and hopefully `CuArray{T,3}`)?
I'm still trying to figure out how to collect (sparse) `KroneckerSum`s on GPUs. The kernel for dense `kron` is straightforward: each thread gets a specific `state` of the resulting product. The kernel for `batched_kron` launches a thread for each batch.

I'm not sure how the kernel for `kron` would look for `CuSparseMatrixCSC`. Would the kernel replace the two outermost loops in the CPU implementation of sparse `kron!`?
I think we could consider moving all `kron!` and `batched_kron!` implementations to LuxurySparse.jl, which probably makes more sense; you could also support more matrix types from there. What do you guys think? @MichielStock @elisno
For this issue, it would make sense to use LuxurySparse.jl for sparse `kron!` only on the CPU. I'll have to think more about how `batched_kron!` could be used with the current Kronecker types. I'll leave the GPU implementation for another issue later.
Yeah, I think the best way to support GPUs is to have a new package for them specifically, since a separate package is currently the best way to handle a conditional dependency (as I discussed with Stefan a year ago), though there are some plans to support conditional dependencies in Pkg. So a new special array package made specifically for GPUs would make more sense.
> Before, I've resorted to using the first option by collecting each sum (and manually writing out the vec-trick): `collect(τ3 ⊕ τ4) * lmul!(D, collect(transpose(τ1 ⊕ τ2)))` […] I'd rather collect the non-zero values of `t` and update `τx.nzval .= f(t.nzval)` in some way.
Are you sure you need to use `collect`? If your expression contains only a single Kronecker sum/product, it is faster to use the vec trick, which it falls back to automatically.
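For reference, with a single lazy product the multiplication already avoids materializing the full matrix; a sketch of standard Kronecker.jl usage (sizes are arbitrary):

```julia
using Kronecker, LinearAlgebra

A = rand(50, 50)
B = rand(60, 60)
v = rand(50 * 60)
K = A ⊗ B            # lazy KroneckerProduct; nothing is collected
y = K * v            # multiplication uses the vec trick internally
y ≈ kron(A, B) * v   # true
```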
Ok, I will summarize the discussion (as I see it):

- Add a `collectin!(C, K)` function that collects `GeneralizedKroneckerProduct`s in `C`; this should work for all kinds of Kronecker systems (kron prod, kron sum, kron exp, etc.).
- Support sums of `GeneralizedKroneckerProduct`s, i.e. `A ⊗ B + C ⊗ D`. In such cases, one can then use pre-allocated matrices to collect them.

Do I have the gist of it? If so, I would move these to respective issues.

Further thoughts:

- Supporting `⊗` and `⊕` for `Array{T,3}`: if this can be branched off from the root of my type tree as separate functionality, I am game.
- `kron!` is ideally imported from another package.