Closed ronisbr closed 4 months ago
Hi!
This is a complicated problem. So far no one has implemented the right algorithm in StaticArrays.jl. See here for the 3x3 case for example: https://pages.cs.wisc.edu/%7Esifakis/papers/SVD_TR1690.pdf . If you want to work on it, I could review your code but I don't have time to work on it myself.
Thanks @mateuszbaran !
Implementing a state-of-the-art algorithm is indeed very difficult (I was reading the BLAS code). However, since an `SMatrix` is supposed to be small, I am wondering if we can implement a naive approach. Since it will not allocate, maybe we can get better performance than what we have now (converting to `Matrix` and using BLAS). What do you think?
Just for comparison, I implemented a QR-based SVD decomposition, and it was 50% faster than the current `svd` for 4×4 matrices. However, it became slower when the dimension was higher than 7. I have no idea if such an approach would be good here. What do you think?
Eigen uses the complete orthogonal decomposition to compute the pseudo-inverse, which seems a little better than computing the SVD for this application. However, it requires a pivoted QR, for which StaticArrays also falls back to the algorithm in Base by converting to `Matrix` :)
Now, let's see how difficult it is to implement pivoted QR here.
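To make the idea concrete (a generic sketch with a made-up name, `pinv_qr`, not StaticArrays code): for a full-column-rank matrix, a pivoted QR already yields the pseudo-inverse; the complete orthogonal decomposition extends this to rank-deficient inputs by re-factorizing the trapezoidal `R`, which is omitted here:

```julia
using LinearAlgebra

# Sketch: pseudo-inverse via pivoted QR, valid for full column rank only.
# `pinv_qr` is a hypothetical name; rank-deficient inputs would need the
# additional complete-orthogonal-decomposition step on top of this.
function pinv_qr(A::AbstractMatrix)
    F = qr(A, ColumnNorm())      # A[:, F.p] = Q * R
    X = F.R \ Matrix(F.Q)'       # thin Q, so X = R⁻¹ Qᵀ
    return X[invperm(F.p), :]    # undo the column pivoting
end
```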
On second thought, although having a Julia version of most BLAS functions would be awesome, it is really a Herculean task. However, Julia has such a good interface with BLAS (`svd`, `qr`, etc. all use functions from it) that I was thinking: "why can't we have a direct interface here in StaticArrays that tries to avoid allocations?"

Since some Julia version, the compiler has become smart enough to avoid all allocations when using an `MMatrix`, provided nothing escapes the scope and you convert to an `SMatrix` at the end. Hence, I created a simple code to compute the SVD for static matrices using this approach:
```julia
function blas_svd(A::StaticMatrix{N, M, T}) where {N, M, T}
    K = min(N, M)

    # Convert the input to an `MMatrix` and allocate the output buffers.
    Am  = MMatrix{N, M, T}(A)
    Um  = MMatrix{N, K, T}(undef)
    Sm  = MVector{K, T}(undef)
    # NOTE: these dimensions (and the leading dimensions below) have only
    # been checked for square matrices so far.
    Vtm = MMatrix{M, K, T}(undef)

    work  = MVector{1_000, Float64}(undef)
    lwork = -1
    info  = Ref{BLAS.BlasInt}(0)

    # Call the function with `lwork = -1` to obtain the optimum size of the
    # work array (written into `work[1]`).
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8}, Ref{UInt8}, Ref{BLAS.BlasInt}, Ref{BLAS.BlasInt},
            Ptr{T}, Ref{BLAS.BlasInt}, Ptr{T}, Ptr{T}, Ref{BLAS.BlasInt},
            Ptr{T}, Ref{BLAS.BlasInt}, Ptr{T}, Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S', 'S', N, M, Am, N, Sm, Um, N, Vtm, M, work, lwork, info
    )

    # Obtain the optimal work array size and compute the SVD.
    lwork = round(BLAS.BlasInt, work[1])
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8}, Ref{UInt8}, Ref{BLAS.BlasInt}, Ref{BLAS.BlasInt},
            Ptr{T}, Ref{BLAS.BlasInt}, Ptr{T}, Ptr{T}, Ref{BLAS.BlasInt},
            Ptr{T}, Ref{BLAS.BlasInt}, Ptr{T}, Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S', 'S', N, M, Am, N, Sm, Um, N, Vtm, M, work, lwork, info
    )

    # Convert the matrices to static arrays and return.
    U  = SMatrix{N, K, T}(Um)
    S  = SVector{K, T}(Sm)
    Vt = SMatrix{M, K, T}(Vtm)

    return U, S, Vt'
end
```
The result is the same as using `svd` for square matrices (I have not debugged the non-square case yet):
```julia-repl
julia> A = @SMatrix randn(6, 6)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
 -0.981944   -1.12979     0.310915   0.179723    -0.667651    0.182524
  1.09949     0.522666   -0.0878819  -0.0383786  -1.17082    -0.219113
 -1.07534     0.961367   -1.12684    -0.23116    -0.0602622  -1.38193
  0.0982899  -0.678585   -2.14742     0.164939   -1.12818     1.33526
 -0.60989     1.2469     -0.238621   -0.237353    0.695344   -0.514536
 -1.39276     0.353875   -0.551829   -0.283986    0.207744   -1.20008

julia> U1, S1, V1 = svd(A);

julia> U2, S2, V2 = blas_svd(A);

julia> U1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537   0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633   0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549   0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565   0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199   0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372   0.294871    0.246531    0.173895  -0.233973    0.720358

julia> U2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537   0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633   0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549   0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565   0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199   0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372   0.294871    0.246531    0.173895  -0.233973    0.720358

julia> V1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655      -0.331218   -0.695862   0.0226213   -0.422302   0.0109475
 -0.47389      -0.0241441  -0.658351  -0.137571     0.559031  -0.0998872
 -0.00418107   -0.872921    0.233316   0.290965     0.312066   0.038845
  0.135771     -0.0163176   0.071462   0.00309199   0.0228163 -0.987756
 -0.344044     -0.355242    0.125355  -0.780332    -0.359093  -0.0430901
  0.641491      0.0354597   0.0842224 -0.535699     0.531311   0.104279

julia> V2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655      -0.331218   -0.695862   0.0226213   -0.422302   0.0109475
 -0.47389      -0.0241441  -0.658351  -0.137571     0.559031  -0.0998872
 -0.00418107   -0.872921    0.233316   0.290965     0.312066   0.038845
  0.135771     -0.0163176   0.071462   0.00309199   0.0228163 -0.987756
 -0.344044     -0.355242    0.125355  -0.780332    -0.359093  -0.0430901
  0.641491      0.0354597   0.0842224 -0.535699     0.531311   0.104279

julia> S1
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

julia> S2
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782
```
Now, let's analyze the allocation and execution time for different matrix sizes between the two approaches:
| Rows | Columns | `svd` time | `svd` alloc. | `blas_svd` time | `blas_svd` alloc. | Gain [%] |
|------|---------|------------|--------------|-----------------|-------------------|----------|
| 2    | 2       | 0.954      | 8            | 0.64            | 0                 | 32.914   |
| 3    | 3       | 1.488      | 8            | 1.175           | 0                 | 21.0349  |
| 4    | 4       | 2.296      | 8            | 1.908           | 0                 | 16.899   |
| 5    | 5       | 3.042      | 8            | 2.653           | 0                 | 12.7876  |
| 6    | 6       | 3.713      | 8            | 3.26            | 0                 | 12.2004  |
| 7    | 7       | 5.312      | 8            | 4.774           | 0                 | 10.128   |
| 8    | 8       | 5.923      | 8            | 5.147           | 0                 | 13.1015  |
| 9    | 9       | 7.25       | 8            | 6.683           | 0                 | 7.82069  |
| 10   | 10      | 9.166      | 8            | 8.43            | 0                 | 8.02967  |
Notice that we have a noticeable gain with small matrices without any allocation (my use case).
@mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?
For my project, I will implement this type of function to compute the `pinv`.
> @mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?
Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to have a second opinion on whether it's worth it.
Directly calling BLAS would have to be done a bit more carefully than your `blas_svd`. First, I would limit it to `blas_svd(A::StaticMatrix{N, M, Float64})`.
> Call the function to obtain the optimum size of the work array.
Querying doesn't make sense if you don't allocate the requested amount of memory on the heap. You have a fixed size there (1000 elements), and that is what you have to provide. I'd suggest, though, a different approach: use a formula from the LAPACK docs (`MAX(3*MIN(M,N)+MAX(M,N), 5*MIN(M,N))` or something larger), see here: https://docs.oracle.com/cd/E19422-01/819-3691/dgesvd.html .
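Since the static dimensions are known at compile time, that documented minimum could be evaluated up front; a sketch, where `dgesvd_lwork_min` is a made-up name:

```julia
# Sketch: LAPACK's documented minimum LWORK for dgesvd,
# MAX(3*MIN(M,N) + MAX(M,N), 5*MIN(M,N)).
dgesvd_lwork_min(m::Integer, n::Integer) =
    max(3 * min(m, n) + max(m, n), 5 * min(m, n))
```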
I'd also suggest using `LinearAlgebra.LAPACK.chklapackerror` to detect any possible errors.
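For example (a sketch only; `chklapackerror` is an internal helper in `LinearAlgebra.LAPACK`, so this assumes its current behavior of throwing for a nonzero INFO value):

```julia
using LinearAlgebra

# Sketch: forward the INFO flag written by the ccall to LAPACK's checker.
info = Ref{LinearAlgebra.BLAS.BlasInt}(0)
# ... the `dgesvd_` ccall would go here, writing its status into `info` ...
LinearAlgebra.LAPACK.chklapackerror(info[])  # throws if `info[]` != 0
```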
Hi @mateuszbaran!
Thanks for the tips.
> Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to have a second opinion if it's worth it.
I see. Maybe I will create a package called StaticArrayBlasInterface to add those algorithms (important to my use case), and afterward we can decide whether to migrate them here.
Hi @mateuszbaran!
I finished the initial code of the package (still needs polishing and error verification):
https://github.com/ronisbr/StaticArraysBlasInterfaces.jl
It is working perfectly for `Float64` and `Float32` matrices. I overloaded only the `svd` function, but it also provides an allocation-free `pinv` (because computing the SVD was the only part of `pinv` that allocated).
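As an illustration of why that falls out naturally (this is not the package's actual code), the Moore-Penrose pseudo-inverse can be assembled directly from an SVD; a minimal sketch with simplified tolerance handling, where `pinv_from_svd` is a made-up name:

```julia
using LinearAlgebra

# Sketch: pseudo-inverse built from an SVD. If `svd` does not allocate,
# the remaining steps are cheap matrix and elementwise operations.
function pinv_from_svd(A::AbstractMatrix{T};
                       rtol = eps(real(float(one(T)))) * minimum(size(A))) where {T}
    U, S, V = svd(A)
    tol  = rtol * maximum(S; init = zero(eltype(S)))
    Sinv = [s > tol ? inv(s) : zero(s) for s in S]  # invert only significant singular values
    return V * Diagonal(Sinv) * U'
end
```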
I also opened a thread on Discourse to discuss the implementation further:
Thanks!
Hi! Making a new package is a good solution for now. I'm not knowledgeable about every aspect of StaticArrays and linear algebra, and maintainer availability here is sometimes less than ideal, so working around it is sometimes necessary.
I think it is fine to include it here. The BLAS interface file in LinearAlgebra has basically not changed since the start, and I expect the same for this new code, so it shouldn't require any maintenance. It is a strict improvement, and StaticArrays already depends on LinearAlgebra.
OK, then let's put it in StaticArrays.jl.
It would be good to at least measure the impact on `using StaticArrays` with/without the additional code, though.
Perfect! I will submit the PR and perform all the analysis.
Done @mateuszbaran and @fredrikekre! I submitted the PR for review. If it gets merged, I can also implement the same approach for other functions that allocate due to calling functions in Julia Base (like `qr`).
Thank you very much @mateuszbaran and @fredrikekre for your help in adding this feature to StaticArrays.jl :)
Thank you for your contribution :slightly_smiling_face:
Hi!
When computing the SVD of a static matrix, we see some allocations, which are probably caused by the algorithm just converting the input to `Matrix` and calling the `svd` in Base.

The problem here is that `svd` is used to compute `pinv`, which is heavily used when inverting matrices for embedded Kalman filters.

I am pursuing embedding a full satellite attitude control algorithm written entirely in Julia. However, I must avoid all allocations. In this specific case, I have not yet found a workaround to avoid them. Can anyone help me?
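For context on the Kalman-filter use case above, here is a minimal, illustrative sketch of where `pinv` enters the update step (all names are generic examples, not from any actual flight-software codebase):

```julia
using LinearAlgebra

# Illustrative Kalman filter measurement update; matrices are toy values.
H = [1.0 0.0; 0.0 1.0]      # measurement matrix
P = [0.5 0.1; 0.1 0.4]      # state covariance
R = 0.01 * Matrix(I, 2, 2)  # measurement noise covariance
S = H * P * H' + R          # innovation covariance
K = P * H' * pinv(S)        # Kalman gain; `pinv` guards against near-singular S
```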