JuliaArrays / StaticArrays.jl

Statically sized arrays for Julia

svd allocates even with small matrices #1255

Closed ronisbr closed 4 months ago

ronisbr commented 4 months ago

Hi!

When computing the SVD of a static matrix, we see some allocations:

julia> using StaticArrays, BenchmarkTools

julia> S = @SMatrix rand(3, 3);

julia> @btime StaticArrays.svd($S)
  1.333 μs (8 allocations: 2.48 KiB)

which is probably because the algorithm just converts the input to a Matrix and calls the svd in Base.
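If that hypothesis is right, the current behavior is roughly equivalent to this sketch (simplified for illustration; not the actual StaticArrays.jl source):

using StaticArrays, LinearAlgebra

function svd_via_matrix(A::SMatrix{N, M, T}) where {N, M, T}
    F = svd(Matrix(A))    # the conversion and the svd call both allocate
    K = min(N, M)
    return SMatrix{N, K, T}(F.U), SVector{K, T}(F.S), SMatrix{M, K, T}(F.V)
end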

The problem here is that svd is used to compute pinv, which is heavily used when inverting matrices for embedded Kalman filters.

I am working to embed a full satellite attitude control algorithm written entirely in Julia. However, I must avoid all allocations, and in this specific case I have not yet found a workaround. Can anyone help me?

mateuszbaran commented 4 months ago

Hi!

This is a complicated problem. So far no one has implemented the right algorithm in StaticArrays.jl. See here for the 3x3 case for example: https://pages.cs.wisc.edu/%7Esifakis/papers/SVD_TR1690.pdf . If you want to work on it, I could review your code but I don't have time to work on it myself.

ronisbr commented 4 months ago

Thanks @mateuszbaran !

Implementing a state-of-the-art algorithm is indeed very difficult (I was reading the LAPACK code). However, since an SMatrix is supposed to be small, I am wondering if we can implement a naive approach. Since it will not allocate, maybe we can get better performance than what we have now (converting to Matrix and calling into BLAS/LAPACK). What do you think?
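For illustration, one naive approach would be a one-sided Jacobi SVD. A minimal sketch (assuming N ≥ M and full rank, with a fixed sweep count; it is not tuned for zero allocations and is not what StaticArrays.jl ended up adopting):

using StaticArrays, LinearAlgebra

function jacobi_svd(A::SMatrix{N, M, Float64}) where {N, M}
    U = MMatrix{N, M, Float64}(A)   # columns get orthogonalized in place
    V = MMatrix{M, M, Float64}(I)   # accumulates the right rotations
    for _ in 1:30                   # fixed number of sweeps for simplicity
        off = 0.0
        for p in 1:M-1, q in p+1:M
            α = dot(view(U, :, p), view(U, :, p))
            β = dot(view(U, :, q), view(U, :, q))
            γ = dot(view(U, :, p), view(U, :, q))
            off = max(off, abs(γ))
            abs(γ) < eps() * sqrt(α * β) && continue
            ζ = (β - α) / (2γ)
            t = sign(ζ) / (abs(ζ) + sqrt(1 + ζ^2))   # tangent of the rotation angle
            c = 1 / sqrt(1 + t^2)
            s = c * t
            for i in 1:N   # rotate columns p and q of U
                up = U[i, p]
                U[i, p] = c * up - s * U[i, q]
                U[i, q] = s * up + c * U[i, q]
            end
            for i in 1:M   # apply the same rotation to V
                vp = V[i, p]
                V[i, p] = c * vp - s * V[i, q]
                V[i, q] = s * vp + c * V[i, q]
            end
        end
        off < 1e-14 && break
    end
    σ = SVector{M}(ntuple(j -> norm(view(U, :, j)), M))   # singular values
    Un = SMatrix{N, M}(U) ./ σ'                           # normalize columns
    return Un, σ, SMatrix{M, M}(V)                        # A ≈ Un * Diagonal(σ) * V'
end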

ronisbr commented 4 months ago

Just for comparison, I implemented a QR-based SVD decomposition and it was 50% faster than the current svd for 4×4 matrices. However, it became slower once the dimension exceeded 7. I have no idea if such an approach would be good here. What do you think?

ronisbr commented 4 months ago

Eigen uses the complete orthogonal decomposition to compute the pseudo-inverse, which seems a little better than computing the SVD for this application. However, it requires a pivoted QR, which StaticArrays also computes by converting to Matrix and using the algorithm in Base :)

Now, let's see how difficult it is to implement pivoted QR here.
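For illustration, the full-column-rank core of that idea can be sketched with plain LinearAlgebra (this still allocates; the rank-deficient case is what the complete orthogonal decomposition handles by further reducing the trailing block of R):

using LinearAlgebra

function pinv_qr(A::AbstractMatrix)
    F = qr(A, ColumnNorm())       # pivoted QR: A * P = Q * R
    Qthin = Matrix(F.Q)           # thin Q factor, size(A, 1) × size(A, 2)
    return F.P * (F.R \ Qthin')   # A⁺ = P * R⁻¹ * Qᵀ for full column rank
end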

ronisbr commented 4 months ago

On second thought, although having a Julia version of most BLAS/LAPACK functions would be awesome, it is really a Herculean task. However, since Julia has such a good interface to BLAS/LAPACK (svd, qr, etc. all use functions from it), I was thinking: "why can't we have a direct interface here in StaticArrays that tries to avoid allocations?"

As of some Julia version, the compiler became smart enough to avoid all allocations when using an MMatrix, provided nothing escapes the scope and you convert to SMatrix at the end.
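A tiny illustration of that pattern (scale_by_two is just a hypothetical name, not from any package):

function scale_by_two(A::SMatrix{N, M, T}) where {N, M, T}
    B = MMatrix{N, M, T}(A)        # local mutable copy; does not escape
    @inbounds for i in eachindex(B)
        B[i] *= 2
    end
    return SMatrix{N, M, T}(B)     # converting back lets the MMatrix stay off the heap
end

Hence, I created a simple code to compute the SVD of static matrices using the same approach: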

using LinearAlgebra: BLAS  # the snippet also assumes StaticArrays is loaded

function blas_svd(A::StaticMatrix{N, M, T}) where {N, M, T}
    K = min(N, M)

    # Convert the input to an `MMatrix` and allocate the output buffers.
    Am  = MMatrix{N, M, T}(A)
    Um  = MMatrix{N, K, T}(undef)
    Sm  = MVector{K, T}(undef)
    Vtm = MMatrix{M, K, T}(undef)

    work  = MVector{1_000, Float64}(undef)
    lwork = -1
    info  = Ref{BLAS.BlasInt}(0)

    # Call the function to obtain the optimum size of the work array.
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8},
            Ref{UInt8},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S',    # JOBU: compute the first min(M, N) columns of U
        'S',    # JOBVT: compute the first min(M, N) rows of Vᵀ
        N,
        M,
        Am,
        N,      # leading dimension of Am
        Sm,
        Um,
        N,      # leading dimension of Um
        Vtm,
        M,      # leading dimension of Vtm
        work,
        lwork,  # -1 triggers the workspace-size query
        info
    )

    # Obtain the optimal work array size and compute the SVD.
    lwork = round(BLAS.BlasInt, work[1])
    ccall(
        (BLAS.@blasfunc(dgesvd_), BLAS.libblas),
        Cvoid,
        (
            Ref{UInt8},
            Ref{UInt8},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ptr{T},
            Ref{BLAS.BlasInt},
            Ref{BLAS.BlasInt},
        ),
        'S',
        'S',
        N,
        M,
        Am,
        N,
        Sm,
        Um,
        N,
        Vtm,
        M,
        work,
        lwork,
        info
    )

    # Convert the matrices to static arrays and return.
    U  = SMatrix{N, K, T}(Um)
    S  = SVector{K, T}(Sm)
    Vt = SMatrix{M, K, T}(Vtm)

    return U, S, Vt'
end

The result is the same as using svd for square matrices (I have not debugged non-square matrices yet):

julia> A = @SMatrix randn(6, 6)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
 -0.981944   -1.12979    0.310915    0.179723   -0.667651    0.182524
  1.09949     0.522666  -0.0878819  -0.0383786  -1.17082    -0.219113
 -1.07534     0.961367  -1.12684    -0.23116    -0.0602622  -1.38193
  0.0982899  -0.678585  -2.14742     0.164939   -1.12818     1.33526
 -0.60989     1.2469    -0.238621   -0.237353    0.695344   -0.514536
 -1.39276     0.353875  -0.551829   -0.283986    0.207744   -1.20008

julia> U1, S1, V1 = svd(A);

julia> U2, S2, V2 = blas_svd(A);

julia> U1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537  0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633  0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549  0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565  0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199  0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372  0.294871    0.246531    0.173895  -0.233973    0.720358

julia> U2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.127537  0.115343    0.695209    0.470797   0.492761   -0.150506
  0.155633  0.0389598  -0.624687    0.696828   0.231023    0.212336
 -0.542549  0.463331   -0.138643    0.219714  -0.173881   -0.627084
  0.479565  0.825676   -0.0673256  -0.264451   0.0609473   0.100476
 -0.432199  0.0425817  -0.20474    -0.379913   0.784298    0.100079
 -0.498372  0.294871    0.246531    0.173895  -0.233973    0.720358

julia> V1
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655     -0.331218   -0.695862    0.0226213   -0.422302    0.0109475
 -0.47389     -0.0241441  -0.658351   -0.137571     0.559031   -0.0998872
 -0.00418107  -0.872921    0.233316    0.290965     0.312066    0.038845
  0.135771    -0.0163176   0.071462    0.00309199   0.0228163  -0.987756
 -0.344044    -0.355242    0.125355   -0.780332    -0.359093   -0.0430901
  0.641491     0.0354597   0.0842224  -0.535699     0.531311    0.104279

julia> V2
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
  0.47655     -0.331218   -0.695862    0.0226213   -0.422302    0.0109475
 -0.47389     -0.0241441  -0.658351   -0.137571     0.559031   -0.0998872
 -0.00418107  -0.872921    0.233316    0.290965     0.312066    0.038845
  0.135771    -0.0163176   0.071462    0.00309199   0.0228163  -0.987756
 -0.344044    -0.355242    0.125355   -0.780332    -0.359093   -0.0430901
  0.641491     0.0354597   0.0842224  -0.535699     0.531311    0.104279

julia> S1
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

julia> S2
6-element SVector{6, Float64} with indices SOneTo(6):
 3.429126570487767
 2.790190909977814
 2.077297239613754
 1.3752132074075802
 0.4483791256059501
 0.10325942415912782

Now, let's analyze the allocation and execution time for different matrix sizes between the two approaches:

| Rows | Columns | `svd` time [μs] | `svd` alloc. | `blas_svd` time [μs] | `blas_svd` alloc. | Gain [%] |
|------|---------|-----------------|--------------|----------------------|------------------|----------|
| 2    | 2       | 0.954           | 8            | 0.64                 | 0                | 32.914   |
| 3    | 3       | 1.488           | 8            | 1.175                | 0                | 21.0349  |
| 4    | 4       | 2.296           | 8            | 1.908                | 0                | 16.899   |
| 5    | 5       | 3.042           | 8            | 2.653                | 0                | 12.7876  |
| 6    | 6       | 3.713           | 8            | 3.26                 | 0                | 12.2004  |
| 7    | 7       | 5.312           | 8            | 4.774                | 0                | 10.128   |
| 8    | 8       | 5.923           | 8            | 5.147                | 0                | 13.1015  |
| 9    | 9       | 7.25            | 8            | 6.683                | 0                | 7.82069  |
| 10   | 10      | 9.166           | 8            | 8.43                 | 0                | 8.02967  |

Notice that we get a noticeable speedup for small matrices, with zero allocations (my use case).

@mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?

For my project, I will implement this type of function to compute the pinv in my case.

mateuszbaran commented 4 months ago

@mateuszbaran If I can improve the algorithm to work for all matrix sizes, would this approach be useful for StaticArrays.jl?

Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to get a second opinion on whether it's worth it.

Directly calling BLAS would have to be done a bit more carefully than your blas_svd. First, I would limit it to blas_svd(A::StaticMatrix{N, M, Float64}).

Call the function to obtain the optimum size of the work array.

Querying doesn't make sense if you don't allocate the requested amount of memory on the heap. You have a fixed size there (1000 elements), and that is what you have to provide. I'd suggest a different approach, though: use the formula from the LAPACK docs, MAX(3*MIN(M,N) + MAX(M,N), 5*MIN(M,N)), or something larger; see here: https://docs.oracle.com/cd/E19422-01/819-3691/dgesvd.html .
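A sketch of that suggestion (lwork_gesvd is a hypothetical helper, not an existing API):

lwork_gesvd(m, n) = max(3 * min(m, n) + max(m, n), 5 * min(m, n))

# Inside blas_svd this would replace the workspace query, e.g.:
#     work  = MVector{lwork_gesvd(N, M), Float64}(undef)
#     lwork = length(work)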

I'd also suggest using LinearAlgebra.LAPACK.chklapackerror to detect any possible errors.
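For example (using the names from blas_svd above; chklapackerror throws when INFO is nonzero):

LinearAlgebra.LAPACK.chklapackerror(info[])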

ronisbr commented 4 months ago

Hi @mateuszbaran!

Thanks for the tips.

Hmm, that's a significant amount of additional code with relatively limited gains. I'd prefer to have a second opinion if it's worth it.

I see. Maybe I will create a package called StaticArrayBlasInterface to add those algorithms (which are important for my use case), and afterwards we can decide whether to migrate them here.

ronisbr commented 4 months ago

Hi @mateuszbaran!

I finished the initial code of the package (still needs polishing and error verification):

https://github.com/ronisbr/StaticArraysBlasInterfaces.jl

It is working perfectly for Float64 and Float32 matrices. I overloaded only the svd function, but it also provides an allocation-free pinv (because computing the SVD was the only allocating part of pinv).
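For example, one can check it like this (assuming the module name matches the repository; I am not showing timings here since they vary by machine):

julia> using StaticArrays, LinearAlgebra, BenchmarkTools, StaticArraysBlasInterfaces

julia> S = @SMatrix rand(3, 3);

julia> @btime pinv($S);  # should now report 0 allocations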

I also opened a thread in Discourse to discuss more about the implementation:

https://discourse.julialang.org/t/package-to-avoid-allocations-in-some-functions-when-using-staticarrays-jl/114539

Thanks!

mateuszbaran commented 4 months ago

Hi! Making a new package is a good solution for now. I'm not knowledgeable about every aspect of StaticArrays and linear algebra, and maintainer availability here is sometimes less than ideal, so working around it is sometimes necessary.

fredrikekre commented 4 months ago

I think it is fine to include it here. The BLAS interface file in LinearAlgebra has basically not changed since the start, and I expect the same for this new code, so it shouldn't require any maintenance. It is a strict improvement, and StaticArrays already depends on LinearAlgebra anyway.

mateuszbaran commented 4 months ago

OK, then let's put it in StaticArrays.jl.

fredrikekre commented 4 months ago

It would be good to at least measure the impact on loading StaticArrays with/without the additional code, though.

ronisbr commented 4 months ago

Perfect! I will submit the PR and perform all the analysis.

ronisbr commented 4 months ago

Done, @mateuszbaran and @fredrikekre! I submitted the PR for review. If it gets merged, I can also implement the same approach for other functions that allocate due to calling functions in Julia Base (like qr).

ronisbr commented 4 months ago

Thank you very much @mateuszbaran and @fredrikekre for the help adding this feature to StaticArrays.jl :)

mateuszbaran commented 4 months ago

Thank you for your contribution 🙂