Closed andreasvarga closed 11 months ago
@andreasvarga do you have an example in mind and maybe the skeleton of an interface?
For me the question is if there is any way to reuse a certain data cache, provided as an array, to ovelap with a structure, as that resulting from the QR decomposition to avoid further allocations. I have no solution for using either qr!
or LinearAlgebra.LAPACK.geqrf!
and preventing further allocations.
Is it something like that you have in mind?
A = randn(4, 4)
B = randn(5, 5)
ws1 = QRWs(A)
ws2 = QRWs(B, cache=ws1.work)
for i = 1:10
qr1 = LAPACK.geqrf!(ws1, A)
...
qr2 = LAPACK.geqrf!(ws2, A)
...
end
That shouldn't be too hard to implement
The concrete case is the for loop in:
where the functions qr!
and lmul!
are involved. For large values of p
(e.g., p >= 100
), there is a large number of allocations already in this loop. However, this function is called by the periodic Lyapunov equation solver, which in turn is called by an optimization routine for function/gradient evaluations. So, altogether, there is huge amount of allocated space, which would be desirable to be minimized/eliminated.
To simplify things, consider the following simple case. I would like to find an equivalent computational scheme to the following sequence between begin ... end
which performs NO allocations:
julia> using LinearAlgebra
julia> uu = rand(8,4); uk = rand(8,4); ud = rand(4,4);
julia> @allocated begin F = qr!(uu); copyto!(ud,F.R); lmul!(F.Q',uk); end
864
I wonder if this is possible at all!
Now I understand. There is a problem in our example here https://dynarejulia.github.io/FastLapackInterface.jl/dev/workspaces/#QR-id
LAPACK.geqrf!(ws, A)
allocates only 32 bits but QR(LAPACK.geqrf!(ws, A)...)
allocates a lot more
I will see what I can do
Here is a way to do what you want:
using FastLapackInterface
qr_ws = QRWs(uu)
ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, zeros(8, 4))
function set_R!(R::Matrix, U::Matrix)
m, n = size(U)
p = LinearAlgebra.checksquare(R)
@assert p == min(m, n)
fill!(R, 0.0)
for i in 1:p
for j in 1:i
R[j, i] = U[j, i]
end
end
return nothing
end
set_R!(ud, uu)
@allocated begin
LAPACK.geqrf!(qr_ws, uu)
set_R!(ud, uu)
LAPACK.ormqr!(ormqr_ws, 'L', 'T', uu, uk)
end
``
For some reasons I get an error when executing:
julia> ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, zeros(8, 4))
ERROR: UndefVarError: `QROrmWs` not defined
Stacktrace:
[1] top-level scope
@ REPL[12]:1
The same with:
julia> ormqr_ws = FastLapackInterface.QROrmWs(qr_ws, 'L', 'T', uu, zeros(8, 4))
ERROR: UndefVarError: `QROrmWs` not defined
Stacktrace:
[1] getproperty(x::Module, f::Symbol)
@ Base .\Base.jl:31
[2] top-level scope
@ REPL[11]:1
Just for check, I have
julia> qr_ws
QRWs{Float64}
work: 128-element Vector{Float64}
τ: 4-element Vector{Float64}
Are you sure that you have FastLapackInterface v2.0 ?
Check please that QROrmWs
is defined on line 396 or qr.jl
Note also that the O of QROmWs
is the letter O and not the the number 0. It isn't easy to see the difference on GitHub website
I updated to v2.0.
I obtained
@allocated begin
LAPACK.geqrf!(qr_ws, uu)
set_R!(ud, uu)
LAPACK.ormqr!(ormqr_ws, 'L', 'T', uu, uk)
end
32
because of
julia> @allocated LAPACK.geqrf!(qr_ws, uu)
32
so not quite 0 allocations.
Yes. I don't know where the 32 bits come from. I will have to spend more time on it.
I was looking at the figures for the allocated work spaces (see below):
julia> qr_ws = QRWs(uu)
QRWs{Float64}
work: 128-element Vector{Float64}
τ: 4-element Vector{Float64}
julia> ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, zeros(8,4))
QROrmWs{Float64}
work: 4288-element Vector{Float64}
τ: 4-element Vector{Float64}
Somehow, the space allocated with QROrmWs
seems to me being exagerated. Even the following is somehow curious:
julia> ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, zeros(8,0))
QROrmWs{Float64}
work: 4192-element Vector{Float64}
τ: 4-element Vector{Float64}
In the LAPACK routine dormqr
there is a minimum workspace for LWORK:
LWORK is INTEGER
The dimension of the array WORK.
If SIDE = 'L', LWORK >= max(1,N);
if SIDE = 'R', LWORK >= max(1,M).
For good performance, LWORK should generally be larger.
Here is a more extreme case:
julia> qr_ws = QRWs([1.;;])
QRWs{Float64}
work: 32-element Vector{Float64}
τ: 1-element Vector{Float64}
julia> ormqr_ws = QROrmWs(qr_ws, 'L', 'T', [1.;;], [1.;;])
QROrmWs{Float64}
work: 4192-element Vector{Float64}
τ: 1-element Vector{Float64}
Would it be possible to optionally enforce the minimum amount of workspace allocated by QROrmWs
?
I implemented the above mentioned for loop
The concrete case is the for loop in:
using the scheme you proposed and obtained no allocations! This is much better than the original code!
Great. I'm glad it is useful
/@andreasvarga writes
julia> qr_ws = QRWs([1.;;]) QRWs{Float64} work: 32-element Vector{Float64} τ: 1-element Vector{Float64}
julia> ormqr_ws = QROrmWs(qr_ws, 'L', 'T', [1.;;], [1.;;]) QROrmWs{Float64} work: 4192-element Vector{Float64} τ: 1-element Vector{Float64}
Would it be possible to optionally enforce the minimum amount of workspace allocated by `QROrmWs`? -----------------
The required size is computed by the Lapack function with info=-1. The large size is computed bu OpenBlas. If one uses MKL, the size is much smaller:
julia> qr_ws = QRWs([1.;;])
QRWs{Float64}
work: 1-element Vector{Float64}
τ: 1-element Vector{Float64}
julia> ormqr_ws = QROrmWs(qr_ws, 'L', 'T', [1.;;], [1.;;])
BlasInt(real(ormws.work[1])) = 1
QROrmWs{Float64}
work: 1-element Vector{Float64}
τ: 1-element Vector{Float64}
With FastLapackInterface this allocation happens rarely in comparison with the number of time ormqr!
is used inside a loop. So maybe we can live with this slight inefficiency. I can't imagine a scenario where one would make hundreds of allocations with QROmrWs
With FastLapackInterface this allocation happens rarely in comparison with the number of time
ormqr!
is used inside a loop. So maybe we can live with this slight inefficiency. I can't imagine a scenario where one would make hundreds of allocations with QROmrWs
Unfortunately, this is exacly my case. The routine where the allocations of workspaces occur solves an n1xn2 periodic Sylvester equation with n1 = 1 or 2 and n2 = 1 or 2. This routine is called several times in a periodic Lyapunov solver, where n1 and n2 are the dimensions of diagonal blocks in a periodic Schur form of a product of N matrices (N usually larger than 100). If K is the number of diagonal blocks, this implies K*(K+1)/2 times the workspace allocation (in my problem K = 3, with two 1x1 blocks and one 2x2 block), The periodic Lyapunov solver is called to evaluate the function and gradient, which requires to solve my problem with Optim.jl
:
Iterations: 436
f(x) calls: 1161
∇f(x) calls: 1161
So a typical run involves more than 6*1161 times the allocations of work spaces. Thanks Julia, the computations are still very fast for my problem, and these are the final timing and allocation figures:
3.785461 seconds (5.36 M allocations: 1.297 GiB, 3.11% gc time)
Before starting with the reduction of allocations, all figures where 10 times larger. I think still there is room to reduce the number of allocations!
BTW: I tried instead
qr_ws = QRWs(uu)
ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, uu)
to allocate directly the minimum necessary storage with
qr_ws = QRWs(zeros(n12), zeros(n12))
ormqr_ws = QROrmWs(zeros(n12), zeros(n12))
where n12 = n1*n2. But, for some reasons, this doesn't work. Any idea what could be the cause?
Unfortunately, this is exacly my case. The routine where the allocations of workspaces occur solves an n1xn2 periodic Sylvester equation with n1 = 1 or 2 and n2 = 1 or 2. This routine is called several times in a periodic Lyapunov solver, where n1 and n2 are the dimensions of diagonal blocks in a periodic Schur form of a product of N matrices (N usually larger than 100). If K is the number of diagonal blocks, this implies K*(K+1)/2 times the workspace allocation (in my problem K = 3, with two 1x1 blocks and one 2x2 block), The periodic Lyapunov solver is called to evaluate the function and gradient, which requires to solve my problem with Optim.jl:
One radical solution would be to instantiate qr_ws
and ormqr_ws
before calling Optim
and passing it to the objective function and functions downstream. That is what we tried to do in Dynare.jl
But given the weird initialization proposed by OpenBlas
I can see the need for choosing optionally for the minimum allocation instead of the socalled optimal one.
BTW: I tried instead
qr_ws = QRWs(uu) ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, uu)
to allocate directly the minimum necessary storage with
qr_ws = QRWs(zeros(n12), zeros(n12)) ormqr_ws = QROrmWs(zeros(n12), zeros(n12))
where n12 = n1*n2. But, for some reasons, this doesn't work. Any idea what could be the cause?
Can you please post an example that isn't working?
With the first approach I am getting the following results at different calls
t = @allocated begin
qr_ws = QRWs(uu)
ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, uu)
end
@show t, size(uu)
(t, size(uu)) = (35728, (8, 4))
(t, size(uu)) = (34672, (4, 2))
(t, size(uu)) = (34064, (2, 1))
With the direct allocation approach I am getting
t = @allocated begin
qr_ws = QRWs(zeros(n22), zeros(n12))
ormqr_ws = QROrmWs(zeros(n12), zeros(n12))
end
@show t, n12, n22
(t, n12, n22) = (416, 4, 8)
and the calling routine cycles indefinitely. So, something is wrong behind the scenes, but I have no idea how to debug.
Therefore, I would like to know if this second approach is in principle feasible or there are restrictions for its use?
What is the calling routine that cycles indefinitely? I want to be able to replicate the problem on my computer
I don't think it will be possible to reproduce the error on your machine. I am trying to detect the place where the differences between the correct run and the new one manifest. I hope to find a non-working example.
A suggestion: you know it runs when you allocate ws and ormqr_ws in a function just before the loop. So first, allocate ws and ormqr_ws in the function calling that function and test again. By allocating ws and ormqr_ws one step higher up every time in the chain of function calls, you may find where does the problem start
Here is an example which, in my opinion, should work with minimum amount of allocated storage. The first example executes correctly:
using LinearAlgebra, FastLapackInterface
uu = rand(8,4);
qr_ws = QRWs(uu)
ormqr_ws = QROrmWs(qr_ws, 'L', 'T', uu, uu)
UU = copy(uu)
F = qr!(UU)
LAPACK.geqrf!(qr_ws, uu)
@show norm(uu-UU)
zmi = rand(8,4);
TT = copy(zmi)
lmul!(F.Q',TT)
LAPACK.ormqr!(ormqr_ws, 'L', 'T', uu, zmi)
@show norm(zmi-TT)
julia> @show norm(zmi-TT)
norm(zmi - TT) = 6.784521034741873e-16
6.784521034741873e-16
The example with minimal memory allocation fails in applying the transformation:
using LinearAlgebra, FastLapackInterface
uu = rand(8,4);
qr_ws = QRWs(zeros(8), zeros(4))
ormqr_ws = QROrmWs(zeros(4), zeros(4))
UU = copy(uu)
F = qr!(UU)
LAPACK.geqrf!(qr_ws, uu)
@show norm(uu-UU)
zmi = rand(8,4);
TT = copy(zmi)
lmul!(F.Q',TT)
LAPACK.ormqr!(ormqr_ws, 'L', 'T', uu, zmi)
@show norm(zmi-TT)
julia> @show norm(zmi-TT)
norm(zmi - TT) = 4.280785896462212
4.280785896462212
Thanks! I will look at it
I found it: you need to replace
ormqr_ws = QROrmWs(zeros(4), zeros(4))
by
ormqr_ws = QROrmWs(zeros(4), qr_ws.τ)
qr_ws
and ormqr_ws
share the same field τ
. See line 449 of qr.jl
3.785461 seconds (5.36 M allocations: 1.297 GiB, 3.11% gc time)
Here are the new figures with your correction:
3.901344 seconds (5.32 M allocations: 795.979 MiB, 2.69% gc time)
A net redution of about 500MiB! I will try to further polish my code to achieve the maximum reduction with help of FastLapackInterface
.
Many thanks for your support and the invested time.
An idea, which perhaps would be useful for small problems, is to provide minimum size workspaces in conjunction with wrappers to level 2 BLAS based LAPACK routines (e.g., in this case DGEQR2 and DORM2R). For small problems, there is no need to try first to use blocking based routines and allocate unused working spaces.
Just in case you would like to make available in your package the function set_R!
, here is a more general version, which is fully (I hope) compatible with qr!(A)
(also for A
a wide matrix)
function set_R!(R::AbstractMatrix, U::AbstractMatrix)
m, n = size(U)
if m >= n
p = LinearAlgebra.checksquare(R)
@assert p == n
pm = n
else
p = m
@assert (m, n) == size(R)
pm = n
end
ZERO = zero(eltype(R))
#fill!(R, 0.0)
for j in 1:p
R[j,j] = U[j,j]
for i in 1:j-1
R[i,j] = U[i,j]
R[j,i] = ZERO
end
end
pm > p && copyto!(view(R,:,p+1:n),view(U,:,p+1:n))
return nothing
end
I finished polishing the codes. The last figures for a typical execution are quite satisfactory (see previous figures above):
3.468958 seconds (476.60 k allocations: 147.722 MiB, 0.35% gc time)
I will close this issue. Once again many thanks for your support.
A functionality which would be useful is to allow to reuse storage already allocated for other purposess.