Closed lruthotto closed 5 years ago
That is a quite large speedup. We should try to improve here. Extracting the factors like you do here is one option but it might be faster to use umfpack_dl_wsolve
. The documentation states that
When you have many linear systems to solve, this routine is faster than
umfpack_*_solve, since the workspace (Wi, W) needs to be allocated only
once, prior to calling umfpack_*_wsolve.
sounds interesting, but I don't know umfpack_dl_wsolve
. Is that a Julia function? If you code an example, I'd be happy to run some comparisons. Note that the version I posted up here also works for sparse right hand sides. Do you know if that is also the case in umfpack_dl_wsolve
?
On the same note. A big bottleneck still are the triangular solves (see fwdTriSolve!
and bwdTriSolve!
in linalg.jl). Is there any way to parallelize them for multiple right hand sides? Would shared array work?
I've just taken a look at this and the reason that the default solve is slow is because iterative refinements of the solutions are enabled by default in UMFPACK. You can try it out with
SparseArrays.UMFPACK.umf_ctrl[8] = 0
I get
julia solve
julia> @time t1 = luA\rhs
23.085403 seconds (28.54 k allocations: 184.735 MB, 0.58% gc time)
julia> @time t2 = myLUsolve(luA,rhs)
time to extract LUfactors
1.528154 seconds (7.95 k allocations: 548.319 MB, 8.36% gc time)
time to solve for 300 right hand sides
13.622108 seconds (504.26 k allocations: 136.302 MB, 0.96% gc time)
15.343389 seconds (762.59 k allocations: 696.246 MB, 1.72% gc time)
julia> SparseArrays.UMFPACK.umf_ctrl[8] = 0
0
julia> @time t3 = luA\rhs;
7.729600 seconds (609 allocations: 91.562 MB, 0.34% gc time)
We might want to expose this through a function. I also tried umfpack_dl_wsolve
but the allocation time is insignificant so it doesn't really change anything.
Thanks for the hint. I see a similar speedup now when deactivating iterative refinements. I do see additional room for speedup here (or elsewhere) though if we could have an OpenMP type parallelism for the upper- and lower-triangular solves. UMFpack seems to be using only one thread on my machine on the solve stage.
We kind of have that option with Threads.@threads
. The function mysolve2
does a @threads
for
loop over the columns of rhs
julia> @time MyUMF.mysolve(luA, rhs);
7.381999 seconds (610 allocations: 23.374 MB, 0.02% gc time)
julia> @time MyUMF.mysolve2(luA, rhs);
2.049506 seconds (1.64 k allocations: 142.510 MB, 3.53% gc time)
However, it looks like some of the functions in the UMFPACK
have a too restrictive signature since they don't allow for view
s as input. We should change that. You should also be a bit careful when using Julia's threads because more than a single level of @threads
for
loops is not working.
Thanks for the hint. I looked at threads now. This would be a great addition to many of our codes. However, so far it does not seem to work well. There is a lot of allocation added also for my example (that doesnt involve UMF)
See this example:
println("Number Of Threads : $(Threads.nthreads())")
function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
# forward substitution for CSC matrices
nrowB, ncolB = size(B, 1), size(B, 2)
ncol = LinAlg.checksquare(A)
if nrowB != ncol
throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
end
aa = A.nzval
ja = A.rowval
ia = A.colptr
Threads.@threads for k = 1:ncolB
for j = 1:nrowB
i1 = ia[j]
i2 = ia[j + 1] - 1
# loop through the structural zeros
ii = i1
jai = ja[ii]
while ii <= i2 && jai < j
ii += 1
jai = ja[ii]
end
# check for zero pivot and divide with pivot
if jai == j
bj = B[ jai,k]/aa[ii]
B[jai,k] = bj
ii += 1
else
throw(LinAlg.SingularException(j))
end
# update remaining part
for i = ii:i2
B[ ja[i],k] -= bj*aa[i]
end
end
end
B
end
A = tril(sprandn(10000,10000,0.00002) + UniformScaling(100))
rhs = randn(size(A,1),20)
t1 = copy(rhs)
t2 = copy(rhs)
println("no-thread")
@time t1 = A\t1
println("with thread")
@time t2 = fwdTriSolve2!(A,t2)
println("\nerr: $(norm(t1-t2)./norm(t1))")
Running this (and repeating a couple of times) I get this:
no-thread
0.002094 seconds (7 allocations: 1.526 MB)
with thread
0.131745 seconds (1.25 M allocations: 20.965 MB)
Try
function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
# forward substitution for CSC matrices
nrowB, ncolB = size(B, 1), size(B, 2)
ncol = LinAlg.checksquare(A)
if nrowB != ncol
throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
end
aa = A.nzval
ja = A.rowval
ia = A.colptr
Threads.@threads for k = 1:ncolB
do_stuff(aa, ja, ia, k, nrowB)
end
B
end
function do_stuff(aa, ja, ia, k, nrowB)
for j = 1:nrowB
i1 = ia[j]
i2 = ia[j + 1] - 1
# loop through the structural zeros
ii = i1
jai = ja[ii]
while ii <= i2 && jai < j
ii += 1
jai = ja[ii]
end
# check for zero pivot and divide with pivot
if jai == j
bj = B[ jai,k]/aa[ii]
B[jai,k] = bj
ii += 1
else
throw(LinAlg.SingularException(j))
end
# update remaining part
for i = ii:i2
B[ ja[i],k] -= bj*aa[i]
end
end
end
julia> @time t2 = fwdTriSolve2!(A,t2);
0.000126 seconds (17 allocations: 416 bytes)
Thanks for this suggestion. I tried it, but the result was incorrect. I assume you want to provide B
also to do_stuff
. Doing this, the result is correct, but the timing is less impressive. For 4 threads I'm getting
no-thread
1.419104 seconds (7 allocations: 152.588 MB, 0.86% gc time)
with thread
0.690881 seconds (19.85 k allocations: 153.461 MB, 10.19% gc time)
So, it's much better than what I previous had.
Yeah, I messed up, but the point was to put everything inside the thread macro in its own function.
Short term it is worthwhile delving into the internals of UMFPack and SuiteSparse in general to see how to speed things up. But long term, for both the license implications and for the sanity of those trying to interface to SuiteSparse, it is better to create an alternative implementation in Julia.
I've opened https://github.com/JuliaLang/julia/pull/19511 to make it easier to use threads for this. With that change, you can just use A_ldiv_B!
and a threaded loop over the columns of B
. E.g.
function mysolve{T}(F::UmfpackLU{T}, B::Matrix{T})
n = checksquare(F)
n == size(B, 1) || error("")
X = similar(B)
@threads for j in 1:size(B,2)
A_ldiv_B!(view(X,:,j), F, view(B,:,j))
end
return X
end
I don't think extracting the factors of the LU will give a speed up.
Please reopen if something further can be done here.
The LU factorization (through UMFpack) currently implemented in julia 0.5 is rather slow when using it for multiple right hand sides. Looking at umfpack.jl lines 259 you can see that there is a loop over all right hand sides.
As far as I could see UMFpack does not seem to have a built-in option for multiple rhs. So, i followed the MATLAB code here to build one.
It works quite nicely and I'd be pleased to contribute that to Base. To this end, it would be great to get some help and advice. For example, I'm not sure where to put the code. Also, I'm not sure why it takes so long to get the parts of the lufactorization.
See this example here:
On my machine, Ubuntu 14, julia 0.5.1, I get the following timings