JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Speed up UMFpack LU for multiple (sparse) right hand sides #19500

Closed lruthotto closed 5 years ago

lruthotto commented 7 years ago

The LU factorization (through UMFPACK) as currently implemented in Julia 0.5 is rather slow when used for multiple right hand sides. Looking at umfpack.jl around line 259, you can see that there is a loop over all right hand sides.

As far as I could see, UMFPACK does not have a built-in option for multiple right hand sides, so I followed the MATLAB code here to build one.

It works quite nicely, and I'd be pleased to contribute it to Base. To this end, some help and advice would be great. For example, I'm not sure where to put the code, and I'm also not sure why it takes so long to extract the parts of the LU factorization.

See this example here:

function myLUsolve(A::Base.SparseArrays.UMFPACK.UmfpackLU,B)

    # get the parts of the LU factorization: L, U, the row/column permutations
    # p and q, and the row scaling R, with L*U == (R .* A)[p,q]
    println("time to extract LUfactors")
    @time (L,U,p,q,R) = A[:(:)];

    println("time to solve for $(size(B,2)) right hand sides")
    @time begin
        X = zeros(eltype(B),size(B))
        # solve L*U*X[q,:] = (R .* B)[p,:] for all right hand sides at once
        X[q,:] = U\(L\((Diagonal(R) * B)[p,:]))
    end
    return X
end

nrhs = 300
n    = 10000
A    = UniformScaling(10) + sprandn(n,n,0.0005);
rhs  = randn(n,nrhs);

println("=== lufact(A)\\rhs for  real matrix ===")
println("\t factorize")
@time luA = lufact(A)
println("\t julia solve")
@time t1 =  luA\rhs
println("\t my solve")
@time t2 = myLUsolve(luA,rhs)
println("err: $(norm(t1-t2)/norm(t1))")

On my machine (Ubuntu 14, Julia 0.5.1) I get the following timings:

=== lufact(A)\rhs for  real matrix ===
     factorize
  5.479585 seconds (66 allocations: 536.308 MB, 0.64% gc time)
     julia solve
 29.411826 seconds (613 allocations: 183.115 MB, 0.82% gc time)
     my solve
time to extract LUfactors
  1.517049 seconds (42 allocations: 562.236 MB, 5.83% gc time)
time to solve for 300 right hand sides
 14.070643 seconds (31 allocations: 114.442 MB, 0.96% gc time)
 15.619639 seconds (7.67 k allocations: 677.021 MB, 1.44% gc time)
err: 1.5411555266919556e-12
andreasnoack commented 7 years ago

That is quite a large speedup; we should try to improve here. Extracting the factors as you do is one option, but it might be faster to use umfpack_dl_wsolve. The documentation states that

When you have many linear systems to solve, this routine is faster than
    umfpack_*_solve, since the workspace (Wi, W) needs to be allocated only
    once, prior to calling umfpack_*_wsolve.
lruthotto commented 7 years ago

Sounds interesting, but I don't know umfpack_dl_wsolve. Is that a Julia function? If you code up an example, I'd be happy to run some comparisons. Note that the version I posted above also works for sparse right hand sides. Do you know if that is also the case for umfpack_dl_wsolve?

lruthotto commented 7 years ago

On the same note: a big bottleneck is still the triangular solves (see fwdTriSolve! and bwdTriSolve! in linalg.jl). Is there any way to parallelize them for multiple right hand sides? Would shared arrays work?

andreasnoack commented 7 years ago

I've just taken a look at this, and the reason the default solve is slow is that iterative refinement of the solution is enabled by default in UMFPACK. You can try it out with

SparseArrays.UMFPACK.umf_ctrl[8] = 0

I get

  julia solve
julia> @time t1 =  luA\rhs
 23.085403 seconds (28.54 k allocations: 184.735 MB, 0.58% gc time)
julia> @time t2 = myLUsolve(luA,rhs)
time to extract LUfactors
  1.528154 seconds (7.95 k allocations: 548.319 MB, 8.36% gc time)
time to solve for 300 right hand sides
 13.622108 seconds (504.26 k allocations: 136.302 MB, 0.96% gc time)
 15.343389 seconds (762.59 k allocations: 696.246 MB, 1.72% gc time)
julia> SparseArrays.UMFPACK.umf_ctrl[8] = 0
0

julia> @time t3 =  luA\rhs;
  7.729600 seconds (609 allocations: 91.562 MB, 0.34% gc time)

We might want to expose this through a function. I also tried umfpack_dl_wsolve, but the allocation time is insignificant, so it doesn't really change anything.
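
A minimal sketch of what such a function could look like (the name without_refinement is hypothetical and not in Base; it only toggles the umf_ctrl entry used above, where index 8 is assumed to be the iterative-refinement setting):

function without_refinement(f::Function)
    # temporarily disable UMFPACK's iterative refinement around a block of
    # solves and restore the previous setting afterwards
    old = Base.SparseArrays.UMFPACK.umf_ctrl[8]
    Base.SparseArrays.UMFPACK.umf_ctrl[8] = 0
    try
        return f()
    finally
        Base.SparseArrays.UMFPACK.umf_ctrl[8] = old
    end
end

# usage with do-block syntax:
# X = without_refinement() do
#     luA \ rhs
# end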

lruthotto commented 7 years ago

Thanks for the hint. I see a similar speedup now when deactivating iterative refinement. I do see additional room for speedup here (or elsewhere), though, if we could have OpenMP-style parallelism for the upper- and lower-triangular solves. UMFPACK seems to be using only one thread on my machine during the solve stage.

andreasnoack commented 7 years ago

We kind of have that option with Threads.@threads. The function mysolve2 does a @threads for loop over the columns of rhs:

julia> @time MyUMF.mysolve(luA, rhs);
  7.381999 seconds (610 allocations: 23.374 MB, 0.02% gc time)

julia> @time MyUMF.mysolve2(luA, rhs);
  2.049506 seconds (1.64 k allocations: 142.510 MB, 3.53% gc time)

However, it looks like some of the functions in the UMFPACK wrappers have too restrictive signatures, since they don't allow views as input. We should change that. You should also be a bit careful when using Julia's threads, because nesting more than a single level of @threads for loops does not work.
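
mysolve2 itself is not shown in the thread. A minimal sketch of the idea, assuming each column is copied into a plain Vector so that no view reaches the UMFPACK wrappers (the name threaded_colsolve is made up):

function threaded_colsolve(F::Base.SparseArrays.UMFPACK.UmfpackLU, B::Matrix{Float64})
    # threaded loop over the columns of the right hand side; B[:, j] makes a
    # copy, so no view is ever passed to the UMFPACK wrappers
    X = similar(B)
    Threads.@threads for j in 1:size(B, 2)
        X[:, j] = F \ B[:, j]
    end
    return X
end

# usage: X = threaded_colsolve(luA, rhs)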

lruthotto commented 7 years ago

Thanks for the hint. I've looked at threads now; this would be a great addition to many of our codes. However, so far it does not seem to work well. There is a lot of added allocation, also in my example (which doesn't involve UMFPACK).

See this example:

println("Number Of Threads : $(Threads.nthreads())")

function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
# forward substitution for CSC matrices
    nrowB, ncolB  = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end

    aa = A.nzval
    ja = A.rowval
    ia = A.colptr

    Threads.@threads for k = 1:ncolB
        for j = 1:nrowB
            i1 = ia[j]
            i2 = ia[j + 1] - 1

            # loop through the structural zeros
            ii = i1
            jai = ja[ii]
            while ii <= i2 && jai < j
                ii += 1
                jai = ja[ii]
            end

            # check for zero pivot and divide with pivot
            if jai == j
                bj = B[ jai,k]/aa[ii]
                B[jai,k] = bj
                ii += 1
            else
                throw(LinAlg.SingularException(j))
            end

            # update remaining part
            for i = ii:i2
                B[ ja[i],k] -= bj*aa[i]
            end
        end
    end
    B
end

A = tril(sprandn(10000,10000,0.00002) + UniformScaling(100))
rhs = randn(size(A,1),20)
t1 = copy(rhs)
t2 = copy(rhs)

println("no-thread")
@time t1 = A\t1

println("with thread")
@time t2 = fwdTriSolve2!(A,t2)

println("\nerr: $(norm(t1-t2)./norm(t1))")

Running this (and repeating a couple of times) I get this:

no-thread
  0.002094 seconds (7 allocations: 1.526 MB)
with thread
  0.131745 seconds (1.25 M allocations: 20.965 MB)
KristofferC commented 7 years ago

Try

function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
# forward substitution for CSC matrices
    nrowB, ncolB  = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end

    aa = A.nzval
    ja = A.rowval
    ia = A.colptr

    Threads.@threads for k = 1:ncolB
        do_stuff(aa, ja, ia, k, nrowB)
    end
    B
end

function do_stuff(aa, ja, ia, k, nrowB)
    for j = 1:nrowB
        i1 = ia[j]
        i2 = ia[j + 1] - 1

        # loop through the structural zeros
        ii = i1
        jai = ja[ii]
        while ii <= i2 && jai < j
            ii += 1
            jai = ja[ii]
        end

        # check for zero pivot and divide with pivot
        if jai == j
            bj = B[ jai,k]/aa[ii]
            B[jai,k] = bj
            ii += 1
        else
            throw(LinAlg.SingularException(j))
        end

        # update remaining part
        for i = ii:i2
            B[ ja[i],k] -= bj*aa[i]
        end
    end
end
julia> @time t2 = fwdTriSolve2!(A,t2);
  0.000126 seconds (17 allocations: 416 bytes)
lruthotto commented 7 years ago

Thanks for this suggestion. I tried it, but the result was incorrect; I assume you also want to pass B to do_stuff. Doing this, the result is correct, but the timing is less impressive. With 4 threads I'm getting

no-thread
  1.419104 seconds (7 allocations: 152.588 MB, 0.86% gc time)
with thread
  0.690881 seconds (19.85 k allocations: 153.461 MB, 10.19% gc time)

So it's much better than what I previously had.
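
For reference, a hedged sketch of the corrected variant (the names fwdTriSolve3! and solve_column! are made up; the only change relative to the suggestion above is that B is passed through to the inner function):

function fwdTriSolve3!(A::SparseMatrixCSC, B::AbstractVecOrMat)
    # forward substitution for CSC matrices, one thread per right hand side
    nrowB, ncolB = size(B, 1), size(B, 2)
    ncol = Base.LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end

    aa = A.nzval
    ja = A.rowval
    ia = A.colptr

    Threads.@threads for k = 1:ncolB
        solve_column!(B, aa, ja, ia, k, nrowB)
    end
    B
end

function solve_column!(B, aa, ja, ia, k, nrowB)
    for j = 1:nrowB
        i1 = ia[j]
        i2 = ia[j + 1] - 1

        # skip entries above the diagonal
        ii = i1
        while ii <= i2 && ja[ii] < j
            ii += 1
        end

        # check for a zero pivot and divide by the pivot
        if ja[ii] == j
            bj = B[j, k] / aa[ii]
            B[j, k] = bj
            ii += 1
        else
            throw(Base.LinAlg.SingularException(j))
        end

        # update the rest of B's column k with the entries below the diagonal
        for i = ii:i2
            B[ja[i], k] -= bj * aa[i]
        end
    end
    B
end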

KristofferC commented 7 years ago

Yeah, I messed up, but the point was to put everything inside the thread macro in its own function.

dmbates commented 7 years ago

Short term, it is worthwhile delving into the internals of UMFPACK, and SuiteSparse in general, to see how to speed things up. But long term, both for the license implications and for the sanity of those trying to interface with SuiteSparse, it would be better to create an alternative implementation in Julia.

andreasnoack commented 7 years ago

I've opened https://github.com/JuliaLang/julia/pull/19511 to make it easier to use threads for this. With that change, you can just use A_ldiv_B! and a threaded loop over the columns of B, e.g.:

using Base.Threads
using Base.SparseArrays.UMFPACK: UmfpackLU
using Base.LinAlg: checksquare

function mysolve{T}(F::UmfpackLU{T}, B::Matrix{T})
    n = checksquare(F)
    n == size(B, 1) || error("")
    X = similar(B)
    @threads for j in 1:size(B,2)
        A_ldiv_B!(view(X,:,j), F, view(B,:,j))
    end
    return X
end

I don't think extracting the LU factors will give a speedup.

ViralBShah commented 5 years ago

Please reopen if something further can be done here.