JuliaLinearAlgebra / IterativeSolvers.jl

Iterative algorithms for solving linear systems, eigensystems, and singular value problems
MIT License

Minimal changes to make LOBPCG work on the GPU #228

Closed mohamed82008 closed 5 years ago

mohamed82008 commented 5 years ago

This PR makes the minimal set of changes needed to get the LOBPCG algorithm working on GPUArrays. With this PR, the following just works:

using LinearAlgebra, CuArrays, IterativeSolvers

n = 1000;
A = CuArray(rand(n, n)); A = A + A' + 2*n*I;
X = CuArray(rand(n, 2));
iterator = LOBPCGIterator(A, true, X);
result = lobpcg!(iterator, maxiter=n÷2)
#=
Results of LOBPCG Algorithm
 * Algorithm: LOBPCG - CholQR
 * λ: [3000.625943844625,2025.7587087557938]
 * Residual norm(s): [1.0455519186709642e-5,1.9208507426245153e-5]
 * Convergence
   * Iterations: 98
   * Converged: true
   * Iterations limit: 500
=#

typeof(result.X)
#CuArray{Float64,2}

Notice that small matrix operations are all done on the CPU on purpose, and there are some unoptimized allocations.
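
For illustration, this is the pattern (a minimal sketch, not the actual code in this PR): the large products stay on the device, while the small Gram matrices are copied to the host, where the dense eigenproblem is solved more cheaply than by launching GPU kernels.

using LinearAlgebra, CuArrays

n, k = 1000, 2
A = CuArray(rand(n, n)); A = A + A' + 2*n*I;
X = CuArray(rand(n, k));

AX = A * X              # O(n^2 k) product, runs on the GPU
G = Array(X' * AX)      # k-by-k Gram matrix, copied back to the host
F = eigen(Symmetric(G)) # small dense eigenproblem, solved on the CPU
F.values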

Supporting DistributedArrays should also be fairly straightforward if one extends the gmul functions for DistributedArrays, but that would mean taking a dependency on DistributedArrays. If there were a lightweight AbstractDistributedArrays package, it would be nicer to use that instead.
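
For concreteness, a very rough sketch of what such an extension could look like. The exact gmul signature is a guess on my part (assumed to mirror mul!(C, A, B)); the real interface in this PR may differ:

using DistributedArrays, LinearAlgebra

# Hypothetical signature, assumed to mirror LinearAlgebra.mul!(C, A, B).
# Naive placeholder: gather the distributed operand to the calling process and
# multiply there; a real method would multiply each local block on its owning worker.
function gmul!(C::AbstractMatrix, A::DArray, B::AbstractMatrix)
    mul!(C, convert(Array, A), B)
    return C
end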

mohamed82008 commented 5 years ago

This PR extends #227.

mohamed82008 commented 5 years ago

Interestingly, I am getting a 10x slowdown on the GPU compared to the CPU; this needs further investigation.

mohamed82008 commented 5 years ago

Much better now:

using Random, LinearAlgebra, CuArrays, IterativeSolvers

seed = 123321

tol = 1e-3

ns = (1000, 3000, 9000)

function getmatrix(::Type{T}, n) where {T}
    A = rand(T, n, n)
    A = A + A' + 2*n*I
    A
end

function f()
    global seed, tol, ns
    T = Float64
    Random.seed!(seed)
    for n in ns
        A = getmatrix(T, n)
        B = getmatrix(T, n)
        X = rand(T, n, 2)
        iterator = LOBPCGIterator(A, B, false, X)
        if n == ns[1]
            # warm-up run on the smallest size so compilation is excluded from the
            # timings; X and the iterator are then rebuilt for the timed run below
            lobpcg!(iterator, maxiter = n ÷ 2, tol = tol)
            X = rand(T, n, 2)
            iterator = LOBPCGIterator(A, B, false, X)
        end
        t = @elapsed lobpcg!(iterator, maxiter = n ÷ 2, tol = tol)
        println("CPU: T = $T, n = $n, t = $t")
    end

    Random.seed!(seed)
    for n in ns
        A = CuArray(getmatrix(T, n))
        B = CuArray(getmatrix(T, n))
        X = CuArray(rand(T, n, 2))
        iterator = LOBPCGIterator(A, B, false, X)
        if n == ns[1]
            # same warm-up as above, for the GPU code path
            lobpcg!(iterator, maxiter = n ÷ 2, tol = tol)
            X = CuArray(rand(T, n, 2))
            iterator = LOBPCGIterator(A, B, false, X)
        end
        t = @elapsed lobpcg!(iterator, maxiter = n ÷ 2, tol = tol)
        println("GPU: T = $T, n = $n, t = $t")
    end
end
f()

#=
CPU: T = Float64, n = 1000, t = 0.060437162
CPU: T = Float64, n = 3000, t = 0.549240807
CPU: T = Float64, n = 9000, t = 11.224066776

GPU: T = Float64, n = 1000, t = 0.29401423
GPU: T = Float64, n = 3000, t = 0.671019518
GPU: T = Float64, n = 9000, t = 5.750453926
=#
haampie commented 5 years ago

Nice! I have little time to review, but I might find some time on Wednesday.

One question though: aren't you at this point benchmarking the performance of a dense matrix-vector product on the GPU vs the CPU? That's an O(n^2) operation that probably dominates all other costs. What if the matrix-vector product is also O(n), i.e. the matrix is sufficiently sparse?

mohamed82008 commented 5 years ago

Oh well, this is true, but the main purpose of this benchmark is to check that some of the speedup from GPU-accelerating the dominant part of the algorithm survives all the other costs we have to pay: data transfer, kernel launches, etc. Since nothing of order n is done on the CPU, that overhead should be constant. So even when I move to sparse arrays, which I will do soon, we should see a similar speedup as long as the multiplication still dominates the running time and the GPU speeds it up, although we may need a larger n in that case. Also note that it is not just the matrix-vector multiplication that runs on the GPU: the Gram matrices are also computed there, and if a good number of eigenvectors is requested, that cost becomes significant compared to pre-multiplying the block of vectors by the sparse matrix.
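
A rough way to check that claim (a sketch; it assumes your CuArrays version provides the CuArrays.@sync macro for synchronized timing, and the numbers will vary with hardware): time the two device-side kernels, A*X and the Gram product X'*(A*X), on the CPU and on the GPU.

using LinearAlgebra, CuArrays

n, k = 9000, 2
A = rand(n, n); A = A + A' + 2*n*I;
X = rand(n, k)
dA, dX = CuArray(A), CuArray(X)

t_cpu = @elapsed begin
    AX = A * X
    G = X' * AX
end

# warm up once so GPU compilation is not included in the timing
CuArrays.@sync begin
    dAX = dA * dX
    dG = dX' * dAX
end
t_gpu = @elapsed CuArrays.@sync begin
    dAX = dA * dX
    dG = dX' * dAX
end
println("CPU: $t_cpu s, GPU: $t_gpu s")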

mohamed82008 commented 5 years ago

It seems that for a sparse matrix the speedup is completely gone. Even more interestingly, the GPU runs out of memory much sooner than it should. I will need to make sure no conversion to a dense matrix is happening anywhere.

Also, simple benchmarking of the multiplication shows that a CSC sparse matrix-vector product is very slow on the GPU and much faster on the CPU, for a roughly constant number of elements per column (~40) and n = 10000. CSR is much better, but only about 2x faster on the GPU than on the CPU, and even with CSR things are not looking good for IterativeSolvers. Note that the speedup in the dense case is close to 1000x, whereas we are only observing a 2x speedup here in LOBPCG and 4x in CG, so there is A LOT of performance left on the table. It may be worth implementing CG and LOBPCG using nothing but CUDAnative, to try to eliminate most of the kernel calls by hand-coding the functions involved. I will play with this more when I get some more time.
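
A rough version of that comparison (a sketch, not the exact benchmark I ran; it assumes CuArrays.CUSPARSE exposes CuSparseMatrixCSC/CuSparseMatrixCSR constructors from a SparseMatrixCSC, multiplication with a CuVector, and CuArrays.@sync; the conversion API differs a bit between CuArrays versions):

using SparseArrays, LinearAlgebra, CuArrays
using CuArrays.CUSPARSE

n = 10_000
A = sprand(n, n, 40 / n)   # ~40 nonzeros per column on average
x = rand(n)

d_csc = CuSparseMatrixCSC(A)
d_csr = CuSparseMatrixCSR(A)
d_x = CuArray(x)

# warm up so compilation is not included in the timings
A * x; d_csc * d_x; d_csr * d_x;

t_cpu = @elapsed A * x
t_gpu_csc = @elapsed CuArrays.@sync d_csc * d_x
t_gpu_csr = @elapsed CuArrays.@sync d_csr * d_x
println("CPU CSC: $t_cpu, GPU CSC: $t_gpu_csc, GPU CSR: $t_gpu_csr")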

haampie commented 5 years ago

Also note that the CPU version of mul! is not threaded. A threaded CSR mv-product can also be quite a bit faster on the CPU.
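
For reference, a minimal sketch of such a threaded CSR-style product on the CPU (not part of this PR): row-major access is emulated by storing the transpose in CSC form, and rows are split across threads. Start Julia with multiple threads (e.g. JULIA_NUM_THREADS=4) to see the effect.

using SparseArrays, LinearAlgebra, Base.Threads

function threaded_csr_mul!(y::Vector{T}, At::SparseMatrixCSC{T}, x::Vector{T}) where {T}
    # column j of At is row j of A, so each thread writes a disjoint set of entries of y
    @threads for j in 1:size(At, 2)
        acc = zero(T)
        for k in nzrange(At, j)
            acc += nonzeros(At)[k] * x[rowvals(At)[k]]
        end
        y[j] = acc
    end
    return y
end

n = 10_000
A = sprand(n, n, 40 / n)
At = sparse(A')        # CSR-like storage of A
x = rand(n)
y = zeros(n)

threaded_csr_mul!(y, At, x) ≈ A * x   # true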

mohamed82008 commented 5 years ago

Yeah, overall I think it is worth playing with the GPU LOBPCG in a separate package until I get it right; then we can port it here. I will close this PR for now.