JuliaLinearAlgebra / IterativeSolvers.jl

Iterative algorithms for solving linear systems, eigensystems, and singular value problems
MIT License

Make cg work on GPU and throw informative errors when about to segfault #227

Closed · mohamed82008 closed this 5 years ago

mohamed82008 commented 5 years ago

So it turns out IterativeSolvers.cg mostly works on the GPU, but a few corner cases needed handling to make it more friendly. This PR hopefully does that. With this PR, the following works:

using LinearAlgebra, CuArrays, IterativeSolvers

n = 100
A = CuArray(rand(n, n)); A = A + A' + 2*n*I   # symmetric positive definite (diagonally dominant)
b = CuArray(rand(n))
x = cg(A, b)

IterativeSolvers.cg! already worked without this PR when all of x, A, and b are on the GPU; if one of them is not, Julia segfaults. I added a check on x and b to give a friendly error; since A can be a linear operator, I left it unchecked. Feedback appreciated!
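For illustration, the check is along these lines (a sketch only, with a hypothetical helper name, not the exact code in the PR):

using LinearAlgebra, CuArrays, IterativeSolvers

# Hypothetical helper: compare the unparameterized array types of x and b
# and raise a readable error instead of letting a mixed CPU/GPU matvec
# segfault later.
samestorage(x, b) = typeof(x).name.wrapper === typeof(b).name.wrapper

function checked_cg!(x, A, b; kwargs...)
    samestorage(x, b) ||
        throw(ArgumentError("x and b must both be Array or both be CuArray"))
    return cg!(x, A, b; kwargs...)
end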

chriscoey commented 5 years ago

Nice! Any observations from performance comparisons (cg on GPU vs CPU)?

mohamed82008 commented 5 years ago

using LinearAlgebra, CuArrays, IterativeSolvers
function f()
    for T in (Float32, Float64)
        for n in (200, 1000, 5000)
            A = rand(T, n, n)
            b = rand(T, n)
            c = similar(b) .= 0
            if n == 200   # warm-up run to exclude compilation time
                cg!(c, A, b)
                c .= 0
            end
            t = @elapsed cg!(c, A, b)
            println("CPU: T = $T, n = $n, t = $t")
        end

        for n in (200, 1000, 5000)
            A = CuArray(rand(T, n, n))
            b = CuArray(rand(T, n))
            c = similar(b) .= 0
            if n == 200   # warm-up run to exclude compilation time
                cg!(c, A, b)
                c .= 0
            end
            t = @elapsed cg!(c, A, b)
            println("GPU: T = $T, n = $n, t = $t")
        end
    end
end
f()
#=
CPU: T = Float32, n = 200, t = 0.006496639
CPU: T = Float32, n = 1000, t = 0.073049358
CPU: T = Float32, n = 5000, t = 36.184316022

GPU: T = Float32, n = 200, t = 0.043137831
GPU: T = Float32, n = 1000, t = 0.252440844
GPU: T = Float32, n = 5000, t = 8.522336734

CPU: T = Float64, n = 200, t = 0.006768321
CPU: T = Float64, n = 1000, t = 0.255374633
CPU: T = Float64, n = 5000, t = 57.059444524

GPU: T = Float64, n = 200, t = 0.053543029
GPU: T = Float64, n = 1000, t = 0.400916859
GPU: T = Float64, n = 5000, t = 16.084756801
=#
mohamed82008 commented 5 years ago

Not to mention DistributedArrays.jl also just works :)
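For example, something along these lines should work (a sketch, untested here; distribute is DistributedArrays' function for spreading an existing array across the workers):

using Distributed
addprocs(4)
@everywhere using DistributedArrays, LinearAlgebra, IterativeSolvers

n = 1_000
M = rand(n, n); M = M + M' + 2*n*I   # symmetric positive definite, built locally
A = distribute(M)                    # spread the matrix across the workers
b = distribute(rand(n))
x = cg(A, b)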

ChrisRackauckas commented 5 years ago

What about GMRES :)

mohamed82008 commented 5 years ago

> What about GMRES :)

If I can make LOBPCG work, I can probably make any of them work :) The main challenge is avoiding non-contiguous views like https://github.com/JuliaMath/IterativeSolvers.jl/blob/01af27fe76069e6be566abfd862664f928997638/src/gmres.jl#L246. The only workaround I can think of is using a contiguous view of a GPUVector that is reshaped into a GPUMatrix; that works, but I will need to be careful to avoid undesirable side effects, e.g. overwriting parts of the "matrix" that I shouldn't. If that's a problem, I can make the vector another field and copy what I need from the matrix using broadcasts with getindex and setindex!, but that increases memory use. Anyway, GMRES seems a lot easier than LOBPCG, so let's see.
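To illustrate the reshape idea (a rough sketch with made-up names, not the gmres.jl code; plain Arrays stand in for the GPU arrays):

# Keep the storage as one contiguous vector, present it as a matrix via
# reshape, and slice the underlying vector with contiguous views. With
# CuArrays, data would be a CuVector and the views stay contiguous.
n, k = 100, 5
data = zeros(n * k)          # the backing buffer
V = reshape(data, n, k)      # matrix sharing the same memory
col3 = view(data, 2n+1:3n)   # contiguous view of column 3 of V
col3 .= 1.0                  # writes through to V[:, 3]
V[1, 3] == 1.0               # true: data, V, and col3 alias the same memory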

haampie commented 5 years ago

In GMRES you should keep the (low-dimensional) Hessenberg matrix on the CPU; the Krylov basis should go to the GPU. I actually made that contiguous-view PR in CuArrays exactly because I wanted to implement GMRES on the GPU ^^

Basically, for all these projection methods there is some high-dimensional stuff with simple number crunching that can go to the GPU, and some low-dimensional stuff where you have to do LAPACK-style linalg, so that is a better fit for the CPU. (Also, with distributed arrays you want each core to do the low-dimensional stuff locally and the high-dimensional stuff distributed.)
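For concreteness, one Arnoldi step with that split looks roughly like this (an illustrative sketch, not this package's code; for simplicity the basis is kept as a plain Vector of device vectors):

using LinearAlgebra

# vs holds the Krylov basis: each entry can be a CuArray (or a DArray);
# H is the small Hessenberg matrix and stays a plain CPU Matrix. Only
# scalars (dot products and norms) cross the device boundary.
function arnoldi_step!(vs::Vector{<:AbstractVector}, H::Matrix, A, k)
    w = A * vs[k]                 # large matvec, runs wherever vs[k] lives
    for j in 1:k
        H[j, k] = dot(vs[j], w)   # device reduction, scalar lands on the CPU
        w = w - H[j, k] * vs[j]   # orthogonalize against vs[j] on the device
    end
    H[k+1, k] = norm(w)
    push!(vs, w / H[k+1, k])      # store the normalized next basis vector
    return H
end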

mohamed82008 commented 5 years ago

> Basically, for all these projection methods there is some high-dimensional stuff with simple number crunching that can go to the GPU, and some low-dimensional stuff where you have to do LAPACK-style linalg, so that is a better fit for the CPU. (Also, with distributed arrays you want each core to do the low-dimensional stuff locally and the high-dimensional stuff distributed.)

The distributed case is more complicated and would probably need separate treatment; cg just working there is a happy extra nugget. As for the small versus large number crunching, it may be worth benchmarking both versions: initiating a data transfer between the CPU and GPU is not free and may end up outweighing any benefit from doing the small number crunching on the CPU.

amontoison commented 5 years ago

Wouldn't it help to implement V::Matrix{T} and H::Matrix{T} as Vector{Vector{T}}?

I did that for DQGMRES in the Krylov.jl repository to avoid the views, which cost 48 bytes each time one is created... and also to be able to use BLAS.dot with the columns of H.

It's a good idea to have implementations that work on GPU / CPU. I should test our methods with this too.

mohamed82008 commented 5 years ago

I went with @haampie's suggestion of doing the small computations on the CPU and the big ones on the GPU for two reasons:

  1. It is easier to implement it this way, and
  2. If we use a DistributedArray, the small computations should be done on a single machine, or it will be extremely slow.

I still suspect keeping the whole thing on the GPU might be faster because of communication overhead, but for the sake of time and usability with DistributedArrays, I went with @haampie's approach in the LOBPCG PR.

mohamed82008 commented 5 years ago

It seems that the speedup is pretty good for CSR sparse matrices as well:

using Random, SparseArrays, LinearAlgebra, CuArrays, CuArrays.CUSPARSE, IterativeSolvers

CuArrays.allowscalar(false)
seed = 123321

tol = 1e-4
npercol = 10

ns = (10_000, 100_000, 1_000_000, 5_000_000)

function getmatrix(::Type{T}, n, npercol) where {T}
    A = sprand(T, n, n, (npercol - 1)/2/n)   # ~npercol nonzeros per column after symmetrization
    A = A + A' + 2*n*I                       # symmetric positive definite (diagonally dominant)
    A
end

function f()
    global seed, tol, ns, npercol
    T = Float64
    Random.seed!(seed)
    for n in ns
        A = getmatrix(T, n, npercol)
        b = rand(T, n)
        x = rand(T, n)
        statevars = CGStateVariables(similar(x), similar(x), similar(x))
        if n == ns[1]   # warm-up run to exclude compilation time
            cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
            x = rand(T, n)
        end
        t = @elapsed cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
        println("CPU: T = $T, n = $n, t = $t")
    end

    Random.seed!(seed)
    for n in ns
        A = CuSparseMatrixCSR(getmatrix(T, n, npercol))
        b = CuArray(rand(T, n))
        x = CuArray(rand(T, n))
        statevars = CGStateVariables(similar(x), similar(x), similar(x))
        if n == ns[1]   # warm-up run to exclude compilation time
            cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
            x = CuArray(rand(T, n))
        end
        t = @elapsed cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
        println("GPU: T = $T, n = $n, t = $t")
    end
end
f()

#=
CPU: T = Float64, n = 10000, t = 0.001078335
CPU: T = Float64, n = 100000, t = 0.011978747
CPU: T = Float64, n = 1000000, t = 0.250723691
CPU: T = Float64, n = 5000000, t = 2.267166389

GPU: T = Float64, n = 10000, t = 0.001863109
GPU: T = Float64, n = 100000, t = 0.003564305
GPU: T = Float64, n = 1000000, t = 0.042549204
GPU: T = Float64, n = 5000000, t = 0.221863912
=#
haampie commented 5 years ago

Sorry for being so late to handle this PR. Thinking about this a bit more: I'm not a big fan of making this package depend on GPUArrays, so my suggestion would be to have an example in the docs, or maybe in the docstring, showing how to use it with GPUArrays. The fact that Julia segfaults with mixed vectors on CPU / GPU feels like an issue that should be fixed upstream in GPUArrays, no?