mohamed82008 closed this 5 years ago
Nice! Any observations from performance comparisons (cg on gpu vs cpu)?
```julia
using LinearAlgebra, CuArrays, IterativeSolvers

function f()
    for T in (Float32, Float64)
        for n in (200, 1000, 5000)
            A = rand(T, n, n)
            b = rand(T, n)
            c = similar(b) .= 0
            if n == 200          # warm up the compiler before timing
                cg!(c, A, b)
                c .= 0
            end
            t = @elapsed cg!(c, A, b)
            println("CPU: T = $T, n = $n, t = $t")
        end
        for n in (200, 1000, 5000)
            A = CuArray(rand(T, n, n))
            b = CuArray(rand(T, n))
            c = similar(b) .= 0
            if n == 200          # warm up before timing
                cg!(c, A, b)
                c .= 0
            end
            t = @elapsed cg!(c, A, b)
            println("GPU: T = $T, n = $n, t = $t")
        end
    end
end
f()

#=
CPU: T = Float32, n = 200, t = 0.006496639
CPU: T = Float32, n = 1000, t = 0.073049358
CPU: T = Float32, n = 5000, t = 36.184316022
GPU: T = Float32, n = 200, t = 0.043137831
GPU: T = Float32, n = 1000, t = 0.252440844
GPU: T = Float32, n = 5000, t = 8.522336734
CPU: T = Float64, n = 200, t = 0.006768321
CPU: T = Float64, n = 1000, t = 0.255374633
CPU: T = Float64, n = 5000, t = 57.059444524
GPU: T = Float64, n = 200, t = 0.053543029
GPU: T = Float64, n = 1000, t = 0.400916859
GPU: T = Float64, n = 5000, t = 16.084756801
=#
```
Not to mention DistributedArrays.jl also just works :)
What about GMRES :)
If I can make LOBPCG work, I can probably make any of them work :) The main challenge is avoiding non-contiguous views like https://github.com/JuliaMath/IterativeSolvers.jl/blob/01af27fe76069e6be566abfd862664f928997638/src/gmres.jl#L246. The only workaround I can think of is using a contiguous view of a GPUVector reshaped into a GPUMatrix; that works. But I will need to be careful that no undesirable side effects happen, e.g. overwriting parts of the "matrix" that I shouldn't. If that's a problem, I can just make the vector another field and copy what I need from the matrix using broadcasts with `getindex` and `setindex!`, but that increases memory use. Anyway, GMRES seems a lot easier than LOBPCG, so let's see.
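A CPU sketch of that reshape workaround, with plain `Array`s standing in for GPU arrays (the same aliasing behavior is what makes it work for `CuArray`s):

```julia
# reshape shares memory with the parent vector, and a column view of the
# reshaped matrix is contiguous, so it maps to a single range of the
# underlying buffer -- no strided (non-contiguous) view is ever created.
v = zeros(6)
M = reshape(v, 3, 2)       # no copy: M and v alias the same memory
view(M, :, 2) .= 1.0       # contiguous write into elements 4:6 of v
@assert v == [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # only the intended slice changed
```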
In GMRES you should keep the (low-dimensional) Hessenberg matrix on the CPU; the Krylov basis should go to the GPU. I made that contiguous-view PR in CuArrays exactly because I wanted to implement GMRES on the GPU ^^
Basically, for all these projection methods there is some high-dimensional stuff with simple number crunching that can go to the GPU, and some low-dimensional stuff where you have to do LAPACK-style linear algebra, which fits the CPU. (Also, with distributed arrays you want each core to do the low-dimensional stuff locally, and the high-dimensional stuff distributed.)
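That split can be sketched on the CPU with a toy Arnoldi iteration (hypothetical `arnoldi_sketch` helper, not IterativeSolvers API; in a real GPU implementation `V` would be a `CuArray` and `H` a plain host `Matrix`):

```julia
using LinearAlgebra

# V holds the high-dimensional Krylov basis (device-side in practice);
# H is the small Hessenberg matrix (host-side).
function arnoldi_sketch(A, b, m)
    n = length(b)
    V = zeros(n, m + 1)      # high-dimensional: would live on the GPU
    H = zeros(m + 1, m)      # low-dimensional: stays on the CPU
    V[:, 1] = b / norm(b)
    for j in 1:m
        w = A * V[:, j]                 # big matvec: GPU work
        for i in 1:j
            H[i, j] = dot(V[:, i], w)   # reduction: scalar back to host
            w .-= H[i, j] .* V[:, i]    # axpy: GPU work
        end
        H[j + 1, j] = norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    end
    return V, H
end

A = rand(20, 20); b = rand(20)
V, H = arnoldi_sketch(A, b, 5)
@assert A * V[:, 1:5] ≈ V * H   # the Arnoldi relation holds
```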
The distributed case is more complicated and would probably need separate treatment; `cg` just working there is a happy bonus. As for the small and large number crunching, it may be worth benchmarking both versions: initiating a data transfer between the CPU and GPU is not free and may end up outweighing any benefit from performing the small number crunching on the CPU.
If you implement `V::Matrix{T}` and `H::Matrix{T}` as `Vector{Vector{T}}`, couldn't that help you? I did that for DQGMRES in the Krylov.jl repository to avoid the views, which allocate 48 bytes each time, and also to be able to use `BLAS.dot` with the columns of `H`.
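A minimal sketch of that storage trick (plain Julia, names hypothetical): each "column" is an ordinary `Vector`, so BLAS routines can be called on it directly, with no view object allocated per access.

```julia
using LinearAlgebra

n, m = 100, 10
H = [zeros(n) for _ in 1:m]      # "matrix" stored column-by-column
H[3] .= rand(n)
w = rand(n)

# BLAS.dot takes plain arrays and strides: no SubArray wrapper needed
d = BLAS.dot(n, H[3], 1, w, 1)
@assert d ≈ dot(H[3], w)
```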
It's a good idea to have implementations that work on both GPU and CPU. I should test it with our methods too.
I went with @haampie's suggestion of doing the small computations on the CPU and the big ones on the GPU for a couple of reasons. I still suspect having the whole thing on the GPU might be faster because of communication overhead, but for the sake of time and usability with DistributedArrays, I went with @haampie's approach in the LOBPCG PR.
It seems that the speedup is pretty good for CSR sparse matrices as well:
```julia
using Random, SparseArrays, LinearAlgebra, CuArrays, CuArrays.CUSPARSE, IterativeSolvers

CuArrays.allowscalar(false)

seed = 123321
tol = 1e-4
npercol = 10
ns = (10_000, 100_000, 1_000_000, 5_000_000)

# Symmetric positive definite sparse matrix, roughly npercol nonzeros per column
function getmatrix(::Type{T}, n, npercol) where {T}
    A = sprand(T, n, n, (npercol - 1)/2/n)
    A = A + A' + 2*n*I
    A
end

function f()
    global seed, tol, ns, npercol
    T = Float64
    Random.seed!(seed)
    for n in ns
        A = getmatrix(T, n, npercol)
        b = rand(T, n)
        x = rand(T, n)
        statevars = CGStateVariables(similar(x), similar(x), similar(x))
        if n == ns[1]    # warm up before timing
            cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
            x = rand(T, n)
        end
        t = @elapsed cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
        println("CPU: T = $T, n = $n, t = $t")
    end
    Random.seed!(seed)
    for n in ns
        A = CuSparseMatrixCSR(getmatrix(T, n, npercol))
        b = CuArray(rand(T, n))
        x = CuArray(rand(T, n))
        statevars = CGStateVariables(similar(x), similar(x), similar(x))
        if n == ns[1]    # warm up before timing
            cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
            x = CuArray(rand(T, n))
        end
        t = @elapsed cg!(x, A, b, statevars=statevars, maxiter=n÷2, tol=tol)
        println("GPU: T = $T, n = $n, t = $t")
    end
end
f()

#=
CPU: T = Float64, n = 10000, t = 0.001078335
CPU: T = Float64, n = 100000, t = 0.011978747
CPU: T = Float64, n = 1000000, t = 0.250723691
CPU: T = Float64, n = 5000000, t = 2.267166389
GPU: T = Float64, n = 10000, t = 0.001863109
GPU: T = Float64, n = 100000, t = 0.003564305
GPU: T = Float64, n = 1000000, t = 0.042549204
GPU: T = Float64, n = 5000000, t = 0.221863912
=#
```
Sorry for being so late to handle this PR. Thinking about this a bit more: I'm not a big fan of making this package depend on GPUArrays, so my suggestion would be to have an example in the docs, or maybe in the docstring, showing how to use it with GPUArrays. The fact that Julia segfaults with mixed vectors on CPU/GPU feels like an issue that should be fixed upstream in GPUArrays, no?
So it turns out `IterativeSolvers.cg` mostly works on the GPU, but a few corner cases needed handling to make it more friendly; this PR hopefully does that. `IterativeSolvers.cg!` already worked without this PR when all of `x`, `A` and `b` are on the GPU; if one of them is not, Julia segfaults. I added the check for `x` and `b` to give a friendly error; however, since `A` can be a linear operator, I left it unchecked. Feedback appreciated!
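For the docs, a minimal example along the lines suggested above might look like this (a sketch only: it assumes a CUDA-capable GPU and the CuArrays-era API used elsewhere in this thread, and that `x`, `A` and `b` all live on the GPU as discussed):

```julia
using LinearAlgebra, CuArrays, IterativeSolvers

n = 1000
Ah = rand(Float64, n, n)
Ah = Ah + Ah' + 2n * I         # symmetric positive definite, so cg applies
A = CuArray(Ah)                # x, A and b must all live on the GPU
b = CuArray(rand(Float64, n))
x = CuArray(zeros(Float64, n))
cg!(x, A, b)
```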