JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

[Feature Request] Determinant Function #110

Open MasonProtter opened 4 years ago

MasonProtter commented 4 years ago

We currently don't have a way to take determinants of matrices on the GPU as far as I know. It'd be really nice if we could do that.

carstenbauer commented 2 years ago

Came up here again: https://discourse.julialang.org/t/determinant-of-cuda-matrix/77245/3

I mentioned:

julia> M = rand(10,10);

julia> Mgpu = cu(M);

julia> det(M) # cpu
-0.10243723110926993

julia> prod(diag(LinearAlgebra.LAPACK.getrf!(M)[1])) # cpu
-0.10243723110926993

julia> prod(diag(CUSOLVER.getrf!(Mgpu)[1])) # gpu
-0.10243723f0

But I think the sign is uncertain here!
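
(For reference, the reason the sign is uncertain: getrf! computes an LU factorization with partial pivoting, PA = LU with unit-diagonal L, so det(A) = det(P)^-1 * prod(diag(U)) = (-1)^s * prod(diag(U)), where s is the number of row swaps recorded in ipiv. The plain diagonal product misses the (-1)^s factor.)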

HenriDeh commented 2 years ago

Indeed, there is a sign issue. I get a negative determinant for a PSD matrix:

LinearAlgebra.det(m::CuMatrix) = prod(diag(CUSOLVER.getrf!(m)[1])) # shameless type piracy
A = rand(10, 10)
S = A * A' # positive semidefinite, so det(S) >= 0
S_d = cu(S)
isapprox(det(S), -det(S_d); rtol = 1e-5) # true: the GPU result comes back with a flipped sign

Since I only need it for PSD matrices, I can work around it by taking the absolute value. I can't help more though; I don't have a strong background in linear algebra.
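
A minimal version of that workaround, as a sketch (psd_det is a hypothetical name; like the snippet above it leans on CUSOLVER.getrf!, and it is only valid when the determinant is known to be nonnegative):

using CUDA, LinearAlgebra

# only correct for matrices with a nonnegative determinant, e.g. PSD matrices;
# copy(m) keeps the input intact, since getrf! factorizes in place
psd_det(m::CuMatrix) = abs(prod(diag(CUDA.CUSOLVER.getrf!(copy(m))[1])))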

stevengj commented 2 years ago

But I think the sign is uncertain here!

You have to multiply by the sign of the permutation. Something like:

function det!(A::CuMatOrAdj)
    A, ipiv = CUSOLVER.getrf!(A)
    # every pivot entry with ipiv[k] != k records a row swap, and each swap flips the sign
    swaps = count(((k, i),) -> k != i, enumerate(Array(ipiv))) # bring ipiv to the host to avoid scalar indexing
    return prod(diag(A)) * (isodd(swaps) ? -1 : 1)
end
LinearAlgebra.det(A::CuMatOrAdj) = det!(copy(A))
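
A quick sanity check of this sketch (my example, not from the thread):

using CUDA, LinearAlgebra

A = rand(Float32, 10, 10)
det!(cu(A)) ≈ det(A) # true: the permutation sign now matches the CPU result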

PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.

stevengj commented 2 years ago

Note that you'll also want to provide logdet and logabsdet, which can be computed with much the same code (you just take the sum of the logs of the |diagonals| instead of the product, and keep track of the signs).
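
A minimal sketch of what that could look like, assuming (as in the snippets above) that CUSOLVER.getrf! returns the factored matrix and the pivot vector; the names are illustrative, not CUDA.jl API:

using CUDA, LinearAlgebra

function logabsdet!(A::CuMatrix)
    A, ipiv = CUDA.CUSOLVER.getrf!(A)
    d = Array(diag(A)) # the diagonal of U is only O(n), so bring it to the host
    piv = Array(ipiv)
    logabs = sum(log ∘ abs, d) # sum of the logs of the |diagonals|
    s = prod(sign, d)          # sign contributed by the diagonal entries
    # each row swap (piv[k] != k) flips the sign of the permutation
    isodd(count(k -> piv[k] != k, eachindex(piv))) && (s = -s)
    return logabs, s
end
LinearAlgebra.logabsdet(A::CuMatrix) = logabsdet!(copy(A))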

carstenbauer commented 2 years ago

Was going to suggest the same as Steven, i.e. multiplying by the sign of the permutation:

My draft implementation was

function det!(m::CuMatOrAdj)
    X_d, ipiv_d = CUSOLVER.getrf!(m)
    diags = Array(diag(X_d)) # the diagonal and the pivots are only O(n),
    ipiv = Array(ipiv_d)     # so move them to the host for the loop below
    det = one(eltype(diags))
    @inbounds for i in 1:length(ipiv)
        det *= diags[i]
        if i != ipiv[i] # a row swap flips the sign
            det = -det
        end
    end
    return det
end
LinearAlgebra.det(A::CuMatOrAdj) = det!(copy(A))

I guess the only question is how to most efficiently implement this.

BTW, do we need a CUDA.@sync to be safe?
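
One way to avoid the host-side loop entirely would be to fold the pivot signs into a device-side reduction; a sketch under the same assumptions as above (det_gpu! is an illustrative name, and the broadcast/reduce are ordinary GPU array operations):

using CUDA, LinearAlgebra

function det_gpu!(m::CuMatrix)
    X_d, ipiv_d = CUDA.CUSOLVER.getrf!(m)
    d = diag(X_d)                           # stays on the device
    idx = CuArray(1:length(d))
    signed = ifelse.(ipiv_d .== idx, d, -d) # negate entries where a row swap occurred
    return reduce(*, signed)                # scalar result is copied back to the host
end

As for CUDA.@sync: the final reduce here (and the Array copies in the draft above) return host-side values, and such device-to-host copies block until the preceding kernels have finished, so an explicit sync shouldn't be needed for correctness.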

carstenbauer commented 2 years ago

PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.

I guess he meant that this is type piracy in his code (i.e. outside of CUDA.jl).

HenriDeh commented 2 years ago

PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.

I guess he meant that this is type piracy in his code (i.e. outside of CUDA.jl).

Yes, it's a copy-paste from my code, where it is piracy.

HenriDeh commented 2 years ago

I fear all of the implementations proposed here are much slower than a CPU equivalent. I made this benchmark, computing the logdet without needing the LU decomposition (I assume it has already been done):

using CUDA, LinearAlgebra, BenchmarkTools

L = LowerTriangular(rand(32, 32))
L_d = L |> cu

# Given L or U, compute logdet of L*U
function mylogdetLorU(LorU::Union{LowerTriangular{<:Number, <:CuArray}, UpperTriangular{<:Number, <:CuArray}})
    return 2 * mapreduce(log, +, diag(LorU), init = 0f0) # fastest I could find
end

@btime mylogdetLorU($L_d) # 85.6 μs
@btime 2 * logdet($L)     # 186.18 ns

I find identical performance with the above implementations of det! (from which I stripped out the getrf! call, since in my case it has already been done).

I think they allocate too much (the CPU implementation does not allocate at all).

SBuercklin commented 2 years ago

I have a logdet implementation inspired by @carstenbauer's code above, and it's faster than the CPU implementation for the case I care about:

function logdet!(m)
    X_d, ipiv_d = CUSOLVER.getrf!(m)
    ipiv = Array(ipiv_d)

    ldet = sum(log, diag(X_d))
    # each row swap contributes a factor of -1 = exp(im*π) to the determinant
    ldet += im * π * get_count(ipiv)

    return real(ldet) + im * (imag(ldet) % 2π)
end

# number of pivot entries that record an actual row swap (ipiv[i] != i)
function get_count(ipiv)
    return @inbounds count(i -> ipiv[i] != i, 1:length(ipiv))
end

my_logdet(A) = logdet!(copy(A))

Timing it:

julia> A = randn(ComplexF64, 1024, 1024);

julia> A_d = CuArray(A);

julia> @belapsed logdet($A)
0.03181127

julia> @belapsed my_logdet($A_d)
0.007738641

julia> isapprox(logdet(A), my_logdet(A_d); rtol = 1e-8)
true

get_count is non-allocating, but we could do it with a mapreduce or similar. You should be able to do the same thing for det, determining the sign in the same way. My only concern is that we still need to pull the ipiv array to the CPU; is there a nice built-in way to do it on the GPU?
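
One option along those lines, as a sketch (gpu_swap_count is a hypothetical helper, not a built-in; it assumes ipiv_d is the device-side pivot vector returned by getrf!):

using CUDA

# count pivot entries that differ from their index, without leaving the device
function gpu_swap_count(ipiv_d::CuVector{<:Integer})
    idx = CuArray(1:length(ipiv_d))
    return sum(ipiv_d .!= idx) # broadcast comparison + reduction, both on the GPU
end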

HenriDeh commented 2 years ago

I guess the overhead of diag allocating on the GPU is insignificant compared to the speed-up the GPU gives on the decomposition itself. My use case is probably too specific to deserve an investigation here.