MasonProtter opened 4 years ago
Came up here again: https://discourse.julialang.org/t/determinant-of-cuda-matrix/77245/3
I mentioned
julia> M = rand(10,10);
julia> Mgpu = cu(M);
julia> det(M) # cpu
-0.10243723110926993
julia> prod(diag(LinearAlgebra.LAPACK.getrf!(M)[1])) # cpu
-0.10243723110926993
julia> prod(diag(CUSOLVER.getrf!(Mgpu)[1])) # gpu
-0.10243723f0
But I think the sign is uncertain here!
Indeed, there is a sign issue. I get a negative determinant for a PSD matrix:
LinearAlgebra.det(m::CuMatrix) = prod(diag(CUSOLVER.getrf!(m)[1])) #Shameless type piracy
A = rand(10,10)
S = A*A'
S_d = cu(S)
det(S) == -det(S_d) #true
Since I only need it for PSD matrices, I can work around it by taking the abs. I can't help more though; I don't have a strong background in LA.
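For reference, a minimal sketch of that PSD-only workaround (same pirated method as above, just wrapped in abs; it is only valid because a PSD matrix has a non-negative determinant):
# PSD-only workaround sketch: det(S) >= 0 for PSD S, so abs discards the unreliable sign.
# Note that getrf! overwrites its argument.
LinearAlgebra.det(m::CuMatrix) = abs(prod(diag(CUSOLVER.getrf!(m)[1]))) # still type piracy outside CUDA.jl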
But I think the sign is uncertain here!
You have to multiply by the sign of the permutation. Something like:
function det!(A::CuMatOrAdj)
    A, ipiv = CUSOLVER.getrf!(A)
    return prod(diag(A)) * (isodd(count(((k,i),) -> k != i, enumerate(ipiv))) ? -1 : 1)
end
LinearAlgebra.det(A::CuMatOrAdj) = det!(copy(A))
PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.
Note that you'll also want to provide logdet and logabsdet, which can be computed with much the same code (you just take the sum of the logs of the |diagonals| instead of the product, and keep track of the signs).
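For what it's worth, here is a rough, untested sketch of logabsdet along those lines (it reuses the getrf! call and the CuMatOrAdj alias from above, and still pulls ipiv and the diagonal back to the host for the sign bookkeeping):
function logabsdet!(A::CuMatOrAdj)
    A, ipiv = CUSOLVER.getrf!(A)
    d = Array(diag(A))                          # diagonal of the U factor
    la = sum(x -> log(abs(x)), d)               # sum of log|diagonals| instead of their product
    nswaps = count(((k, i),) -> k != i, enumerate(Array(ipiv)))  # row swaps flip the sign
    s = prod(sign, d) * (isodd(nswaps) ? -1 : 1)
    return la, s
end
LinearAlgebra.logabsdet(A::CuMatOrAdj) = logabsdet!(copy(A))
The sign handling mirrors the det! above; for complex matrices prod(sign, d) returns the phase rather than ±1, which matches what LinearAlgebra.logabsdet does for complex inputs.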
Was going to suggest the same as Steven, i.e. multiplying by the sign of the permutation:
My draft implementation was
function det!(m::CuMatOrAdj)
    X_d, ipiv_d = CUSOLVER.getrf!(m)
    diags = Array(diag(X_d))
    ipiv = Array(ipiv_d)
    det = one(eltype(diags))
    @inbounds for i in 1:length(ipiv)
        det *= diags[i]
        if i != ipiv[i]
            det = -det
        end
    end
    return det
end
LinearAlgebra.det(A::CuMatOrAdj) = det!(copy(A))
I guess the only question is how to most efficiently implement this.
BTW, do we need a CUDA.@sync to be safe?
PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.
I guess he meant that this is type piracy in his code (i.e. outside of CUDA.jl).
PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.
I guess he meant that this is type piracy in his code (i.e. outside of CUDA.jl).
Yes, it's a copy-paste from my code, where it is piracy.
All of the implementations proposed here are much slower than a CPU equivalent, I fear. I made this benchmark where I compute the logdet without the LU decomposition step (I assume it has already been done):
using CUDA, LinearAlgebra, BenchmarkTools
L = LowerTriangular(rand(32,32))
L_d = L |> cu
# Given L or U, compute logdet of L*U
function mylogdetLorU(LorU::Union{LowerTriangular{<:Number, <:CuArray}, UpperTriangular{<:Number, <:CuArray}})
    return 2*mapreduce(log, +, diag(LorU), init = 0f0) # fastest I could find
end
@btime mylogdetLorU($L_d) #85.6 μs
@btime 2*logdet($L) #186.18 ns
I find identical performance with the above implementations of det! (from which I strip off the getrf!, since in my case it's already done). I think they allocate too much (the CPU implementation does not allocate at all).
I have a logdet implementation inspired by @carstenbauer's code above, and it's faster than the CPU implementation for the case I care about:
function logdet!(m)
    X_d, ipiv_d = CUSOLVER.getrf!(m)
    ipiv = Array(ipiv_d)
    ldet = sum(log, diag(X_d))
    ldet += im * π * get_count(ipiv)   # each row swap contributes a factor of -1, i.e. iπ in the log
    return real(ldet) + im*(imag(ldet) % 2π)
end
# number of row swaps recorded in the pivot vector, i.e. entries with ipiv[i] != i
function get_count(ipiv)
    return @inbounds count(i -> !isequal(i, ipiv[i]), 1:length(ipiv))
end
my_logdet(A) = logdet!(copy(A))
Timing it:
julia> A = randn(ComplexF64, 1024, 1024);
julia> A_d = CuArray(A);
julia> @belapsed logdet($A)
0.03181127
julia> @belapsed my_logdet($A_d)
0.007738641
julia> isapprox(logdet(A), my_logdet(A_d); rtol = 1e-8)
true
get_count is non-allocating, but we could do it with a mapreduce or similar. You should be able to do the same thing for det and determine the sign similarly. My only concern is that we still need to pull the ipiv array to the CPU; is there a nice built-in way to do it on the GPU?
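One possibility (an untested sketch, not an existing CUDA.jl helper): keep ipiv on the device and count the row swaps with a broadcast plus reduction, e.g.
# Hypothetical helper: counts entries with ipiv[i] != i without copying ipiv to the host.
function gpu_swap_count(ipiv_d::CuVector)
    idx = CuArray(1:length(ipiv_d))   # device-side 1:n to compare against
    return sum(ipiv_d .!= idx)        # fused comparison + reduction on the GPU
end
Only the final scalar comes back to the host; whether that actually beats copying a short Int32 vector would need benchmarking.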
I guess the overhead of diag allocating on the GPU is insignificant when one gets the GPU speed-up for the decomposition. My use case is probably too specific to deserve an investigation here.
We currently don't have a way to take determinants of matrices on the GPU as far as I know. It'd be really nice if we could do that.