ceslobfer / GPUmatrix

GPUmatrix: An R package for algebra using GPU
32 stars 1 forks source link

Problems solving sparse linear least squares regression using Cholesky decomposition #2

Closed tomwenseleers closed 1 year ago

tomwenseleers commented 1 year ago

Thanks for your great work on this package!

I was wondering if you could advise me a bit on how to use GPUmatrix to solve a sparse linear least squares regression using Cholesky decomposition.

Given my sparse covariate matrix Xswaug and dense outcome variable ywaug I was doing this in base R as follows:

library(devtools)
devtools::install_github("tomwenseleers/L0glm/L0glm")
library(L0glm)
sim = simulate_spike_train(n=10000, p=10000, k=200, sparse=TRUE) # blurred spike train with 10000 timepoints & 200 blurred peaks
library(Matrix)
Xs = sim$X # sparse covariate matrix Xs of class dgCMatrix (time-shifted copies of the point spread function)
y = sim$y
weights <- 1/(y+0.1) # observation weights (approx 1/Poisson noise variance)
yw = y*sqrt(weights) # weighted y
ywaug = c(yw, rep(0, ncol(Xs))) # augmented weighted y
lambdas <- rep(1, ncol(Xs)) # ridge penalties on individual coefficients
Xsw <- Diagonal(x=sqrt(weights)) %*% Xs # weighted X = X*sqrt(weights)
Xswaug <- rbind(Xsw, Diagonal(x=sqrt(lambdas))) # row augmented covariate matrix

# Cholesky decomposition based solve of weighted least squares problem
chol.solve = function (X, y, tol=.Machine$double.eps^(2/3)) {
  ch <- base::chol(crossprod(X)) # solve using Cholesky decomposition
  backsolve(ch, forwardsolve(ch, crossprod(X, y), upper = TRUE, trans = TRUE))
}

This runs very fast when using Intel MKL (Microsoft R Open 4.0.2) (using 8 cores on my Intel Core i9) : system.time(s1 <- chol.solve(Xswaug, ywaug)) # 1.4s when using Intel MKL (Microsoft R Open 4.0.2), 190s without Intel MKL

I thought the equivalent in GPUmatrixwould be to use chol_solve, but for some reason my timings get much worse compared to the Intel MKL Cholesky solve (39s with GPU version of tensorflow installed or 24s with CPU version of tensorflow installed, this is using my Nvidia RTX A2000 laptop GPU) :

# devtools::install_github(" ceslobfer/GPUmatrix")
library(GPUmatrix)
system.time({Xswaug_gpu <- gpu.matrix(Xswaug, type = "tensorflow", dtype = "float32", sparse = T)
            ywaug_gpu <- gpu.matrix(ywaug, type = "tensorflow", dtype = "float32", sparse = F)  
            s2 <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu, ywaug_gpu))}) # 39s with GPU version of tensorflow, 24s with CPU version of tensorflow

Is this to be expected? Or am I doing something wrong here, and would I need explicitly defined backsolve& forwardsolvefunctions here?

If I use torchas a backend I get these errors, no matter if I use the CPU or the GPU versions:

system.time({Xswaug_gpu <- gpu.matrix(Xswaug, device = "cpu", dtype = "float32", sparse = T)
             ywaug_gpu <- gpu.matrix(ywaug, device = "cpu", dtype = "float32", sparse = F)  
             s_toch_cpu <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu, ywaug_gpu))})
# returns errors
# [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\BatchLinearAlgebra.cpp:1626] Warning: torch.cholesky is deprecated in favor of torch.linalg.cholesky and will be removed in a future PyTorch release.
# L = torch.cholesky(A)
# should be replaced with
# L = torch.linalg.cholesky(A)
# and
# U = torch.cholesky(A, upper=True)
# should be replaced with
# U = torch.linalg.cholesky(A).mH().
# This transform will produce equivalent results for all valid (symmetric positive definite) inputs. (function operator ())
# Error in (function (self, input2, upper)  : 
#             A must be batches of square matrices, but they are 10000 by 1 matrices
#           Exception raised from linearSolveCheckInputs at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/LinearAlgebraUtils.h:288 (most recent call first):
#             00007FFCBD16AD1200007FFCBD16ACB0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
#           00007FFCBD16A79E00007FFCBD16A750 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
#           00007FFC3935F1B900007FFC3935EB90 torch_cpu.dll!at::native::linalg_vecdot_out [<unknown file> @ <unknown line number>]
#           00007FFC3934D75500007FFC3934D270 torch_cpu.dll!at::native::_cholesky_solve_helper_cpu [<unknown file> @ <unknown line number>]
#           00007FFC39351C3200007FFC39351B10 torch_cpu.dll!at::native::cholesky_solve [<unknown file> @ <unknown line number>]
#           00007FFC3A10946E00007FFC3A1080E0 torch_cpu.dll!at::compositeexplicitautograd::xlogy_ [<unknown file> @ <unknown line number>]
#           00007FFC3A0DD46A00007FFC3A0B7CA0 torch_c
#           In addition: Warning message:
#             In warningSparseTensor_torch(y) :
#             Not allowed with sparse matrix, matrix will be cast to dense for the operation. May cause memory problems
#           Timing stopped at: 7.84 0.37 3.21

system.time({Xswaug_gpu <- gpu.matrix(Xswaug, device = "gpu", dtype = "float32", sparse = T)
                                      ywaug_gpu <- gpu.matrix(ywaug, device = "gpu", dtype = "float32", sparse = F)  
                                      s_toch_gpu <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu, ywaug_gpu))}) # 39s
# returns errors
# Error in (function (self, input2, upper)  : 
#             A must be batches of square matrices, but they are 10000 by 1 matrices
#           Exception raised from linearSolveCheckInputs at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/LinearAlgebraUtils.h:288 (most recent call first):
#             00007FFCBD16AD1200007FFCBD16ACB0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
#           00007FFCBD16A79E00007FFCBD16A750 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
#           00007FFC3935F1B900007FFC3935EB90 torch_cpu.dll!at::native::linalg_vecdot_out [<unknown file> @ <unknown line number>]
#           00007FFC3934D75500007FFC3934D270 torch_cpu.dll!at::native::_cholesky_solve_helper_cpu [<unknown file> @ <unknown line number>]
#           00007FFC39351C3200007FFC39351B10 torch_cpu.dll!at::native::cholesky_solve [<unknown file> @ <unknown line number>]
#           00007FFC3A10946E00007FFC3A1080E0 torch_cpu.dll!at::compositeexplicitautograd::xlogy_ [<unknown file> @ <unknown line number>]
#           00007FFC3A0DD46A00007FFC3A0B7CA0 torch_c
#           In addition: Warning message:
#             In warningSparseTensor_torch(y) :
#             Not allowed with sparse matrix, matrix will be cast to dense for the operation. May cause memory problems
#           Timing stopped at: 6.45 0.3 1.54

Any thoughts for how to get rid of these errors and how to obtain good performance for this usage case?

arubio2 commented 1 year ago

Dear Tom,

Thanks for the comments. The answer is not easy at all and the code is OK.

I was able to replicate your results.

system.time(s1 <- chol.solve(Xswaug, ywaug)) # 1.95s when using Intel MKL
(twicking R version 4.2.0 on i7-10700, 8 cores)

I removed the initialization phase to do a fairer comparison:

Xswaug_gpu <- gpu.matrix(Xswaug, type = "tensorflow", dtype = "float32", sparse = T)

ywaug_gpu <- gpu.matrix(ywaug, type = "tensorflow", dtype = "float32", sparse = F)

system.time(s3 <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu,ywaug_gpu))) # 13.75s on RTX 2080Ti

Using float64 the results are slower: 29.94s on the RTX 2080Ti. The critical point is the Cholesky decomposition. It does not seem to be well implemented in torch (or, at least MKL does a much better job). In addition, the crossprod of two sparse matrices is not yet implemented in torch, and we casted one of them to a big full matrix. The rule of thumb is that the more complex the operation, the worse the performance of the CPU.

One possibility (since you are dealing with sparse matrices) is using the Conjugate Gradient method to solve the system of equations. This method only involves the multiplication of a matrix by a vector This is the implementation:

################################################

Example of conjugate gradient with GPU matrix

Conjugate gradient function

cg_solve2 <- function(A, b, x = NULL, tol=1e-8) {

We will solve AtA x = Atb without explicitly compute AtA

At <- t(A)

Atb <- At %*% b

if(is.null(x)) x <- b[1:ncol(A)]

r <- Atb - At %% (A %% x);

p <- (r);

rsold <- sum(r^2)

for (i in 1:length(Atb)) {

Ap = At %*% (A %*% p);

alpha <- drop(rsold / crossprod(p,Ap))

x = x + alpha * p;

r = r - alpha * Ap;

rsnew <- sum(r^2);

if (sqrt(rsnew) < tol) break

p = r + (rsnew / rsold) * p;

rsold = rsnew;

}

return(x)

}

###############################################

The results are similar but the performance is different:

system.time(s1 <- chol.solve(Xswaug, ywaug))

user system elapsed

9.90 0.19 2.02

system.time(s2 <- cg_solve2(Xswaug, ywaug, xinit))

user system elapsed

0.25 0.00 0.04

system.time(s3 <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu,ywaug_gpu)))

user system elapsed

14.47 0.54 13.61

Warning message:

In MatProdGPUmat(x, y) :

Not allowed with two sparse matrix, the smallest matrix will be cast to dense for the operation. May cause memory problems

system.time(s4 <- cg_solve2(Xswaug_gpu, ywaug_gpu, xinit_gpu))

user system elapsed

0.34 0.00 0.34

max(abs(s2-s1))

[1] 1.453544e-10

max(abs(s3-s1))

[1] 2.365705e-07

max(abs(s4-s1))

[1] 5.206683e-07

Results using float64 are the following:

xinit <- rnorm(10000)

Xswaug_gpu <- gpu.matrix(Xswaug, type = "tensorflow", dtype = "float64", sparse = T)

ywaug_gpu <- gpu.matrix(ywaug, type = "tensorflow", dtype = "float64", sparse = F)

xinit_gpu <- gpu.matrix(xinit, type = "tensorflow", dtype = "float64", sparse = F)

system.time(s1 <- chol.solve(Xswaug, ywaug))

user system elapsed

10.10 0.14 2.04

system.time(s2 <- cg_solve2(Xswaug, ywaug, xinit))

user system elapsed

0.25 0.00 0.03

system.time(s3 <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu,ywaug_gpu)))

user system elapsed

28.51 0.67 28.76

Warning message:

In MatProdGPUmat(x, y) :

Not allowed with two sparse matrix, the smallest matrix will be cast to dense for the operation. May cause memory problems

system.time(s4 <- cg_solve2(Xswaug_gpu, ywaug_gpu, xinit_gpu))

user system elapsed

0.32 0.00 0.32

max(abs(s2-s1))

[1] 2.058771e-10

max(abs(s3-s1))

[1] 5.771244e-08

max(abs(s4-s1))

[1] 5.771884e-08

I am still scratching my head... This method is definitely faster and the results are accurate (in fact, the conjugate gradient is known to be very accurate if the tol parameter is even smaller). The side-effect of the improvement in the code is that… well, that MKL also improves a lot. Both methods (in float 64) are roughly 100 times faster. My guess was that the GPU was going to improve more, but it is not the case.

Our experience in matrix multiplication is that the speed in the GPU and in a good CPU are similar. It seems that in this case, the fact of using sparse matrices gives an edge for the CPU.

I tested the same algorithm on full matrices, and seems that in that case, GPU is faster (at least for this example):

set.seed(1)

A <- matrix(rnorm(1e7),1e4,1e3)

b <- rnorm(1e4)

system.time(s1 <- qr.solve(A,b)) # standard method in R

user system elapsed

13.67 0.01 2.77

system.time(s2 <- chol.solve(A,b)) # Accelerated method developed by you

user system elapsed

0.50 0.00 0.06

system.time(s3 <- cg_solve2(A, b)) # Conjugate gradient with full matrices in CPU

user system elapsed

4.90 0.00 0.64

max(abs(s3-s1))

[1] 1.03386e-13

max(abs(s2-s1))

[1] 9.194034e-17

library(GPUmatrix)

gpuA <- gpu.matrix(A, type = "float32")

gpub <- gpu.matrix(b, type = "float32")

system.time(s4 <- cg_solve2(gpuA, gpub)) # Conjugate gradient with full matrices in GPU float32

user system elapsed

0.23 0.00 0.24

max(abs(s4-s1)) # Faster than conjugate gradient in CPU but slower than chol

[1] 1.029281e-13

gpuA <- gpu.matrix(A, type = "float64")

gpub <- gpu.matrix(b, type = "float64")

system.time(s4 <- cg_solve2(gpuA, gpub)) # Conjugate gradient with full matrices in GPU float64

user system elapsed

0.27 0.00 0.27

max(abs(s4-s1)) # Similar speed as float32. It seems that overhead is important.

[1] 1.029281e-13

arubio2 commented 1 year ago

Regarding the torch. We will check htat. While the GPUmatrix was being checked at CRAN, torch cahnged its dependency to a different CUDA version (now is 11.7) and was updated. We have to rewrite the code related to torch to remove warnings and, of course, errors.

tomwenseleers commented 1 year ago

Many thanks for your detailed replies - that's super helpful! Bit of a shame maybe that support for sparse matrices/tensors is not better developed yet in tensorflow and torch - that no doubt must explain the timings. I also tried the RcppEigen sparse solvers (https://eigen.tuxfamily.org/dox/group__TopicSparseSystems.html) and SimplicialLLT and SimplicialLDLT and ConjugateGradient all had good timings too, similar to Cholesky with IntelMKL above, but with the advantage that timings are independent of the installed BLAS in R (LeastSquaresConjugateGradient was also good, but solution was not always very stable).

ceslobfer commented 1 year ago

Hello Tom, thank you for using our package and giving feedback. It is very useful for us to polish the details. Indeed, the function for performing Cholesky in TensorFlow is not very refined. We have noticed a small bug in the chol_sol() function of GPUmatrix using Torch, the arguments are reversed. It was an oversight on our part not to realize this, as our test conditions did not fail. All you have to do in the code is the following: system.time({Xswaug_gpu <- gpu.matrix(Xswaug, device = "gpu", dtype = "float32", sparse = T) ywaug_gpu <- gpu.matrix(ywaug, device = "gpu", dtype = "float32", sparse = F) s_toch_gpu <- chol_solve( crossprod(Xswaug_gpu, ywaug_gpu),chol(crossprod(Xswaug_gpu)))})

I have tried running it and did not have any issues doing it this way. Additionally, the execution was quite fast. You can try doing it like this. We will fix the bug on CRAN shortly. Thank you again for the feedback.

arubio2 commented 1 year ago

In my computer (i7 10700 + RTX 2080 Ti) This code

library(GPUmatrix)
Xswaug_gpu <- gpu.matrix(Xswaug, dtype = "float32", sparse = T)
ywaug_gpu <- gpu.matrix(ywaug, dtype = "float32", sparse = F)  
system.time(s2 <- chol_solve(crossprod(Xswaug_gpu, ywaug_gpu),chol(crossprod(Xswaug_gpu))))

takes 0.40 s (float32) on torch (the default value). Using float64,...

Xswaug_gpu <- gpu.matrix(Xswaug, dtype = "float64", sparse = T)
ywaug_gpu <- gpu.matrix(ywaug, dtype = "float64", sparse = F)  
system.time(s2 <- chol_solve(crossprod(Xswaug_gpu, ywaug_gpu),chol(crossprod(Xswaug_gpu))))

it takes 1.2s. Using the CPU (in my computer) took around 1.9s. It seems that the implementation in torch is better than in Tensorflow.