cdeterman / gpuR

R interface to use GPUs

Performance compared to non-gpu variants #130

Open dpokrovsky opened 6 years ago

dpokrovsky commented 6 years ago

Hello, I've tried to benchmark the performance boost with the following code:

library(gpuR)

rm(list = ls())
currentDevice()
A=matrix(as.double(seq(1,1000.9,by=0.1)),5000,2)
B=matrix(as.double(seq(1,1000.9,by=0.1)),5000,2)
vA=vclMatrix(A,type="double")
vB=vclMatrix(B,type="double")
gA=gpuMatrix(A,type="double")
gB=gpuMatrix(B,type="double")
print("Non-GPU")
system.time({tcrossprod(A,B)})
print("CL")
system.time({tcrossprod(vA,vB)})
print("GPU")
system.time({tcrossprod(gA,gB)})

but results disappoint me a little:

$device
[1] "GeForce 840M"
$device_index
[1] 1
$device_type
[1] "gpu"

[1] "Non-GPU"
   user  system elapsed 
   0.19    0.03    0.22 
[1] "CL"
   user  system elapsed 
   0.05    0.05    0.14 
[1] "GPU"
   user  system elapsed 
   0.61    0.38    1.00 

It seems that vclMatrix is clearly the fastest option (~4x faster than the non-GPU run), but the gpuMatrix class is hugely slower than both! Did I do something wrong here (are there any special guidelines on when and how to use gpuMatrix to get a performance improvement), or is this how it is supposed to behave in the first place?

Thanks in advance!

dselivanov commented 6 years ago

@dpokrovsky use float instead of double; it can be 10x+ faster than double (see the discussion here: https://github.com/cdeterman/gpuR/issues/92). Also, do you use the CUDA or the OpenCL backend?
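A minimal sketch of the single-precision variant (hypothetical sizes; assumes gpuR is installed and a working OpenCL device is selected):

```r
library(gpuR)

# Single precision ("float") is often much faster than "double"
# on consumer GPUs, whose FP64 units are scarce.
A <- matrix(runif(500 * 500), 500, 500)
vA <- vclMatrix(A, type = "float")
vB <- vclMatrix(A, type = "float")
system.time(tcrossprod(vA, vB))
```

Note that float results will differ slightly from base R's double-precision output.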

dpokrovsky commented 6 years ago

@dselivanov thanks for the answer! I use the OpenCL backend (from the CUDA SDK), not gpuRcuda. BTW, I reworked the script to run both microbenchmark and system.time:

library(gpuR)
library(microbenchmark)

listContexts()
setContext(3L) #manually set GPU context here
print(currentContext())
print(currentDevice())

rm(list=ls())

v_size<-1000L
v_by<-0.1
v_from<-1L
v_to<-v_from+(v_size-v_from)*v_by
v_times<-100L

r_size<-500L
c_size<-v_size/r_size
v_type<-"float"
writeLines(c("\nType is: ",v_type))

A<-matrix(seq(v_from,v_to,by=v_by),nrow=r_size,ncol=c_size)
B<-matrix(seq(v_from,v_to,by=v_by),nrow=r_size,ncol=c_size)
vclA<-vclMatrix(seq(v_from,v_to,by=v_by),nrow=r_size,ncol=c_size,type=v_type)
vclB<-vclMatrix(seq(v_from,v_to,by=v_by),nrow=r_size,ncol=c_size,type=v_type)
gpuA<-gpuMatrix(seq(v_from,v_to,by=v_by),nrow=r_size,ncol=c_size,type=v_type)
gpuB<-gpuMatrix(seq(v_from,v_to,by=v_by),nrow=r_size,ncol=c_size,type=v_type)

writeLines("\nmicrobenchmark::")
mbm<-microbenchmark(base=tcrossprod(A,B),
                    vcl=tcrossprod(vclA,vclB),
                    gpu=tcrossprod(gpuA,gpuB),
                    times=v_times,unit="ms")
print(mbm)

ct_base<-c()
ct_vcl<-c()
ct_gpu<-c()

for(i in 1L:v_times){
  ct_base<-c(ct_base,system.time(tcrossprod(A,B))[3])
  ct_vcl<-c(ct_vcl,system.time(tcrossprod(vclA,vclB))[3])
  ct_gpu<-c(ct_gpu,system.time(tcrossprod(gpuA,gpuB))[3])
}

writeLines("\nsystem.time::")
writeLines("base:")
print(summary(unname(ct_base)))
writeLines("\nvcl:")
print(summary(unname(ct_vcl)))
writeLines("\ngpu:")
print(summary(unname(ct_gpu)))

but the result is almost the same even for 'float':

[1] 3
$device
[1] "GeForce 840M"

$device_index
[1] 1

$device_type
[1] "gpu"

Type is: 
float

microbenchmark::
Unit: milliseconds
 expr       min         lq      mean     median        uq       max neval
 base  0.442232  0.7695545  1.029230  0.8057005  0.840508  4.949334   100
  vcl  7.684386  8.1025195  8.829765  8.2830260  8.590491 16.784269   100
  gpu 11.683210 12.0156655 12.767142 12.3113045 12.605382 27.770439   100

system.time::
base:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0e+00   0e+00   0e+00   6e-04   0e+00   2e-02 

vcl:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0100  0.0076  0.0125  0.0200 

gpu:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0100  0.0100  0.0124  0.0200  0.0300 

The CPU is the best, the GPU with GPU-resident data (vclMatrix) is much worse, and the GPU used for calculation only (gpuMatrix) is the worst.

dselivanov commented 6 years ago

I've checked on my side and can confirm that it is slow. I compared against cupy, and it seems gpuR with the OpenCL backend is ~20x slower. Not sure why that is... @dpokrovsky I also suggest using gpuR::synchronize() along with GPU code.

dselivanov commented 6 years ago

Actually, with the CUDA backend performance is the same: ~20x worse compared to cupy on the same hardware. @cdeterman any ideas?

cdeterman commented 6 years ago

@dpokrovsky what version of gpuR are you using? @dselivanov With the cupy comparison, are you just comparing the tcrossprod function or do you see the performance difference in all operations?

dselivanov commented 6 years ago

I compared the dot product, crossprod and tcrossprod.

cdeterman commented 6 years ago

@dselivanov And the 20X slowdown is consistent across them all? I am mostly surprised by the dot product comparison (for matrix multiplication). I see that @dpokrovsky is using GeForce 840M, what GPU are you using?

dselivanov commented 6 years ago

GTX 680, with both the CUDA and OpenCL backends. I think we need to set up a minimal benchmark set so performance can be reported across different platforms.


dpokrovsky commented 6 years ago

@cdeterman I'm using version 2.0.0, built for x64 with the CUDA 9.0 OpenCL header and lib. The driver version is 397.93 (NVIDIA Corporation: OpenCL 1.2 CUDA 9.2.127).

cdeterman commented 6 years ago

@dselivanov I am working on updating gpuRbenchmark for this purpose. Then there can be a stable location where benchmarks can be maintained for all to use.

cdeterman commented 6 years ago

@dpokrovsky @dselivanov I have updated gpuRbenchmark to include benchmark_gemm, benchmark_crossprod and benchmark_tcrossprod. Even with my Intel HD Graphics GPU I am seeing at least the float performance exceeding the base R.

An important point to note: on the first run of any function in gpuR (and thereby ViennaCL) there is an initial OpenCL kernel compilation. That overhead is incurred only on the first run; subsequent runs will not have it. As such, I recommend running any of the benchmark_* functions once with N=2 and then again with however many iterations you wish. Let me know what comes of these benchmarks.
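The warm-up idea can be sketched as follows (hypothetical sizes; assumes an OpenCL device is available):

```r
library(gpuR)
library(microbenchmark)

vA <- vclMatrix(matrix(runif(500 * 500), 500, 500), type = "float")
vB <- vclMatrix(matrix(runif(500 * 500), 500, 500), type = "float")

# The first call triggers the one-time OpenCL kernel compilation; discard it.
invisible(tcrossprod(vA, vB))
synchronize()

# The timed runs now measure only the computation itself.
microbenchmark(gpu = { tcrossprod(vA, vB); synchronize() },
               times = 100L, unit = "ms")
```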

dselivanov commented 6 years ago

I can confirm that on my old hardware I see a speedup of ~3.7x for gpuR on a GTX 680 over OpenBLAS on a 4-core Xeon X3470. As for cupy, it seems I benchmarked it the wrong way (a similar "synchronisation" issue affected the result), so cupy shows more or less the same results as gpuR.

dpokrovsky commented 6 years ago

@cdeterman, sorry, I am on R version 3.5.0, and it says gpuRbenchmark is not compatible with it. I will set a 3.4 installation aside for testing.

dselivanov commented 6 years ago

@dpokrovsky basically you need only these functions - you can just copy them

dpokrovsky commented 6 years ago

@dselivanov thank you for the hint! I re-ran my tests with those functions (not with benchmark_gemm alone) and can confirm there is a speedup for the OpenCL implementation (this one is for tcrossprod):

microbenchmark for tcrossprod::
> mbm <- microbenchmark(base = tcrossprod(A, B), 
+                       vcl = vcl_tcrossprod(vclA, vclB), 
+                       gpu = vcl_tcrossprod(gpuA, gpuB), 
+                       times = v_times, unit = "ms")
> print(mbm)
Unit: milliseconds
 expr       min        lq     mean    median        uq       max neval
 base 72.218559 73.984139 78.38324 75.154648 77.088018 135.15069   500
  vcl  8.698268  9.148756 10.07660  9.341981  9.602143  31.31903   500
  gpu 15.906960 17.096657 17.93813 17.361728 17.732337  38.11137   500

@cdeterman, so did I get it right that we basically always need to wrap the gpuR functions (with a synchronize call inside) in order to get the speed benefits from using them?

dselivanov commented 6 years ago

@dpokrovsky no :-D ! It is used just for a fair benchmark.

Computation in gpuR happens asynchronously. Without synchronize(), control is returned to the R interpreter immediately, so any computation will look like it takes only a couple of microseconds.
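The effect on timing can be sketched like this (hypothetical sizes; assumes a working OpenCL device):

```r
library(gpuR)

vA <- vclMatrix(matrix(runif(500 * 500), 500, 500), type = "float")
vB <- vclMatrix(matrix(runif(500 * 500), 500, 500), type = "float")

# Misleading: returns as soon as the kernel is enqueued,
# so elapsed time measures dispatch, not computation.
system.time(tcrossprod(vA, vB))

# Fair: block until the GPU queue has actually finished.
system.time({
  res <- tcrossprod(vA, vB)
  synchronize()
})
```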

What BLAS do you use for the CPU (run sessionInfo() to check), and what are the sizes of the A and B matrices?

dpokrovsky commented 6 years ago

@dselivanov thanks for the clarification. I use no optimized BLAS for the CPU, just the base matrix products. The matrix size is 250000 elements (500x500); I also tested 2500x100.

dselivanov commented 6 years ago

@dpokrovsky I think if you install OpenBLAS, your CPU matrix products will be faster than the GPU.
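To see which BLAS/LAPACK libraries R is actually linked against (reported by sessionInfo() since R 3.4), something like:

```r
# sessionInfo() reports the BLAS/LAPACK shared libraries in use;
# an OpenBLAS path here means the optimized library is active.
si <- sessionInfo()
print(si$BLAS)    # e.g. a libopenblas path when OpenBLAS is linked
print(si$LAPACK)
```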

cdeterman commented 6 years ago

@dselivanov @dpokrovsky OpenBLAS may be faster. These things depend entirely on the hardware and also on the size of the matrices. I had a previous system where I used OpenBLAS and the GPU (an NVIDIA card) implementation did perform better. Unfortunately, I no longer have that system. I believe it was a GeForce GTX 970.