Open dpokrovsky opened 6 years ago
Hello, I've tried to benchmark the performance boost with a small script, but the results disappoint me a little. It seems that vcl is definitely the fastest option (~4 times faster than the non-GPU run), but the gpu class is hugely slower than the others! Did I do something wrong here (are there any special guidelines on when and how to use gpuMatrix to get performance improvements), or is this not what it is supposed to be in the first place?
Thanks in advance!
@dpokrovsky you can use float instead of double - it can be 10x+ faster than double (see the discussion here: https://github.com/cdeterman/gpuR/issues/92). Also, do you use the CUDA or OpenCL backend?
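For example, the precision is set by the type argument when the GPU matrices are created - a minimal sketch, assuming a GPU context is already selected via listContexts()/setContext():

library(gpuR)
n <- 500L
x <- rnorm(n * n)
vclD <- vclMatrix(x, nrow = n, ncol = n, type = "double")  # fp64: slow on most consumer GPUs
vclF <- vclMatrix(x, nrow = n, ncol = n, type = "float")   # fp32: typically much faster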
@dselivanov thanks for the answer! I use the OpenCL backend (from the CUDA SDK), not gpuRcuda. BTW, I reworked the script to run both microbenchmark and system.time:
library(gpuR)
library(microbenchmark)

listContexts()
setContext(3L)  # manually set GPU context here
print(currentContext())
print(currentDevice())

rm(list = ls())

# benchmark parameters
v_size <- 1000L  # total number of matrix elements
v_by <- 0.1      # step of the value sequence
v_from <- 1L
v_to <- v_from + (v_size - v_from) * v_by
v_times <- 100L  # benchmark iterations
r_size <- 500L   # rows
c_size <- v_size / r_size  # columns (2 here)
v_type <- "float"
writeLines(c("\nType is: ", v_type))

# the same data as base R matrices, in-GPU-memory vclMatrix objects,
# and host-resident gpuMatrix objects
A <- matrix(seq(v_from, v_to, by = v_by), nrow = r_size, ncol = c_size)
B <- matrix(seq(v_from, v_to, by = v_by), nrow = r_size, ncol = c_size)
vclA <- vclMatrix(seq(v_from, v_to, by = v_by), nrow = r_size, ncol = c_size, type = v_type)
vclB <- vclMatrix(seq(v_from, v_to, by = v_by), nrow = r_size, ncol = c_size, type = v_type)
gpuA <- gpuMatrix(seq(v_from, v_to, by = v_by), nrow = r_size, ncol = c_size, type = v_type)
gpuB <- gpuMatrix(seq(v_from, v_to, by = v_by), nrow = r_size, ncol = c_size, type = v_type)

writeLines("\nmicrobenchmark::")
mbm <- microbenchmark(base = tcrossprod(A, B),
                      vcl = tcrossprod(vclA, vclB),
                      gpu = tcrossprod(gpuA, gpuB),
                      times = v_times, unit = "ms")
print(mbm)

# repeat the timings with system.time() (elapsed time, element [3])
ct_base <- c()
ct_vcl <- c()
ct_gpu <- c()
for (i in 1L:v_times) {
  ct_base <- c(ct_base, system.time(tcrossprod(A, B))[3])
  ct_vcl <- c(ct_vcl, system.time(tcrossprod(vclA, vclB))[3])
  ct_gpu <- c(ct_gpu, system.time(tcrossprod(gpuA, gpuB))[3])
}
writeLines("\nsystem.time::")
writeLines("base:")
print(summary(unname(ct_base)))
writeLines("\nvcl:")
print(summary(unname(ct_vcl)))
writeLines("\ngpu:")
print(summary(unname(ct_gpu)))
but the result is almost the same even for 'float':
[1] 3
$device
[1] "GeForce 840M"
$device_index
[1] 1
$device_type
[1] "gpu"
Type is:
float
microbenchmark::
Unit: milliseconds
expr min lq mean median uq max neval
base 0.442232 0.7695545 1.029230 0.8057005 0.840508 4.949334 100
vcl 7.684386 8.1025195 8.829765 8.2830260 8.590491 16.784269 100
gpu 11.683210 12.0156655 12.767142 12.3113045 12.605382 27.770439 100
system.time::
base:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0e+00 0e+00 0e+00 6e-04 0e+00 2e-02
vcl:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0100 0.0076 0.0125 0.0200
gpu:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0100 0.0100 0.0124 0.0200 0.0300
The CPU is the best, the in-GPU-memory vcl class is much worse, and the gpu class, which only moves data to the GPU for the calculation, is the worst.
I've checked on my side and can confirm that it runs slowly. I compared against cupy, and it seems gpuR with the OpenCL backend is ~20x slower. Not sure why that is...
@dpokrovsky also, I suggest using gpuR::synchronize() along with GPU code.
Actually, with the CUDA backend performance is the same: ~20x worse compared to cupy on the same hardware. @cdeterman any ideas?
@dpokrovsky what version of gpuR are you using? @dselivanov With the cupy comparison, are you just comparing the tcrossprod function, or do you see the performance difference in all operations?
I compared the dot product, crossprod and tcrossprod.
@dselivanov And the 20X slowdown is consistent across them all? I am mostly surprised by the dot product comparison (for matrix multiplication). I see that @dpokrovsky is using GeForce 840M, what GPU are you using?
GTX 680, with both CUDA and OpenCL backends. I think we need to set up a minimal benchmark suite, so performance can be reported on different platforms.
@cdeterman I'm using version 2.0.0, built for x64 with the CUDA 9.0 OpenCL header and library. The driver version is 397.93 (NVIDIA Corporation: OpenCL 1.2 CUDA 9.2.127).
@dselivanov I am working on updating gpuRbenchmark for this purpose. Then there can be a stable location where benchmarks can be maintained for all to use.
@dpokrovsky @dselivanov I have updated gpuRbenchmark to include benchmark_gemm, benchmark_crossprod and benchmark_tcrossprod. Even with my Intel HD Graphics GPU I am seeing at least the float performance exceeding base R.
One important point to note: the first run of any function in gpuR (and thereby ViennaCL) incurs the initial OpenCL kernel compilation. That overhead applies only to the first run; subsequent runs will not have it. As such, I recommend running any of the benchmark_* functions once with N=2 and then once again with however many iterations you wish. Let me know what comes from these benchmarks.
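Usage would look something like this (assuming gpuRbenchmark is installed; N is the iteration count described above, the rest of the interface is assumed):

library(gpuRbenchmark)
benchmark_tcrossprod(N = 2)    # warm-up run: pays the one-time OpenCL kernel compilation
benchmark_tcrossprod(N = 100)  # the run whose timings you actually compare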
I can confirm that on my old hardware I see a ~3.7x speedup for gpuR on a GTX 680 over OpenBLAS on a 4-core Xeon X3470. As for cupy, it seems I benchmarked it the wrong way (a similar "synchronisation" issue affected the result), so cupy shows more or less similar results to gpuR.
@cdeterman, sorry, I am on R 3.5.0, and it says that gpuRbenchmark is not available for it. I will try to set up 3.4 on the side for testing.
@dpokrovsky basically you need only these functions - you can just copy them
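Something along these lines - a sketch of what such a wrapper does, not the exact gpuRbenchmark code:

# time the GPU operation itself by forcing the device queue to finish
vcl_tcrossprod <- function(A, B) {
  C <- tcrossprod(A, B)  # enqueues the kernel and returns immediately
  synchronize()          # block until the GPU has actually finished
  C
}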
@dselivanov thank you for the hint!
I re-ran my tests with those functions (not benchmark_gemm alone) and confirm there is a speed bump for the OpenCL implementation (this one is for tcrossprod):
microbenchmark for tcrossprod::
> mbm <- microbenchmark(base = tcrossprod(A, B),
+ vcl = vcl_tcrossprod(vclA, vclB),
+ gpu = vcl_tcrossprod(gpuA, gpuB),
+ times = v_times, unit = "ms")
> print(mbm)
Unit: milliseconds
expr min lq mean median uq max neval
base 72.218559 73.984139 78.38324 75.154648 77.088018 135.15069 500
vcl 8.698268 9.148756 10.07660 9.341981 9.602143 31.31903 500
gpu 15.906960 17.096657 17.93813 17.361728 17.732337 38.11137 500
@cdeterman, so did I get it right that we basically always need to wrap the gpuR functions (with a synchronize call inside) in order to get the speed benefits from using them?
@dpokrovsky no :-D! It is used just for a fair benchmark. Computation in gpuR happens asynchronously: without synchronize(), control returns to the R interpreter immediately, so it looks like any computation takes only a couple of microseconds.
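You can see the effect directly, for example with the vclA/vclB matrices from the script above:

system.time(tcrossprod(vclA, vclB))  # looks nearly instant: it only enqueues the kernel
system.time({
  C <- tcrossprod(vclA, vclB)
  synchronize()                      # wait for the device, so the elapsed time is real
})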
What BLAS (use sessionInfo() to check) do you use for the CPU, and what are the sizes of the A and B matrices?
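On R >= 3.4, sessionInfo() reports the loaded BLAS/LAPACK directly, e.g.:

si <- sessionInfo()
si$BLAS    # path of the BLAS shared library in use
si$LAPACK  # path of the LAPACK shared library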
@dselivanov thanks for the clarification.
I use no optimized BLAS for the CPU, just base R matrix products (the reference BLAS). The matrices have 250,000 elements (500x500); I also tested 2500x100.
@dpokrovsky I think if you install OpenBLAS, your CPU matrix products will be faster than the GPU.
@dselivanov @dpokrovsky OpenBLAS may be faster. These things are completely dependent on the hardware and also on the size of the matrices. I had a previous system where I used OpenBLAS and the GPU (an NVIDIA card) implementation did perform better. Unfortunately I no longer have that system. I believe it was a GeForce GTX 970.