dpo opened this issue 8 years ago:

I wrote simple functions that perform dot products on `Array`s and `CudaArray`s. I'm finding that the CUDA version is about 4x slower. Is this expected? Running this script gives:

(Bonus question: what's up with the EBADF???)

This is on OS X 10.9, Julia 0.4.1 installed from Homebrew, built against OpenBLAS, CUDA 7.5.
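Roughly, the comparison looks like this (a minimal sketch, assuming the CUDArt.jl and CUBLAS.jl packages of that era; `blasdots` is a hypothetical name for the CPU counterpart of the script's `cublasdots`, and `CUBLAS.dot` is assumed to mirror the BLAS-style `dot(n, x, incx, y, incy)` signature):

```julia
using CUDArt, CUBLAS
import Base.LinAlg.BLAS

# Repeat the dot product on the CPU via OpenBLAS.
function blasdots(x, y, nrepeats)
    s = zero(eltype(x))
    for i = 1:nrepeats
        s += BLAS.dot(length(x), x, 1, y, 1)
    end
    return s
end

# Same loop on the GPU; d_x and d_y are CudaArrays already on the device.
function cublasdots(d_x, d_y, nrepeats)
    s = zero(eltype(d_x))
    for i = 1:nrepeats
        s += CUBLAS.dot(length(d_x), d_x, 1, d_y, 1)  # signature assumed
    end
    return s
end
```

Timing `blasdots(x, y, N)` against `cublasdots(d_x, d_y, N)` with `@time` would then produce numbers like the ones later in this thread.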
It might be expected. The time to transfer data to the GPU over PCIe can be pretty substantial. If you can make your array size a power of 2 OR do multiple ops with the same data on the GPU, you should see better perf.
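For instance (a sketch under the same CUDArt.jl/CUBLAS.jl assumptions): pay the PCIe transfer once, then reuse the device-resident data for many operations, so the copy cost is amortized across all the kernel launches.

```julia
using CUDArt, CUBLAS

x = rand(Float32, 2^20)   # Float32 chosen for illustration
y = rand(Float32, 2^20)

d_x = CudaArray(x)  # host -> device copy happens here...
d_y = CudaArray(y)  # ...and here; nothing below touches host memory again

s = 0.0f0
for i = 1:100
    # each call operates on data already resident on the GPU
    s += CUBLAS.dot(length(d_x), d_x, 1, d_y, 1)
end
```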
I probably misunderstand how this all works, but isn't the only transfer occurring when I say `d_x = CudaArray(x)`? Isn't all of `cublasdots()` taking place on the GPU?
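One way to sanity-check that, under the same package assumptions as above, is to time the transfer and the compute separately:

```julia
using CUDArt, CUBLAS

x = rand(Float32, 2^22)
y = rand(Float32, 2^22)

@time begin           # cost of the host -> device transfers alone
    d_x = CudaArray(x)
    d_y = CudaArray(y)
end

# dot returns its scalar result to the host, so this call blocks until the
# kernel finishes, and the timing includes the actual GPU work
@time s = CUBLAS.dot(length(d_x), d_x, 1, d_y, 1)
```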
Oh derp, you're right. I think it still might be the fact that the array size is not a power of two and is a little small.
Well, ok, it starts paying off at arrays of size 2^20:
```
array size: 2^20
0.892670 seconds
0.647335 seconds (3.00 k allocations: 109.375 KB)
array size: 2^21
1.891142 seconds
0.839174 seconds (3.00 k allocations: 109.375 KB)
array size: 2^22
3.775395 seconds
1.492279 seconds (3.00 k allocations: 109.375 KB)
array size: 2^23
7.506833 seconds
3.100094 seconds (3.00 k allocations: 109.375 KB)
array size: 2^24
14.739128 seconds
5.848365 seconds (3.00 k allocations: 109.375 KB)
```
At 2^25, Julia crashes, saying it's out of memory (which is suspicious; `htop` shows my memory usage as constant, and I don't get such a crash when I only use `BLAS.dot`).
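A plausible reading of that crash: the exhausted memory is GPU memory, not host memory, which would explain why `htop` (which only shows host memory) reports constant usage. If the vectors are Float64, two 2^25-element vectors are 256 MB each, 512 MB together, which is on the order of a GT 650M's entire VRAM; and since Julia's GC doesn't feel GPU memory pressure, device buffers from earlier benchmark iterations may still be alive. A minimal sketch, assuming CUDArt.jl's `free`:

```julia
using CUDArt

for k = 20:25
    x = rand(Float32, 2^k)
    d_x = CudaArray(x)
    # ... run and time the benchmark on d_x ...
    free(d_x)   # assumed CUDArt API: releases the device buffer immediately,
                # instead of waiting for Julia's GC to finalize d_x
end
```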
I thought it would pay off at smaller data sizes. Perhaps it's my card (GeForce GT 650M). Anyway, thanks for your help!
It could be the card, especially if you have a nice CPU.