dpo opened this issue 8 years ago:

I wrote simple functions that perform dot products on `Array`s and `CudaArray`s. I'm finding that the CUDA version is about 4x slower. Is this expected? Running this script gives:

(Bonus question: what's up with the EBADF???)

This is on OS X 10.9, Julia 0.4.1 installed from Homebrew, built against OpenBLAS, CUDA 7.5.
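Roughly, the comparison looks like this (a minimal sketch, assuming the CUDArt.jl and CUBLAS.jl packages of that era; `blasdots` is a hypothetical name for the CPU counterpart of the script's `cublasdots`, and `CUBLAS.dot` is assumed to mirror the BLAS-style `dot(n, x, incx, y, incy)` signature):

```julia
using CUDArt, CUBLAS
import Base.LinAlg.BLAS

# Repeat the dot product on the CPU via OpenBLAS.
function blasdots(x, y, nrepeats)
    s = zero(eltype(x))
    for i = 1:nrepeats
        s += BLAS.dot(length(x), x, 1, y, 1)
    end
    return s
end

# Same loop on the GPU; d_x and d_y are CudaArrays already on the device.
function cublasdots(d_x, d_y, nrepeats)
    s = zero(eltype(d_x))
    for i = 1:nrepeats
        s += CUBLAS.dot(length(d_x), d_x, 1, d_y, 1)  # signature assumed
    end
    return s
end
```

Timing `blasdots(x, y, N)` against `cublasdots(d_x, d_y, N)` with `@time` would then produce numbers like the ones later in this thread.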
It might be expected. The time to transfer data to the GPU over PCIe can be pretty substantial. If you can make your array size a power of 2 OR do multiple ops with the same data on the GPU, you should see better perf.
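For instance (a sketch under the same CUDArt.jl/CUBLAS.jl assumptions): pay the PCIe transfer once, then reuse the device-resident data for many operations, so the copy cost is amortized across all the kernel launches.

```julia
using CUDArt, CUBLAS

x = rand(Float32, 2^20)   # Float32 chosen for illustration
y = rand(Float32, 2^20)

d_x = CudaArray(x)  # host -> device copy happens here...
d_y = CudaArray(y)  # ...and here; nothing below touches host memory again

s = 0.0f0
for i = 1:100
    # each call operates on data already resident on the GPU
    s += CUBLAS.dot(length(d_x), d_x, 1, d_y, 1)
end
```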
I probably misunderstand how this all works, but isn't the only transfer occurring when I say `d_x = CudaArray(x)`? Isn't all of `cublasdots()` taking place on the GPU?
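One way to sanity-check that, under the same package assumptions as above, is to time the transfer and the compute separately:

```julia
using CUDArt, CUBLAS

x = rand(Float32, 2^22)
y = rand(Float32, 2^22)

@time begin           # cost of the host -> device transfers alone
    d_x = CudaArray(x)
    d_y = CudaArray(y)
end

# dot returns its scalar result to the host, so this call blocks until the
# kernel finishes, and the timing includes the actual GPU work
@time s = CUBLAS.dot(length(d_x), d_x, 1, d_y, 1)
```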
Oh derp, you're right. I think it still might be the fact that the array size is not a power of two and is a little small.
Well, ok, it starts paying off at arrays of size 2^20:
```
array size: 2^20
0.892670 seconds
0.647335 seconds (3.00 k allocations: 109.375 KB)
array size: 2^21
1.891142 seconds
0.839174 seconds (3.00 k allocations: 109.375 KB)
array size: 2^22
3.775395 seconds
1.492279 seconds (3.00 k allocations: 109.375 KB)
array size: 2^23
7.506833 seconds
3.100094 seconds (3.00 k allocations: 109.375 KB)
array size: 2^24
14.739128 seconds
5.848365 seconds (3.00 k allocations: 109.375 KB)
```
At 2^25, Julia crashes, saying it's out of memory (which is suspicious; `htop` shows my memory usage as constant, and I don't get such a crash when I only use `BLAS.dot`).
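A plausible reading of that crash: the exhausted memory is GPU memory, not host memory, which would explain why `htop` (which only shows host memory) reports constant usage. If the vectors are Float64, two 2^25-element vectors are 256 MB each, 512 MB together, which is on the order of a GT 650M's entire VRAM; and since Julia's GC doesn't feel GPU memory pressure, device buffers from earlier benchmark iterations may still be alive. A minimal sketch, assuming CUDArt.jl's `free`:

```julia
using CUDArt

for k = 20:25
    x = rand(Float32, 2^k)
    d_x = CudaArray(x)
    # ... run and time the benchmark on d_x ...
    free(d_x)   # assumed CUDArt API: releases the device buffer immediately,
                # instead of waiting for Julia's GC to finalize d_x
end
```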
I thought it would pay off at smaller data sizes. Perhaps it's my card (GeForce GT 650M). Anyway, thanks for your help!
It could be the card, especially if you have a nice CPU.