eric-tramel opened this issue 9 years ago
Nicely documented! Made it very easy to see what you're doing and what you are concerned about.
I am not certain what's going on, but here's my hunch: is it possible that `CUBLAS.gemm!` is asynchronous? Meaning, it returns immediately but the GPU is not yet done with the computation? And then your copy-back-to-CPU instruction blocks until it's done; therefore the copy-back is being blamed for time that's actually spent computing the product.
Also, a couple of possible efficiency improvements (sketched below):
- `d_A = CudaArray(A)` already copies the data, no need to call `copy!` thereafter. See https://github.com/JuliaGPU/CUDArt.jl/blob/c7954d7a931fcf48e017a943e4ce4cd13ae6cc48/src/arrays.jl#L103.
- Do you need to initialize `d_C`? If so, can you just call `fill!(d_C, 0)` instead?
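In other words, something like this (a rough sketch; the array names and sizes just mirror what I'm assuming the notebook does):

```julia
using CUDArt

A = rand(1024, 1024)                     # host input (placeholder size)

d_A = CudaArray(A)                       # the constructor already copies A to the
                                         # device, so no copy!(d_A, A) is needed
d_C = CudaArray(Float64, (1024, 1024))   # uninitialized device buffer for the result
fill!(d_C, 0)                            # zero it on the device rather than copying
                                         # a host matrix of zeros across the bus
```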
A wild guess, since I literally started looking at CUDA+Julia about an hour ago, and this is a hunch from memory of old times spent in CUDA+C.
Typically, what you observe is exactly what one observes when not synchronizing device and host before and after timing. Without synchronizing, a lot of CUDA methods just dispatch the work and return almost immediately; the only ones that block and wait for a result are, well, those getting the result back. So it may very well be that your timing of the `d_C -> C` copy actually times the transfer to, the computation on, and the transfer back from the device, all together.
Since I just started looking at Julia+CUDA, I don't know for sure if that's what's happening here, but the symptoms totally look like it.
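To illustrate what I mean (a rough sketch from memory, reusing the names above; I haven't run the notebook):

```julia
# The gemm! call just queues work on the GPU and returns, so @time reports
# almost nothing here...
@time CUBLAS.gemm!('N', 'N', 1.0, d_A, d_B, 1.0, d_C)

# ...while the device->host copy has to wait for the product to be finished
# before it can read d_C, so this @time absorbs the compute time as well.
@time copy!(C, d_C)
```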
haha 1s @timholy :smile:
Man, our timing was down to the second.
Oh, you beat me on the second one.
Though none of us has suggested a function to call to manually sync the device/stream yet. 3..2..1..
You probably have to change your timing lines from `@time copy!(d_B, B)` into something like `device_synchronize(); @time (copy!(d_B, B); device_synchronize())`.
Edit: Just confirmed that this is indeed the case.
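Spelled out for the whole benchmark, the pattern would be something like this (a sketch; I'm assuming the notebook times the uploads, the `gemm!` call, and the download separately):

```julia
# Drain any pending GPU work before starting the clock, and synchronize again
# inside @time so the measurement waits for the operation to actually finish.
device_synchronize(); @time (copy!(d_A, A); device_synchronize())    # host -> device
device_synchronize(); @time (copy!(d_B, B); device_synchronize())    # host -> device
device_synchronize(); @time (CUBLAS.gemm!('N', 'N', 1.0, d_A, d_B, 1.0, d_C);
                             device_synchronize())                   # compute
device_synchronize(); @time (copy!(C, d_C); device_synchronize())    # device -> host
```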
PS: I would be grateful for a copy of the full notebook (maybe as a gist) if possible, it looks like a good starting point for experimentation!
Thanks @timholy and @lucasb-eyer for your well-timed responses :) I'll try out the tests you suggest to see if it is really just that the GPU is dispatching and returning on the BLAS call and then blocking on the copy. That certainly makes the most sense to me.
@lucasb-eyer : Here is a link to the notebook.
Cheers!
Oh, you're looking to use cuDNN? Me too, we should share experiences somehow. Send me an e-mail if you're interested.
I had a question about the use of CUDArt.jl in conjunction with CuBLAS.jl. I'm hoping to find out what I can do to improve the speed of memory copying from the GPU device back to CPU-accessible memory. I'm not sure if there is something I'm not implementing properly or if there is something intrinsic that I'm not properly understanding. I've conducted a series of tests using the BLAS Level-3 function `gemm`, which I detail below (exporting my IJulia notebook over to Markdown...). I should note that I'm using Julia v0.4-rc2 for these experiments.

Specifically, I've observed that for the implementation I detail below, the GPU->CPU `copy!` is about two orders of magnitude slower than the CPU->GPU copy. Surely I've done something wrong, right? If anyone could enlighten me on what I'm doing wrong, I'd be elated.

Thanks!
Testing CuBLAS and CUDArt for Julia
After finally getting NVCC to work on OSX, we can start using the CUDA-themed BLAS packages written for Julia. In this notebook we will document how to utilize the necessary datatypes and show comparisons between the CPU and GPU implementations of common BLAS functions.
I. Calling and using the Libraries
Let's first make sure that we have updated and built the libraries. Because of the recent changes in Julia between `v0.3` and `v0.4`, we expect quite a number of warnings, and even errors, to pop up during the testing phase. However, the core functionality of the packages should be there.
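In outline, this step looks something like the following (a sketch; I'm assuming the packages are registered under the names `CUDArt` and `CUBLAS`):

```julia
# Update and (re)build the GPU packages, then run their test suites; on
# v0.4-rc2 we expect warnings and some test failures, but the core calls work.
Pkg.update()
Pkg.build("CUDArt")
Pkg.build("CUBLAS")
Pkg.test("CUDArt")

using CUDArt, CUBLAS
```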
II. Experiment Parameters
We will focus our comparisons on the BLAS function `gemm`, which computes $$ \mathbf{C} \leftarrow \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}.$$ We will assume that all of these matrices are dense and real. For our experiments we will set $\mathbf{A}: (n \times m)$, $\mathbf{B}: (m \times k)$, $\mathbf{C}: (n \times k)$, and $\alpha = \beta = 1.0$.
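For concreteness, a setup along these lines (the dimensions here are placeholders, not necessarily the values used in the notebook):

```julia
# Problem sizes and scalars (placeholder values).
n, m, k = 2048, 2048, 2048
alpha, beta = 1.0, 1.0

A = rand(n, m)      # dense, real matrices
B = rand(m, k)
C = rand(n, k)
```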
III. Baseline Performance
We will now look at the timing of the base OpenBLAS implementation of `gemm`, which runs on the CPU, alone.
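A sketch of the baseline measurement, using the in-place BLAS wrapper that ships with Julia:

```julia
# OpenBLAS on the CPU: C <- alpha*A*B + beta*C, in place.
# (Run once beforehand so JIT compilation is not included in the timing.)
Base.LinAlg.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
@time Base.LinAlg.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
```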
IV. CUDArt Datatypes
Our first step in being able to use CuBLAS is to initialize our GPU device and make on-device copies of the data structures we're interested in. Below we detail how to fence off the GPU code and ensure that proper garbage collection is performed on the device via CUDArt.
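A minimal sketch of that pattern (assuming a single CUDA device of compute capability 2.0 or higher):

```julia
using CUDArt

# devices(...) do ... end fences off the GPU work: the device is initialized
# for the body and cleaned up when the block exits.
devices(dev -> capability(dev)[1] >= 2) do devlist
    device(devlist[1])       # select the first capable GPU

    d_A = CudaArray(A)       # allocate on the device and copy the host data over
    d_B = CudaArray(B)
    d_C = CudaArray(C)

    # ... GPU gemm! calls and copies back to the host go here ...

    free(d_A); free(d_B); free(d_C)   # release device memory when done
end
```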
V. CuBLAS Timings
Now, let's look at the time requirements for just running `gemm`. We note that this does not include the time of memory copying to and from device memory. For now, let's limit ourselves to the direct comparison of the BLAS function implementation, alone.
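In outline (same placeholder names as above; the device arrays are assumed to already hold $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$):

```julia
# Time only the CuBLAS call itself; no host<->device copies are included.
@time CUBLAS.gemm!('N', 'N', alpha, d_A, d_B, beta, d_C)
```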
So, we can see from the above that we are looking at an order of magnitude improvement in computation time, potentially.
VI. CuBLAS Timings: With Memory Copying
We will now look at the situation where we want to declare a local function which will conduct all of the necessary host-to-device and device-to-host memory copying required for the GPU implementation. Our goal is to see exactly how much advantage we retain in a realistic comparison.
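In outline, the function being timed looks something like this (a sketch; `gpu_gemm!` and its argument list are placeholders for whatever the notebook actually defines):

```julia
# Hypothetical end-to-end wrapper: copy the inputs to the device, run gemm!
# there, and copy the result back so it is usable on the CPU side.
function gpu_gemm!(alpha, A, B, beta, C, d_A, d_B, d_C)
    copy!(d_A, A)                                     # host -> device
    copy!(d_B, B)
    copy!(d_C, C)
    CUBLAS.gemm!('N', 'N', alpha, d_A, d_B, beta, d_C)
    copy!(C, d_C)                                     # device -> host
    return C
end

@time gpu_gemm!(alpha, A, B, beta, C, d_A, d_B, d_C)
```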
We can see that the act of reading the matrix $\mathbf{C}$ back from the device to the CPU actually incurs a huge cost. In fact, the cost is so high as to entirely remove any time advantage we obtain from the CuBLAS implementation of `gemm`.