eric-tramel opened this issue 9 years ago
Nicely documented! Made it very easy to see what you're doing and what you are concerned about.
I am not certain what's going on, but here's my hunch: is it possible that `CUBLAS.gemm!` is asynchronous? Meaning, it returns immediately but the GPU is not yet done with the computation? And then your copy-back-to-CPU instruction blocks until it's done; therefore the copy-back is being blamed for time that's actually spent computing the product.
Also, a couple of possible efficiency improvements (sketched below):
- `d_A = CudaArray(A)` already copies the data, no need to call `copy!` thereafter. See https://github.com/JuliaGPU/CUDArt.jl/blob/c7954d7a931fcf48e017a943e4ce4cd13ae6cc48/src/arrays.jl#L103.
- Do you need to initialize `d_C`? If so, can you just call `fill!(d_C, 0)` instead?
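In other words, something like this (a rough sketch; the array names and sizes just mirror what I'm assuming the notebook does):

```julia
using CUDArt

A = rand(1024, 1024)                     # host input (placeholder size)

d_A = CudaArray(A)                       # the constructor already copies A to the
                                         # device, so no copy!(d_A, A) is needed
d_C = CudaArray(Float64, (1024, 1024))   # uninitialized device buffer for the result
fill!(d_C, 0)                            # zero it on the device rather than copying
                                         # a host matrix of zeros across the bus
```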
A wild guess, since I literally started looking at CUDA+Julia about an hour ago, and this is a hunch from memory of old times spent in CUDA+C.
Typically, what you observe is exactly what one observes when not synchronizing device and host before and after timing. Without synchronizing, a lot of CUDA methods just dispatch the work and return almost immediately; the only ones that block and wait for a result are, well, those getting the result back. So it may very well be that your timing of the `d_C -> C` copy actually times the transfer to, the computation on, and the transfer back from the device, all together.
Since I just started looking at Julia+CUDA, I don't know for sure if that's what's happening here, but the symptoms totally look like it.
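To illustrate what I mean (a rough sketch from memory, reusing the names above; I haven't run the notebook):

```julia
# The gemm! call just queues work on the GPU and returns, so @time reports
# almost nothing here...
@time CUBLAS.gemm!('N', 'N', 1.0, d_A, d_B, 1.0, d_C)

# ...while the device->host copy has to wait for the product to be finished
# before it can read d_C, so this @time absorbs the compute time as well.
@time copy!(C, d_C)
```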
haha 1s @timholy :smile:
Man, our timing was down to the second.
Oh, you beat me on the second one.
Though none of us has suggested a function to call to manually sync the device/stream yet. 3..2..1..
You probably have to change your timing lines from `@time copy!(d_B, B)` into something like `device_synchronize(); @time (copy!(d_B, B); device_synchronize())`.
Edit: Just confirmed that this is indeed the case.
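Spelled out for the whole benchmark, the pattern would be something like this (a sketch; I'm assuming the notebook times the uploads, the `gemm!` call, and the download separately):

```julia
# Drain any pending GPU work before starting the clock, and synchronize again
# inside @time so the measurement waits for the operation to actually finish.
device_synchronize(); @time (copy!(d_A, A); device_synchronize())    # host -> device
device_synchronize(); @time (copy!(d_B, B); device_synchronize())    # host -> device
device_synchronize(); @time (CUBLAS.gemm!('N', 'N', 1.0, d_A, d_B, 1.0, d_C);
                             device_synchronize())                   # compute
device_synchronize(); @time (copy!(C, d_C); device_synchronize())    # device -> host
```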
PS: I would be grateful for a copy of the full notebook (maybe as a gist) if possible, it looks like a good starting point for experimentation!
Thanks @timholy and @lucasb-eyer for your well-timed responses :) I'll try out the tests you suggest to see if it is really just that the GPU is dispatching and returning on the BLAS call and then blocking on the copy. That certainly makes the most sense to me.
@lucasb-eyer : Here is a link to the notebook.
Cheers!
Oh, you're looking to use cuDNN? Me too, we should share experiences somehow. Send me an e-mail if you're interested.
I had a question about the use of CUDArt.jl in conjunction with CuBLAS.jl. I'm hoping to find out what I can do to improve the speed of memory copying from the GPU device back to CPU-accessible memory. I'm not sure if there is something I'm not implementing properly or if there is something intrinsic that I'm not properly understanding. I've conducted a series of tests using the BLAS Level-3 function `gemm`, which I detail below (exporting my IJulia notebook over to Markdown...). I should note that I'm using Julia v0.4-rc2 for these experiments.

Specifically, I've observed that for the implementation I detail below, the GPU->CPU `copy!` is about two orders of magnitude slower than the CPU->GPU copy. Surely I've done something wrong, right? If anyone could enlighten me on what I'm doing wrong, I'd be elated.

Thanks!
Testing CuBLAS and CUDArt for Julia
After finally getting NVCC to work on OSX, we can start using the CUDA-themed BLAS packages written for Julia. In this notebook we will document how to utilize the necessary datatypes and show comparisons between the CPU and GPU implementations of common BLAS functions.
I. Calling and using the Libraries
Let's first make sure that we have updated and built the libraries. Because of the recent changes in Julia between `v0.3` and `v0.4`, we expect quite a number of warnings, and even errors, to pop up during the testing phase. However, the core functionality of the packages should be there.
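In outline, this step looks something like the following (a sketch; I'm assuming the packages are registered under the names `CUDArt` and `CUBLAS`):

```julia
# Update and (re)build the GPU packages, then run their test suites; on
# v0.4-rc2 we expect warnings and some test failures, but the core calls work.
Pkg.update()
Pkg.build("CUDArt")
Pkg.build("CUBLAS")
Pkg.test("CUDArt")

using CUDArt, CUBLAS
```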
II. Experiment Parameters
We will focus our comparisons on the BLAS function `gemm`, which computes $$ \mathbf{C} \leftarrow \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}.$$ We will assume that all of these matrices are dense and real. For our experiments we will set $\mathbf{A}: (n \times m)$, $\mathbf{B}: (m \times k)$, $\mathbf{C}: (n \times k)$, and $\alpha = \beta = 1.0$.
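For concreteness, a setup along these lines (the dimensions here are placeholders, not necessarily the values used in the notebook):

```julia
# Problem sizes and scalars (placeholder values).
n, m, k = 2048, 2048, 2048
alpha, beta = 1.0, 1.0

A = rand(n, m)      # dense, real matrices
B = rand(m, k)
C = rand(n, k)
```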
III. Baseline Performance
We will now look at the timing of the base OpenBLAS implementation of `gemm`, which runs on the CPU, alone.
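A sketch of the baseline measurement, using the in-place BLAS wrapper that ships with Julia:

```julia
# OpenBLAS on the CPU: C <- alpha*A*B + beta*C, in place.
# (Run once beforehand so JIT compilation is not included in the timing.)
Base.LinAlg.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
@time Base.LinAlg.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
```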
IV. CUDArt Datatypes
Our first step in being able to use CuBLAS is to initialize our GPU device and make on-device copies of the data structures we're interested in. Below we detail how to fence off the GPU code and ensure that proper garbage collection is performed on the device via CUDArt.
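A minimal sketch of that pattern (assuming a single CUDA device of compute capability 2.0 or higher):

```julia
using CUDArt

# devices(...) do ... end fences off the GPU work: the device is initialized
# for the body and cleaned up when the block exits.
devices(dev -> capability(dev)[1] >= 2) do devlist
    device(devlist[1])       # select the first capable GPU

    d_A = CudaArray(A)       # allocate on the device and copy the host data over
    d_B = CudaArray(B)
    d_C = CudaArray(C)

    # ... GPU gemm! calls and copies back to the host go here ...

    free(d_A); free(d_B); free(d_C)   # release device memory when done
end
```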
V. CuBLAS Timings
Now, let's look at the time requirements for just running `gemm`. We note that this does not include the time of memory copying to and from device memory. For now, let's limit ourselves to the direct comparison of the BLAS function implementation, alone.
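In outline (same placeholder names as above; the device arrays are assumed to already hold $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$):

```julia
# Time only the CuBLAS call itself; no host<->device copies are included.
@time CUBLAS.gemm!('N', 'N', alpha, d_A, d_B, beta, d_C)
```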
So, we can see from the above that we are looking at an order of magnitude improvement in computation time, potentially.
VI. CuBLAS Timings: With Memory Copying
We will now look at the situation where we want to declare a local function which will conduct all of the necessary host-to-device and device-to-host memory copying required for the GPU implementation. Our goal is to see exactly how much advantage we retain in a realistic comparison.
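In outline, the function being timed looks something like this (a sketch; `gpu_gemm!` and its argument list are placeholders for whatever the notebook actually defines):

```julia
# Hypothetical end-to-end wrapper: copy the inputs to the device, run gemm!
# there, and copy the result back so it is usable on the CPU side.
function gpu_gemm!(alpha, A, B, beta, C, d_A, d_B, d_C)
    copy!(d_A, A)                                     # host -> device
    copy!(d_B, B)
    copy!(d_C, C)
    CUBLAS.gemm!('N', 'N', alpha, d_A, d_B, beta, d_C)
    copy!(C, d_C)                                     # device -> host
    return C
end

@time gpu_gemm!(alpha, A, B, beta, C, d_A, d_B, d_C)
```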
We can see that the act of reading the matrix $\mathbf{C}$ back from the device to the CPU actually incurs a huge cost. In fact, the cost is so high as to entirely remove any time advantage we obtain from the CuBLAS implementation of `gemm`.