JuliaGPU / CuArrays.jl

A Curious Cumulation of CUDA Cuisine
https://juliagpu.org/cuda/

slow performance of `gemm_batch` #182

Closed: Roger-luo closed this issue 5 years ago

Roger-luo commented 6 years ago

I tried CUBLAS.gemm_batched, and it is much slower compared to torch.bmm. Any idea why?
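
For reference, the plain gemm_batched(A, B) in the first (CPU) benchmark below is a small helper that is not defined in this thread; per the inline comment on that line, it is roughly:

using LinearAlgebra

# Sketch of the CPU baseline implied by the comment on the benchmark
# line: one plain BLAS gemm per pair of matrices in the batch.
gemm_batched(A, B) = [LinearAlgebra.BLAS.gemm('N', 'N', a, b) for (a, b) in zip(A, B)]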

julia> A = [rand(10, 10) for i in 1:2000];

julia> B = [rand(10, 10) for i in 1:2000];

julia> @benchmark gemm_batched(A, B) # this is [gemm('N', 'N', each_A, each_B) for (each_A, each_B) in zip(A, B)]
BenchmarkTools.Trial: 
  memory estimate:  1.72 MiB
  allocs estimate:  2003
  --------------
  minimum time:     687.255 μs (0.00% GC)
  median time:      697.382 μs (0.00% GC)
  mean time:        755.602 μs (6.75% GC)
  maximum time:     51.845 ms (98.48% GC)
  --------------
  samples:          6601
  evals/sample:     1

julia> dA = [CuArray{Float64}(each) for each in A];

julia> dB = [CuArray{Float64}(each) for each in B];

julia> @benchmark CuArrays.CUBLAS.gemm_batched('N', 'N', $dA, $dB)
BenchmarkTools.Trial: 
  memory estimate:  736.66 KiB
  allocs estimate:  32063
  --------------
  minimum time:     2.743 ms (0.00% GC)
  median time:      2.969 ms (0.00% GC)
  mean time:        4.826 ms (5.70% GC)
  maximum time:     85.273 ms (65.09% GC)
  --------------
  samples:          1036
  evals/sample:     1

And for PyTorch:

In [1]: import torch

In [2]: A = torch.rand(2000, 10, 10).cuda()

In [3]: B = torch.rand(2000, 10, 10).cuda()

In [4]: %timeit torch.bmm(A, B)
111 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
jekbradbury commented 6 years ago

This is because CuArrays currently wraps cuBLAS's gemmBatched (which takes an array of pointers to each array in the batch) rather than the faster gemmStridedBatched, which takes a single batch-major array (see https://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply/).
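
With a strided-batched wrapper (as Roger-luo adds below), the batch lives in a single 3-D array with the batch index along the last dimension, and one call covers the whole batch; a rough sketch:

using CuArrays

# Batch of 2000 10×10 matrices stored as one contiguous 3-D array,
# with the batch index along the last dimension.
dA = CuArray(rand(Float32, 10, 10, 2000))
dB = CuArray(rand(Float32, 10, 10, 2000))

# A single call maps onto cuBLAS gemmStridedBatched, avoiding the
# per-matrix pointer array that gemmBatched needs.
dC = CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)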

Roger-luo commented 6 years ago

I see, closing. Duplicate of #93 #95

Roger-luo commented 5 years ago

I have implemented the wrapper, but it is still 2x slower than PyTorch:

import torch
A = torch.rand(100, 10, 10)
B = torch.rand(100, 10, 10)
dA, dB = A.cuda(), B.cuda()

In [6]: %timeit torch.bmm(dA, dB)
9.35 µs ± 8.94 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Julia:

using CuArrays, CUDAdrv, BenchmarkTools

A = rand(Float32, 10, 10, 100)
B = rand(Float32, 10, 10, 100)
dA, dB = CuArray(A), CuArray(B)

@benchmark begin
    CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
    CUDAdrv.synchronize()
end

BenchmarkTools.Trial:
  memory estimate:  208 bytes
  allocs estimate:  3
  --------------
  minimum time:     15.269 μs (0.00% GC)
  median time:      15.544 μs (0.00% GC)
  mean time:        15.598 μs (0.00% GC)
  maximum time:     52.408 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I then profiled it:

julia> function bench()
       for i in 1:100000
       CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1.0f0, dA, dB, 1.0f0, dC)
       CUDAdrv.synchronize()
       end
       end
bench (generic function with 1 method)

julia> Profile.@profile bench()

julia> Profile.print()
1963 ./task.jl:259; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
 1963 /home/rluo/repo/julia/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:117; macro expansion
  1963 /home/rluo/repo/julia/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
   1963 ./boot.jl:319; eval(::Module, ::Any)
    345  ./REPL[29]:3; bench()
     2   /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:42; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
     319 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:43; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
     1   /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:44; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
     1   /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1024; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
      1 ./abstractarray.jl:38; size
       1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:53; size
        1 ./sysimg.jl:18; getproperty
     1   /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1037; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
      1 .../rluo/repo/julia/usr/share/julia/stdlib/v1.1/LinearAlgebra/src/dense.jl:105; stride
       1 ./abstractarray.jl:38; size
        1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:53; size
         1 ./sysimg.jl:18; getproperty
     1   /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1039; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
      1 .../rluo/repo/julia/usr/share/julia/stdlib/v1.1/LinearAlgebra/src/dense.jl:105; stride
       1 ./abstractarray.jl:38; size
        1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:53; size
         1 ./sysimg.jl:18; getproperty
     16  /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1041; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
      12 ./boot.jl:393; macro expansion
      4  /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:43; macro expansion
       3 ./array.jl:777; vect
       1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:96; cconvert
        1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:90; buffer
         1 ./boot.jl:727; buffer
    1610 ./REPL[29]:4; bench()
     1610 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:183; synchronize()
      7    /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:124; CuCurrentContext()
       7 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/base.jl:145; macro expansion
        7 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:52; CuContext(::Ptr{Nothing}, ::Bool)
         7 ./dict.jl:448; get!(::getfield(CUDAdrv, Symbol("##5#6")){Ptr{Nothing},Bool}, ::Dict{Ptr{Nothing},CuContext}, ::Ptr{Nothing})
          7 ./dict.jl:309; ht_keyindex2!(::Dict{Ptr{Nothing},CuContext}, ::Ptr{Nothing})
           7 ./dict.jl:169; hashindex
            7 ./hashing.jl:18; hash
             7 ./hashing.jl:23; hash
              1 ./hashing.jl:63; hash_uint
               1 ./hashing.jl:35; hash_64_64
                1 ./int.jl:53; +
              6 ./reflection.jl:258; objectid
      1602 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:183; synchronize(::CuContext)
       1602 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/base.jl:142; macro expansion
1    /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:124; CuCurrentContext()
 1 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/base.jl:145; macro expansion
Roger-luo commented 5 years ago

Is this because of the synchronize, or did I wrap it the wrong way? I did some further benchmarking (changing the batch size to 2000), and it looks like this overhead is constant:

julia> A = rand(Float32, 10, 10, 2000);

julia> B = rand(Float32, 10, 10, 2000);

julia> C = rand(Float32, 10, 10, 2000);

julia> dA, dB, dC = CuArray.([A, B, C]);

julia> @benchmark begin
       CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1.0f0, dA, dB, 1.0f0, dC)
       CUDAdrv.synchronize()
       end
BenchmarkTools.Trial:
  memory estimate:  208 bytes
  allocs estimate:  3
  --------------
  minimum time:     120.416 μs (0.00% GC)
  median time:      121.331 μs (0.00% GC)
  mean time:        121.401 μs (0.00% GC)
  maximum time:     154.983 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

(PyTorch takes 111 µs ± 179 ns per loop.)

maleadt commented 5 years ago

Please run both under nvprof.
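(E.g. `nvprof julia gemm_strided.jl` for the Julia script and `nvprof python torch_bench.py` for the PyTorch one, matching the commands shown in the output below.)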

Roger-luo commented 5 years ago

PyTorch:

import torch
A = torch.rand(100, 10, 10)
B = torch.rand(100, 10, 10)
dA, dB = A.cuda(), B.cuda()

torch.bmm(dA, dB)
==21636== NVPROF is profiling process 21636, command: python torch_bench.py
==21636== Profiling application: python torch_bench.py
==21636== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   61.81%  17.088us         1  17.088us  17.088us  17.088us  maxwell_sgemm_128x64_nn
                   38.19%  10.560us         3  3.5200us     896ns  4.8320us  [CUDA memcpy HtoD]
      API calls:   99.68%  3.20472s         4  801.18ms  10.336us  3.20440s  cudaMalloc
                    0.20%  6.4951ms         4  1.6238ms  509.25us  4.8989ms  cudaGetDeviceProperties
                    0.07%  2.2160ms       370  5.9890us     240ns  258.60us  cuDeviceGetAttribute
                    0.04%  1.3327ms         4  333.18us  302.69us  365.14us  cuDeviceTotalMem
                    0.01%  231.86us         4  57.965us  52.962us  63.836us  cuDeviceGetName
                    0.00%  31.633us         2  15.816us  11.895us  19.738us  cudaMemcpyAsync
                    0.00%  28.544us        24  1.1890us     238ns  15.302us  cudaGetDevice
                    0.00%  25.881us         1  25.881us  25.881us  25.881us  cudaLaunch
                    0.00%  18.221us         2  9.1100us  9.1040us  9.1170us  cudaStreamSynchronize
                    0.00%  13.766us         7  1.9660us     346ns  6.1240us  cudaSetDevice
                    0.00%  12.835us         1  12.835us  12.835us  12.835us  cudaMemcpy
                    0.00%  9.3650us        16     585ns     397ns  1.6960us  cudaEventCreateWithFlags
                    0.00%  7.3530us        19     387ns     193ns  1.6010us  cudaGetDeviceCount
                    0.00%  4.4120us        11     401ns     242ns  1.3000us  cudaDeviceGetAttribute
                    0.00%  3.4610us        24     144ns      99ns     662ns  cudaSetupArgument
                    0.00%  3.3710us         4     842ns     269ns  2.1210us  cuDeviceGetCount
                    0.00%  3.1270us         6     521ns     294ns  1.4030us  cuDeviceGet
                    0.00%  1.2490us         1  1.2490us  1.2490us  1.2490us  cudaFree
                    0.00%  1.0430us         1  1.0430us  1.0430us  1.0430us  cuInit
                    0.00%     857ns         1     857ns     857ns     857ns  cuDriverGetVersion
                    0.00%     576ns         1     576ns     576ns     576ns  cudaConfigureCall
                    0.00%     191ns         1     191ns     191ns     191ns  cudaGetLastError

Julia

using CuArrays
A = rand(Float32, 10, 10, 100)
B = rand(Float32, 10, 10, 100)
dA, dB = CuArray(A), CuArray(B)
CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
==22379== NVPROF is profiling process 22379, command: julia gemm_strided.jl
==22379== Profiling application: julia gemm_strided.jl
==22379== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   62.30%  17.505us         1  17.505us  17.505us  17.505us  maxwell_sgemm_128x64_nn
                   37.70%  10.592us         3  3.5300us     960ns  4.8320us  [CUDA memcpy HtoD]
      API calls:   45.21%  216.35ms         4  54.086ms  7.2470us  216.19ms  cudaFree
                   31.48%  150.64ms         1  150.64ms  150.64ms  150.64ms  cuCtxCreate
                   22.91%  109.62ms         1  109.62ms  109.62ms  109.62ms  cuCtxDestroy
                    0.20%  954.27us       188  5.0750us     117ns  317.45us  cuDeviceGetAttribute
                    0.07%  318.43us         2  159.22us  155.38us  163.06us  cuDeviceTotalMem
                    0.06%  272.94us         3  90.980us  15.688us  236.50us  cuMemAlloc
                    0.03%  160.79us         3  53.596us  5.7750us  146.21us  cudaMalloc
                    0.02%  86.424us         2  43.212us  33.910us  52.514us  cuDeviceGetName
                    0.01%  51.739us         2  25.869us  12.802us  38.937us  cuMemcpyHtoD
                    0.01%  24.225us         1  24.225us  24.225us  24.225us  cudaLaunch
                    0.00%  15.208us         1  15.208us  15.208us  15.208us  cudaMemcpy
                    0.00%  10.669us        16     666ns     366ns  4.1770us  cudaEventDestroy
                    0.00%  8.5600us        16     535ns     428ns  1.7530us  cudaEventCreateWithFlags
                    0.00%  6.7840us         2  3.3920us  3.2300us  3.5540us  cudaThreadSynchronize
                    0.00%  5.1900us        24     216ns     112ns  2.0130us  cudaSetupArgument
                    0.00%  4.7450us        11     431ns     256ns  1.5150us  cudaDeviceGetAttribute
                    0.00%  3.5020us         1  3.5020us  3.5020us  3.5020us  cudaConfigureCall
                    0.00%  2.8270us         5     565ns     198ns  1.4530us  cuDeviceGet
                    0.00%  2.0620us         4     515ns     198ns  1.2520us  cuCtxGetCurrent
                    0.00%  1.7170us         1  1.7170us  1.7170us  1.7170us  cudaGetDevice
                    0.00%  1.6410us         2     820ns     509ns  1.1320us  cuDriverGetVersion
                    0.00%  1.4980us         3     499ns     112ns  1.1130us  cuDeviceGetCount
                    0.00%     543ns         1     543ns     543ns     543ns  cuInit
                    0.00%     282ns         1     282ns     282ns     282ns  cudaGetLastError
maleadt commented 5 years ago

I'm not seeing that. Here are the scripts I'm using:

using CuArrays
import CUDAdrv
import NVTX

const A = rand(Float32, 10, 10, 100)
const B = rand(Float32, 10, 10, 100)
const dA, dB = CuArray(A), CuArray(B)

CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)

NVTX.@activate CUDAdrv.@profile begin
    for i in 1:100
        GC.gc(true)
        NVTX.@range "host" begin
            CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
            CUDAdrv.synchronize()
        end
    end
end

And the Python script:

import gc

import ctypes
nvtx = ctypes.CDLL("/opt/cuda/lib64/libnvToolsExt.so.1.0.0")

import pycuda
import pycuda.autoinit

import torch
A = torch.rand(100, 10, 10)
B = torch.rand(100, 10, 10)
dA, dB = A.cuda(), B.cuda()

# warmup
torch.bmm(dA, dB)

pycuda.driver.start_profiler()
for i in range(1,101):
    gc.collect()
    nvtx.nvtxRangePushA(ctypes.c_char_p(b"host"))
    torch.bmm(dA, dB)
    pycuda.driver.Context.synchronize()
    nvtx.nvtxRangePop()
pycuda.driver.stop_profiler()

You should take care when benchmarking code like that: you're not synchronizing the device, you're using non-constant globals, you're only running the test once without a warm-up, and you're using out-of-place ops that trigger the memory allocator (I left that alone because it doesn't matter in this case, but it often does, since our allocator isn't great yet).
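
A more careful measurement along those lines might look roughly like this (a sketch using BenchmarkTools with interpolated constants, the in-place gemm_strided_batched!, and explicit synchronization, reusing the names from earlier in the thread):

using CuArrays, CUDAdrv, BenchmarkTools

const dA = CuArray(rand(Float32, 10, 10, 100))
const dB = CuArray(rand(Float32, 10, 10, 100))
const dC = CuArray(zeros(Float32, 10, 10, 100))

# Interpolate the arrays, use the in-place variant so nothing new is
# allocated per call, and synchronize so the kernel itself is timed,
# not just the launch; BenchmarkTools handles warm-up and repetition.
@benchmark begin
    CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1.0f0, $dA, $dB, 0.0f0, $dC)
    CUDAdrv.synchronize()
end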

Now, running this under nvprof:

==23773== NVPROF is profiling process 23773, command: julia test.jl
==23773== Profiling application: julia test.jl
==23773== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.3779ms       100  23.778us  23.296us  32.640us  maxwell_sgemm_128x64_nn
      API calls:   97.36%  92.471ms       100  924.71us  29.693us  89.066ms  cudaLaunchKernel
                    2.51%  2.3834ms       100  23.833us  20.153us  65.525us  cuCtxSynchronize
                    0.06%  52.474us       101     519ns     398ns     780ns  cuCtxGetCurrent
                    0.04%  34.021us       100     340ns     229ns  1.0830us  cudaGetLastError
                    0.03%  31.861us         1  31.861us  31.861us  31.861us  cuMemAlloc
                    0.00%  1.9580us         1  1.9580us  1.9580us  1.9580us  cuDeviceGetCount

==23773== NVTX result:
==23773==   Thread "<unnamed>" (id = 697388224)
==23773==     Domain "<unnamed>"
==23773==       Range "host"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  103.52ms       100  1.0352ms  68.040us  96.115ms  host
 GPU activities:  100.00%  2.3779ms       100  23.778us  23.296us  32.640us  maxwell_sgemm_128x64_nn
      API calls:  100.00%  92.471ms       100  924.71us  29.693us  89.066ms  cudaLaunchKernel
==23750== NVPROF is profiling process 23750, command: python test.py
==23750== Profiling application: python test.py
==23750== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.3745ms       100  23.745us  23.263us  31.775us  maxwell_sgemm_128x64_nn
      API calls:   99.57%  684.76ms       100  6.8476ms  30.169us  681.30ms  cudaLaunchKernel
                    0.19%  1.3045ms       100  13.044us  4.9430us  24.733us  cuCtxSynchronize
                    0.17%  1.1561ms      1100  1.0510us     644ns  20.344us  cudaGetDevice
                    0.06%  440.15us       300  1.4670us     673ns  19.476us  cudaSetDevice
                    0.01%  47.054us       100     470ns     211ns  12.476us  cudaGetLastError
                    0.00%  1.8030us         1  1.8030us  1.8030us  1.8030us  cuDeviceGetCount

==23750== NVTX result:
==23750==   Thread "<unnamed>" (id = 3543764096)
==23750==     Domain "<unnamed>"
==23750==       Range "host"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  692.98ms       100  6.9298ms  103.90us  681.43ms  host
 GPU activities:  100.00%  2.3745ms       100  23.745us  23.263us  31.775us  maxwell_sgemm_128x64_nn
      API calls:  100.00%  684.76ms       100  6.8476ms  30.169us  681.30ms  cudaLaunchKernel

The NVTX host range shows total host time, and is great for comparing the two versions. I'm not sure why cudaLaunchKernel occasionally takes ages, but cuBLAS is a black box, so there's nothing to be done about that. Looking at the averages and the minimum times, it seems like the Julia implementation performs just fine. And we're calling the same kernel, so our API usage looks OK as well.

Roger-luo commented 5 years ago

I see. thx

maleadt commented 5 years ago

That's not to say there can't be a performance difference in your case (it might depend on your GPU, CUDA version, etc.). If there is, feel free to reopen an issue :slightly_smiling_face: