This is because CuArrays currently wraps cuBLAS's gemmBatched (which takes an array of pointers to each matrix in the batch) rather than the faster gemmStridedBatched, which takes a single batch-major array with a fixed stride between matrices (see https://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply/).
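Concretely, the call-site difference looks roughly like this (a sketch; the commented-out wrapper calls are only illustrative and the exact wrapper names are assumed):

using CuArrays

# gemmBatched-style: one device matrix (hence one device pointer) per batch entry;
# the wrapper has to assemble an array of those pointers and copy it to the device on every call.
As = [CuArray(rand(Float32, 10, 10)) for _ in 1:100]
Bs = [CuArray(rand(Float32, 10, 10)) for _ in 1:100]
# CuArrays.CUBLAS.gemm_batched('N', 'N', As, Bs)

# gemmStridedBatched-style: the whole batch lives in one 3-D array, so each operand is a
# single pointer plus a fixed element stride (stride(A, 3) == 100 here); nothing extra to copy.
A = CuArray(rand(Float32, 10, 10, 100))
B = CuArray(rand(Float32, 10, 10, 100))
# CuArrays.CUBLAS.gemm_strided_batched('N', 'N', A, B)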
I see, closed. dup of #93 #95
I have implemented the wrapper, but it is still 2x slower than PyTorch:
import torch
A = torch.rand(100, 10, 10)
B = torch.rand(100, 10, 10)
dA, dB = A.cuda(), B.cuda()
In [6]: %timeit torch.bmm(dA, dB)
9.35 µs ± 8.94 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
using CuArrays, CUDAdrv, BenchmarkTools
A = rand(Float32, 10, 10, 100)
B = rand(Float32, 10, 10, 100)
dA, dB = CuArray(A), CuArray(B)
@benchmark begin
    CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
    CUDAdrv.synchronize()
end
BenchmarkTools.Trial:
memory estimate: 208 bytes
allocs estimate: 3
--------------
minimum time: 15.269 μs (0.00% GC)
median time: 15.544 μs (0.00% GC)
mean time: 15.598 μs (0.00% GC)
maximum time: 52.408 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
I then ran the profiler on it:
julia> function bench()
for i in 1:100000
CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1.0f0, dA, dB, 1.0f0, dC)
CUDAdrv.synchronize()
end
end
bench (generic function with 1 method)
julia> Profile.@profile bench()
julia> Profile.print()
1963 ./task.jl:259; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
1963 /home/rluo/repo/julia/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:117; macro expansion
1963 /home/rluo/repo/julia/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
1963 ./boot.jl:319; eval(::Module, ::Any)
345 ./REPL[29]:3; bench()
2 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:42; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
319 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:43; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:44; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1024; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
1 ./abstractarray.jl:38; size
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:53; size
1 ./sysimg.jl:18; getproperty
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1037; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
1 .../rluo/repo/julia/usr/share/julia/stdlib/v1.1/LinearAlgebra/src/dense.jl:105; stride
1 ./abstractarray.jl:38; size
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:53; size
1 ./sysimg.jl:18; getproperty
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1039; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
1 .../rluo/repo/julia/usr/share/julia/stdlib/v1.1/LinearAlgebra/src/dense.jl:105; stride
1 ./abstractarray.jl:38; size
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:53; size
1 ./sysimg.jl:18; getproperty
16 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/wrappers.jl:1041; gemm_strided_batched!(::Char, ::Char, ::Float32, ::CuArray{Float32,3}, ::CuArray{Float32,3}, ::Float32, ::Cu...
12 ./boot.jl:393; macro expansion
4 /home/rluo/.julia/packages/CuArrays/pDEXh/src/blas/error.jl:43; macro expansion
3 ./array.jl:777; vect
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:96; cconvert
1 /home/rluo/.julia/packages/CuArrays/pDEXh/src/array.jl:90; buffer
1 ./boot.jl:727; buffer
1610 ./REPL[29]:4; bench()
1610 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:183; synchronize()
7 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:124; CuCurrentContext()
7 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/base.jl:145; macro expansion
7 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:52; CuContext(::Ptr{Nothing}, ::Bool)
7 ./dict.jl:448; get!(::getfield(CUDAdrv, Symbol("##5#6")){Ptr{Nothing},Bool}, ::Dict{Ptr{Nothing},CuContext}, ::Ptr{Nothing})
7 ./dict.jl:309; ht_keyindex2!(::Dict{Ptr{Nothing},CuContext}, ::Ptr{Nothing})
7 ./dict.jl:169; hashindex
7 ./hashing.jl:18; hash
7 ./hashing.jl:23; hash
1 ./hashing.jl:63; hash_uint
1 ./hashing.jl:35; hash_64_64
1 ./int.jl:53; +
6 ./reflection.jl:258; objectid
1602 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:183; synchronize(::CuContext)
1602 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/base.jl:142; macro expansion
1 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/context.jl:124; CuCurrentContext()
1 /home/rluo/.julia/packages/CUDAdrv/LC5XS/src/base.jl:145; macro expansion
Is this because of synchronize, or did I wrap it the wrong way? I did some further benchmarking (changing the batch size to 2000), and it looks like this overhead is constant:
julia> A = rand(Float32, 10, 10, 2000);
julia> B = rand(Float32, 10, 10, 2000);
julia> C = rand(Float32, 10, 10, 2000);
julia> dA, dB, dC = CuArray.([A, B, C]);
julia> @benchmark begin
           CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1.0f0, dA, dB, 1.0f0, dC)
           CUDAdrv.synchronize()
       end
BenchmarkTools.Trial:
memory estimate: 208 bytes
allocs estimate: 3
--------------
minimum time: 120.416 μs (0.00% GC)
median time: 121.331 μs (0.00% GC)
mean time: 121.401 μs (0.00% GC)
maximum time: 154.983 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
(PyTorch: 111 µs ± 179 ns per loop.)
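To see how much of that fixed gap comes from the synchronization call itself, one quick check (a sketch, just timing a bare synchronize with no pending work) would be:

using CUDAdrv, BenchmarkTools
# cost of a synchronize() call on an idle device, i.e. pure driver-call overhead
@benchmark CUDAdrv.synchronize()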
Please run both under nvprof.
import torch
A = torch.rand(100, 10, 10)
B = torch.rand(100, 10, 10)
dA, dB = A.cuda(), B.cuda()
torch.bmm(dA, dB)
==21636== NVPROF is profiling process 21636, command: python torch_bench.py
==21636== Profiling application: python torch_bench.py
==21636== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 61.81% 17.088us 1 17.088us 17.088us 17.088us maxwell_sgemm_128x64_nn
38.19% 10.560us 3 3.5200us 896ns 4.8320us [CUDA memcpy HtoD]
API calls: 99.68% 3.20472s 4 801.18ms 10.336us 3.20440s cudaMalloc
0.20% 6.4951ms 4 1.6238ms 509.25us 4.8989ms cudaGetDeviceProperties
0.07% 2.2160ms 370 5.9890us 240ns 258.60us cuDeviceGetAttribute
0.04% 1.3327ms 4 333.18us 302.69us 365.14us cuDeviceTotalMem
0.01% 231.86us 4 57.965us 52.962us 63.836us cuDeviceGetName
0.00% 31.633us 2 15.816us 11.895us 19.738us cudaMemcpyAsync
0.00% 28.544us 24 1.1890us 238ns 15.302us cudaGetDevice
0.00% 25.881us 1 25.881us 25.881us 25.881us cudaLaunch
0.00% 18.221us 2 9.1100us 9.1040us 9.1170us cudaStreamSynchronize
0.00% 13.766us 7 1.9660us 346ns 6.1240us cudaSetDevice
0.00% 12.835us 1 12.835us 12.835us 12.835us cudaMemcpy
0.00% 9.3650us 16 585ns 397ns 1.6960us cudaEventCreateWithFlags
0.00% 7.3530us 19 387ns 193ns 1.6010us cudaGetDeviceCount
0.00% 4.4120us 11 401ns 242ns 1.3000us cudaDeviceGetAttribute
0.00% 3.4610us 24 144ns 99ns 662ns cudaSetupArgument
0.00% 3.3710us 4 842ns 269ns 2.1210us cuDeviceGetCount
0.00% 3.1270us 6 521ns 294ns 1.4030us cuDeviceGet
0.00% 1.2490us 1 1.2490us 1.2490us 1.2490us cudaFree
0.00% 1.0430us 1 1.0430us 1.0430us 1.0430us cuInit
0.00% 857ns 1 857ns 857ns 857ns cuDriverGetVersion
0.00% 576ns 1 576ns 576ns 576ns cudaConfigureCall
0.00% 191ns 1 191ns 191ns 191ns cudaGetLastError
Julia
using CuArrays
A = rand(Float32, 10, 10, 100)
B = rand(Float32, 10, 10, 100)
dA, dB = CuArray(A), CuArray(B)
CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
==22379== NVPROF is profiling process 22379, command: julia gemm_strided.jl
==22379== Profiling application: julia gemm_strided.jl
==22379== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 62.30% 17.505us 1 17.505us 17.505us 17.505us maxwell_sgemm_128x64_nn
37.70% 10.592us 3 3.5300us 960ns 4.8320us [CUDA memcpy HtoD]
API calls: 45.21% 216.35ms 4 54.086ms 7.2470us 216.19ms cudaFree
31.48% 150.64ms 1 150.64ms 150.64ms 150.64ms cuCtxCreate
22.91% 109.62ms 1 109.62ms 109.62ms 109.62ms cuCtxDestroy
0.20% 954.27us 188 5.0750us 117ns 317.45us cuDeviceGetAttribute
0.07% 318.43us 2 159.22us 155.38us 163.06us cuDeviceTotalMem
0.06% 272.94us 3 90.980us 15.688us 236.50us cuMemAlloc
0.03% 160.79us 3 53.596us 5.7750us 146.21us cudaMalloc
0.02% 86.424us 2 43.212us 33.910us 52.514us cuDeviceGetName
0.01% 51.739us 2 25.869us 12.802us 38.937us cuMemcpyHtoD
0.01% 24.225us 1 24.225us 24.225us 24.225us cudaLaunch
0.00% 15.208us 1 15.208us 15.208us 15.208us cudaMemcpy
0.00% 10.669us 16 666ns 366ns 4.1770us cudaEventDestroy
0.00% 8.5600us 16 535ns 428ns 1.7530us cudaEventCreateWithFlags
0.00% 6.7840us 2 3.3920us 3.2300us 3.5540us cudaThreadSynchronize
0.00% 5.1900us 24 216ns 112ns 2.0130us cudaSetupArgument
0.00% 4.7450us 11 431ns 256ns 1.5150us cudaDeviceGetAttribute
0.00% 3.5020us 1 3.5020us 3.5020us 3.5020us cudaConfigureCall
0.00% 2.8270us 5 565ns 198ns 1.4530us cuDeviceGet
0.00% 2.0620us 4 515ns 198ns 1.2520us cuCtxGetCurrent
0.00% 1.7170us 1 1.7170us 1.7170us 1.7170us cudaGetDevice
0.00% 1.6410us 2 820ns 509ns 1.1320us cuDriverGetVersion
0.00% 1.4980us 3 499ns 112ns 1.1130us cuDeviceGetCount
0.00% 543ns 1 543ns 543ns 543ns cuInit
0.00% 282ns 1 282ns 282ns 282ns cudaGetLastError
I'm not seeing that. Here are the scripts I'm using:
using CuArrays
import CUDAdrv
import NVTX
const A = rand(Float32, 10, 10, 100)
const B = rand(Float32, 10, 10, 100)
const dA, dB = CuArray(A), CuArray(B)
CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
NVTX.@activate CUDAdrv.@profile begin
    for i in 1:100
        GC.gc(true)
        NVTX.@range "host" begin
            CuArrays.CUBLAS.gemm_strided_batched('N', 'N', dA, dB)
            CUDAdrv.synchronize()
        end
    end
end
import gc
import ctypes
nvtx = ctypes.CDLL("/opt/cuda/lib64/libnvToolsExt.so.1.0.0")
import pycuda
import pycuda.autoinit
import torch
A = torch.rand(100, 10, 10)
B = torch.rand(100, 10, 10)
dA, dB = A.cuda(), B.cuda()
# warmup
torch.bmm(dA, dB)
pycuda.driver.start_profiler()
for i in range(1, 101):
    gc.collect()
    nvtx.nvtxRangePushA(ctypes.c_char_p(b"host"))
    torch.bmm(dA, dB)
    pycuda.driver.Context.synchronize()
    nvtx.nvtxRangePop()
pycuda.driver.stop_profiler()
You should take care when benchmarking code like that: you're not synchronizing the device, you're using non-constant globals, you're only running the test once without warm-up, and you're using non-in-place ops that trigger the memory allocator (I left that alone because it doesn't matter in this case, but it often does, since our allocator isn't great yet).
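For reference, a timing loop that avoids those pitfalls could look roughly like this (a sketch using BenchmarkTools; it reuses the in-place wrapper shown earlier in this thread and interpolates the arrays so they aren't non-constant globals):

using CuArrays, CUDAdrv, BenchmarkTools

dA = CuArray(rand(Float32, 10, 10, 100))
dB = CuArray(rand(Float32, 10, 10, 100))
dC = CuArray(zeros(Float32, 10, 10, 100))

# warm-up: absorb compilation and first-call overhead before timing
CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1f0, dA, dB, 0f0, dC)
CUDAdrv.synchronize()

# the in-place call avoids the allocator, $-interpolation avoids global-variable overhead,
# and synchronize() makes sure we time the kernel rather than just the launch
@benchmark begin
    CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', 1f0, $dA, $dB, 0f0, $dC)
    CUDAdrv.synchronize()
end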
Now, running those scripts under nvprof:
==23773== NVPROF is profiling process 23773, command: julia test.jl
==23773== Profiling application: julia test.jl
==23773== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.3779ms 100 23.778us 23.296us 32.640us maxwell_sgemm_128x64_nn
API calls: 97.36% 92.471ms 100 924.71us 29.693us 89.066ms cudaLaunchKernel
2.51% 2.3834ms 100 23.833us 20.153us 65.525us cuCtxSynchronize
0.06% 52.474us 101 519ns 398ns 780ns cuCtxGetCurrent
0.04% 34.021us 100 340ns 229ns 1.0830us cudaGetLastError
0.03% 31.861us 1 31.861us 31.861us 31.861us cuMemAlloc
0.00% 1.9580us 1 1.9580us 1.9580us 1.9580us cuDeviceGetCount
==23773== NVTX result:
==23773== Thread "<unnamed>" (id = 697388224)
==23773== Domain "<unnamed>"
==23773== Range "host"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 103.52ms 100 1.0352ms 68.040us 96.115ms host
GPU activities: 100.00% 2.3779ms 100 23.778us 23.296us 32.640us maxwell_sgemm_128x64_nn
API calls: 100.00% 92.471ms 100 924.71us 29.693us 89.066ms cudaLaunchKernel
==23750== NVPROF is profiling process 23750, command: python test.py
==23750== Profiling application: python test.py
==23750== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.3745ms 100 23.745us 23.263us 31.775us maxwell_sgemm_128x64_nn
API calls: 99.57% 684.76ms 100 6.8476ms 30.169us 681.30ms cudaLaunchKernel
0.19% 1.3045ms 100 13.044us 4.9430us 24.733us cuCtxSynchronize
0.17% 1.1561ms 1100 1.0510us 644ns 20.344us cudaGetDevice
0.06% 440.15us 300 1.4670us 673ns 19.476us cudaSetDevice
0.01% 47.054us 100 470ns 211ns 12.476us cudaGetLastError
0.00% 1.8030us 1 1.8030us 1.8030us 1.8030us cuDeviceGetCount
==23750== NVTX result:
==23750== Thread "<unnamed>" (id = 3543764096)
==23750== Domain "<unnamed>"
==23750== Range "host"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 692.98ms 100 6.9298ms 103.90us 681.43ms host
GPU activities: 100.00% 2.3745ms 100 23.745us 23.263us 31.775us maxwell_sgemm_128x64_nn
API calls: 100.00% 684.76ms 100 6.8476ms 30.169us 681.30ms cudaLaunchKernel
The NVTX host range shows total host time, and is great for comparing both versions. I'm not sure why cudaLaunchKernel occasionally takes ages, but CUBLAS is a black box, so there's nothing to do about that. Looking at the averages and the minimum times (a 68 µs minimum host range for Julia vs. 104 µs for PyTorch), it seems like the Julia implementation performs just fine. And we're calling the same kernel, so our API usage looks OK as well.
I see. thx
That's not to say there can't be a performance difference in your case (it might depend on your GPU, CUDA version, etc.). If there is, just reopen an issue :slightly_smiling_face:
Tried CUBLAS.gemm_batch, and it is much slower compared to torch.bmm. Any idea why?