Bottleneck in mapreducedim for convolutional layers

KristofferC commented 5 years ago

Running the conv network for MNIST in the model-zoo the following profile is obtained:

capture

The time in the mapreduce kernel (https://github.com/JuliaGPU/CuArrays.jl/blob/a3d2650db3eb62f25dcbe18a64ea0a0036caced4/src/mapreduce.jl#L27-L54) is probably a bit big. This seems to be coming from a call to sum following a call to unbroadcast. I'm guessing this is from the activation function?

The specific call to the mapreduce kernel is Base._mapreducedim!(f::typeof(identity), op::typeof(Base.add_sum), R::CuArray{Float32}, A::CuArray{Float32})

MikeInnes commented 5 years ago

This is probably coming from the .+ b here. During the forward pass b gets broadcasted out which means the gradient needs to be collapsed back down again (by summing across the broadcasted dimensions).

Ideally our mapreducedim kernel would just be fast, but it's easier said than done to optimise these kinds of GPU kernels. I believe there was also some work on wrapping CUDNN's gradient function, which would do that reduction for us, but that's not hooked up yet.

KristofferC commented 5 years ago

There is https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBiasActivationForward to do the whole forward pass in one shot and then one can use https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBackwardBias and https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBackwardData for the backward pass?

MikeInnes commented 5 years ago

Yeah – the CUDNN wrappers were set up here, so it just needs someone to set up the right dispatch on the Flux side.

jekbradbury commented 5 years ago

The slow mapreducedim kernel is my fault, and I've learned since then that there's a more optimized kernel in Knet here that might help us understand what we're missing? Maybe a KnetArrays vs CuArrays benchmark can shed light on how big a difference it would make.

avik-pal commented 5 years ago

The integration from Flux's side is here #335. It needs a few fixes though.

KristofferC commented 5 years ago

Heh, I didn't know there already was an implementation so I did one myself (although worse than in the PR).

Getting:

capture2

so it seems even for CUDNN the bias term is dominating.

avik-pal commented 5 years ago

I ran the mnist model with the PR I mentioned.

==7862== Profiling application: julia
==7862== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.80%  208.80us         4  52.199us  14.560us  103.33us  void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
                   17.70%  177.63us         2  88.815us  31.264us  146.37us  void cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
                   11.12%  111.65us         3  37.215us  25.280us  59.968us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                   10.17%  102.08us         4  25.520us  11.040us  43.232us  void calc_bias_diff<int=2, float, float, int=128, int=0>(cudnnTensorStruct, float const *, cudnnTensorStruct, float*, float, float, int)
                    6.41%  64.319us         1  64.319us  64.319us  64.319us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    5.79%  58.079us         4  14.519us  3.6160us  26.623us  ptxcall_anonymous23_9
                    4.48%  44.960us         4  11.240us  2.2080us  21.664us  ptxcall_anonymous23_4
                    3.57%  35.872us         1  35.872us  35.872us  35.872us  void cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
                    2.41%  24.159us         1  24.159us  24.159us  24.159us  volta_sgemm_64x64_nn

I am getting quite low overhead for the bias term when the batch size is less (around 100). But increasing the batch size affects the bias term. It becomes around 28% of the time for a batch size of 1000.

KristofferC commented 5 years ago

Looking only at the forward pass we currently have:

GPU activities:   59.96%  184.91ms      2350  78.683us  29.184us  117.03us  ptxcall_anonymous23_3
                  31.96%  98.559ms      2350  41.940us  16.960us  67.488us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)

while enabling cudnnConvolutionBiasActivationForward we have:

 GPU activities:   81.38%  108.64ms      2350  46.229us  18.368us  69.505us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)

by avoiding the anonymous kernel in applying the bias and activation function. I'll try make a PR for it.

FluxML / Flux.jl

Bottleneck in mapreducedim for convolutional layers #558