KristofferC opened this issue 5 years ago
This is probably coming from the `.+ b` here. During the forward pass, `b` gets broadcast out, which means the gradient needs to be collapsed back down again (by summing across the broadcast dimensions).
Ideally our mapreducedim kernel would just be fast, but it's easier said than done to optimise these kinds of GPU kernels. I believe there was also some work on wrapping CUDNN's gradient function, which would do that reduction for us, but that's not hooked up yet.
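To make the shape bookkeeping concrete, here is a minimal sketch (array sizes made up; written against CuArrays' user-facing API):

```julia
using CuArrays

# The bias is stored per output channel; the forward broadcast
# materialises it across W, H and the batch dimension N.
W, H, C, N = 28, 28, 16, 128
x = cu(rand(Float32, W, H, C, N))
b = cu(rand(Float32, 1, 1, C, 1))
y = x .+ b                        # forward: b broadcast to (W, H, C, N)

# Backward: the gradient w.r.t. b is collapsed back to b's shape by
# summing over every broadcast dimension -- this reduction is what
# ends up in the mapreducedim kernel.
dy = cu(ones(Float32, W, H, C, N))
db = sum(dy; dims = (1, 2, 4))    # shape (1, 1, C, 1), same as b
```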
There is https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBiasActivationForward to do the whole forward pass in one shot, and then one could use https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBackwardBias and https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBackwardData for the backward pass?
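For reference, the reduction that cudnnConvolutionBackwardBias performs is just this, written as a plain-Julia sketch of the semantics in Julia's WHCN layout (`conv_bias_grad!` is an illustrative name, and alpha/beta follow cuDNN's usual blending convention):

```julia
# dy has shape (W, H, C, N); db has shape (1, 1, C, 1).
# cuDNN blending convention: db = alpha * reduce(dy) + beta * db.
function conv_bias_grad!(db, dy; alpha = 1f0, beta = 0f0)
    db .= alpha .* sum(dy; dims = (1, 2, 4)) .+ beta .* db
    return db
end
```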
Yeah – the CUDNN wrappers were set up here, so it just needs someone to set up the right dispatch on the Flux side.
The slow mapreducedim kernel is my fault, and I've since learned that there's a more optimised kernel in Knet here that might help us understand what we're missing. Maybe a KnetArray vs. CuArray benchmark can shed light on how big a difference it would make.
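Something along these lines should do for that benchmark (a sketch; it assumes both packages are installed and that KnetArray supports tuple-valued dims in sum; the Array(...) copy forces the GPU work to finish before the timing stops):

```julia
using BenchmarkTools, CuArrays, Knet

dy = rand(Float32, 28, 28, 16, 128)
dy_cu = cu(dy)
dy_kn = Knet.KnetArray(dy)

# Time the bias-gradient-style reduction on both array types.
@btime Array(sum($dy_cu; dims = (1, 2, 4)))
@btime Array(sum($dy_kn; dims = (1, 2, 4)))
```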
The integration on Flux's side is in #335. It needs a few fixes, though.
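The eventual hookup would look something like this (purely illustrative, not the code in #335; `cudnn_conv_bias_act` is a made-up wrapper name around cudnnConvolutionBiasActivationForward):

```julia
using Flux, NNlib, CuArrays

# Placeholder for the real fused wrapper; here it just composes the
# unfused operations (2-D conv, vector bias) so the sketch runs.
function cudnn_conv_bias_act(x, w, b, σ; kw...)
    σ.(NNlib.conv(x, w; kw...) .+ reshape(b, 1, 1, :, 1))
end

# Hypothetical dispatch: route Conv's forward pass through the fused
# call whenever the input lives on the GPU.
function (c::Flux.Conv)(x::CuArray)
    cudnn_conv_bias_act(x, c.weight, c.bias, c.σ;
                        stride = c.stride, pad = c.pad, dilation = c.dilation)
end
```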
Heh, I didn't know there was already an implementation, so I did one myself (although a worse one than the PR's).
I ran the mnist model with the PR I mentioned and got:
```
==7862== Profiling application: julia
==7862== Profiling result:
Type  Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 20.80% 208.80us 4 52.199us 14.560us 103.33us void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
17.70% 177.63us 2 88.815us 31.264us 146.37us void cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
11.12% 111.65us 3 37.215us 25.280us 59.968us void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
10.17% 102.08us 4 25.520us 11.040us 43.232us void calc_bias_diff<int=2, float, float, int=128, int=0>(cudnnTensorStruct, float const *, cudnnTensorStruct, float*, float, float, int)
6.41% 64.319us 1 64.319us 64.319us 64.319us void cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
5.79% 58.079us 4 14.519us 3.6160us 26.623us ptxcall_anonymous23_9
4.48% 44.960us 4 11.240us 2.2080us 21.664us ptxcall_anonymous23_4
3.57% 35.872us 1 35.872us 35.872us 35.872us void cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
2.41% 24.159us 1 24.159us 24.159us 24.159us volta_sgemm_64x64_nn
```

So it seems that even for CUDNN the bias term is dominating.
I am getting quite low overhead from the bias term when the batch size is small (around 100), but increasing the batch size makes it worse: the bias term takes around 28% of the time at a batch size of 1000.
Looking only at the forward pass we currently have:
```
GPU activities: 59.96% 184.91ms 2350 78.683us 29.184us 117.03us ptxcall_anonymous23_3
31.96% 98.559ms 2350 41.940us 16.960us 67.488us void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
```
while with cudnnConvolutionBiasActivationForward enabled we have:
```
GPU activities: 81.38% 108.64ms 2350 46.229us 18.368us 69.505us void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
```
The difference comes from avoiding the anonymous kernel that applies the bias and activation function. I'll try to make a PR for it.
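For context, the two forward paths in the profiles above correspond to roughly this (a sketch; `cudnn_conv_bias_act` is again an illustrative name for the fused wrapper, not an existing API):

```julia
# Unfused (first profile): cuDNN's implicit_convolve_sgemm does the
# convolution, then a broadcast kernel (the ptxcall_anonymous entry)
# applies the bias and activation.
y = relu.(NNlib.conv(x, w) .+ b)

# Fused (second profile): one cudnnConvolutionBiasActivationForward
# call covers conv + bias + activation; the broadcast kernel disappears.
y = cudnn_conv_bias_act(x, w, b, relu)
```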
Running the conv network for MNIST from the model-zoo, the following profile is obtained:
The time in the mapreduce kernel (https://github.com/JuliaGPU/CuArrays.jl/blob/a3d2650db3eb62f25dcbe18a64ea0a0036caced4/src/mapreduce.jl#L27-L54) is probably a bit big. This seems to be coming from a call to `sum` following a call to `unbroadcast`. I'm guessing this is from the activation function? The specific call to the mapreduce kernel is `Base._mapreducedim!(f::typeof(identity), op::typeof(Base.add_sum), R::CuArray{Float32}, A::CuArray{Float32})`.
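That call can be reproduced in isolation (a sketch against the CuArrays revision linked above; array sizes made up):

```julia
using CuArrays

A = cu(rand(Float32, 28, 28, 16, 128))
R = CuArrays.zeros(Float32, 1, 1, 16, 1)

# sum(A; dims = (1, 2, 4)) lowers to exactly the method in the trace:
Base._mapreducedim!(identity, Base.add_sum, R, A)
```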