Open kli-casia opened 8 years ago
Is this running speed normal?
I don't know for this particular case, but in general, if you're in doubt, you can run the script with
CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=1 python the_script.py
stop it after one epoch, and check whether there are any CPU operations in the graph (i.e., operations that do not have "Gpu" in their name). If so, something is wrong with the script or your setup. Does the Lasagne MNIST example run at normal speed? On a Titan X, it should take under 1 second per epoch for the CNN architecture and under 0.3 seconds for the MLP.
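That "look for ops without Gpu in their name" check can be scripted. Below is a rough helper of my own (not part of Theano) that scans the Class lines of a profile dump like the ones pasted later in this thread. Note that a few cheap host-side ops (MakeVector, Shape_i, scalar Elemwise) show up even in healthy GPU graphs with ~0% runtime, so treat hits as candidates to inspect, not proof of a problem:

```python
import re

def find_cpu_classes(profile_text):
    """Return op class names from a Theano profile dump that don't run on the GPU.

    Heuristic for the old Theano CUDA backend: GPU ops carry 'Gpu' in their
    class name (e.g. theano.sandbox.cuda.basic_ops.GpuElemwise).
    """
    cpu_classes = []
    for line in profile_text.splitlines():
        # Profile class lines end with a dotted class path.
        m = re.search(r'(theano\.[\w.]+)\s*$', line)
        if m:
            name = m.group(1)
            if 'Gpu' not in name.rsplit('.', 1)[-1]:
                cpu_classes.append(name)
    return cpu_classes

sample = """
77.5% 77.5% 4.029s 5.54e-04s C 7266 1038 theano.sandbox.cuda.basic_ops.GpuElemwise
0.3% 99.7% 0.016s 2.30e-05s C 679 97 theano.tensor.opt.MakeVector
"""
print(find_cpu_classes(sample))  # ['theano.tensor.opt.MakeVector']
```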
For me, on a Titan X with cuDNN v4, the epoch time is about 126s.
Hi, I am also using a Titan X, cuDNN v5, unmodified code, but am getting epoch times of around 900 seconds:
Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5005)
Epoch 1 of 82 took 862.625s
training loss: 2.011102
validation loss: 1.258493
validation accuracy: 54.50 %
Epoch 2 of 82 took 914.957s
training loss: 1.258001
validation loss: 0.854177
validation accuracy: 70.54 %
Epoch 3 of 82 took 887.374s
training loss: 0.965086
validation loss: 0.753001
validation accuracy: 74.92 %
Interesting -- could you please check what I posted before (https://github.com/Lasagne/Recipes/issues/57#issuecomment-211312157)?
Hi, thanks for looking into this.
I think my system is correctly set up (Python 2.7.11, 64-bit, Lasagne version 0.2.dev1).
I ran the profiling (1 epoch, using 1,000 instead of all 100,000 training examples), and it looks like it only uses GPU operations:
Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5005)
Function profiling
==================
Message: Deep_Residual_Learning_CIFAR-10.py:247
Time in 7 calls to Function.__call__: 7.454000e+00s
Time in Function.fn.__call__: 7.454000e+00s (100.000%)
Time in thunks: 5.195546e+00s (69.701%)
Total compile time: 5.661100e+01s
Number of Apply nodes: 2301
Theano Optimizer time: 5.039100e+01s
Theano validate time: 5.015000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 3.797000e+00s
Import time 2.960000e-01s
Time in all call to theano.grad() 1.765000e+00s
Time since theano import 99.413s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
77.5% 77.5% 4.029s 5.54e-04s C 7266 1038 theano.sandbox.cuda.basic_ops.GpuElemwise
9.2% 86.8% 0.479s 1.43e-03s C 336 48 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
5.7% 92.5% 0.297s 1.68e-04s C 1764 252 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.4% 94.9% 0.125s 5.96e-04s C 210 30 theano.sandbox.cuda.dnn.GpuDnnConvGradI
2.1% 97.0% 0.109s 5.04e-04s C 217 31 theano.sandbox.cuda.dnn.GpuDnnConvGradW
0.9% 97.9% 0.047s 9.57e-04s C 49 7 theano.sandbox.cuda.basic_ops.GpuFromHost
0.6% 98.5% 0.031s 1.44e-04s C 217 31 theano.sandbox.cuda.dnn.GpuDnnConv
0.6% 99.1% 0.031s 1.49e-03s C 21 3 theano.sandbox.cuda.basic_ops.GpuAlloc
0.3% 99.4% 0.016s 7.35e-06s C 2128 304 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 99.7% 0.016s 2.30e-05s C 679 97 theano.tensor.opt.MakeVector
0.3% 100.0% 0.016s 5.58e-04s C 28 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 100.0% 0.000s 0.00e+00s C 959 137 theano.sandbox.cuda.basic_ops.GpuContiguous
0.0% 100.0% 0.000s 0.00e+00s C 896 128 theano.compile.ops.Shape_i
0.0% 100.0% 0.000s 0.00e+00s C 644 92 theano.sandbox.cuda.dnn.GpuDnnConvDesc
0.0% 100.0% 0.000s 0.00e+00s C 518 74 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.000s 0.00e+00s C 91 13 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.000s 0.00e+00s C 28 4 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 100.0% 0.000s 0.00e+00s C 14 2 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 100.0% 0.000s 0.00e+00s C 14 2 theano.sandbox.cuda.blas.GpuDot22
0.0% 100.0% 0.000s 0.00e+00s C 7 1 theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx
... (remaining 3 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
15.6% 15.6% 0.810s 8.57e-04s C 945 135 GpuElemwise{sub,no_inplace}
9.2% 24.8% 0.479s 1.43e-03s C 336 48 GpuAllocEmpty
8.8% 33.6% 0.457s 2.18e-03s C 210 30 GpuElemwise{sgn,no_inplace}
7.9% 41.5% 0.411s 1.95e-03s C 210 30 GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}
7.3% 48.8% 0.380s 1.75e-03s C 217 31 GpuElemwise{Composite{((i0 * i1 * i2) + (i0 * i1 * i2 * i3))},no_inplace}
6.4% 55.2% 0.333s 1.53e-03s C 217 31 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}
5.8% 61.0% 0.300s 1.38e-03s C 217 31 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
5.4% 66.4% 0.282s 1.30e-03s C 217 31 GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}
4.2% 70.7% 0.220s 1.01e-03s C 217 31 GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}
3.9% 74.6% 0.203s 1.81e-03s C 112 16 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
3.6% 78.2% 0.188s 1.73e-04s C 1085 155 GpuCAReduce{add}{1,0,1,1}
3.2% 81.4% 0.165s 1.57e-03s C 105 15 GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}
2.4% 83.8% 0.125s 5.96e-04s C 210 30 GpuDnnConvGradI{algo='none', inplace=True}
2.4% 86.2% 0.125s 1.16e-04s C 1078 154 GpuElemwise{mul,no_inplace}
2.1% 88.3% 0.109s 5.04e-04s C 217 31 GpuDnnConvGradW{algo='none', inplace=True}
1.8% 90.1% 0.094s 4.32e-04s C 217 31 GpuCAReduce{pre=sqr,red=add}{1,0,1,1}
1.2% 91.3% 0.062s 1.10e-04s C 567 81 GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}
0.9% 92.2% 0.047s 3.72e-04s C 126 18 GpuElemwise{Sub}[(0, 1)]
0.9% 93.1% 0.047s 7.20e-05s C 651 93 GpuElemwise{Composite{(i0 + (i1 - i2))},no_inplace}
0.9% 94.0% 0.047s 2.31e-04s C 203 29 GpuElemwise{Composite{(i0 * ((i1 * i2) + (i3 * i4)))},no_inplace}
... (remaining 63 Ops account for 6.02%(0.31s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
1.2% 1.2% 0.064s 9.10e-03s 7 651 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.2% 2.4% 0.063s 8.93e-03s 7 2225 GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0, GpuElemwise{sgn,no_inplace}.0, GpuDimShuffle{x,0,x,x}.0)
1.2% 3.6% 0.063s 8.93e-03s 7 956 GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
1.2% 4.8% 0.063s 8.93e-03s 7 885 GpuElemwise{sgn,no_inplace}(GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}.0)
0.9% 5.8% 0.049s 6.96e-03s 7 1159 GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
0.9% 6.7% 0.047s 6.73e-03s 7 2144 GpuElemwise{Composite{((i0 * i1 * i2) + (i0 * i1 * i2 * i3))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{sgn,no_inplace}.0)
0.9% 7.6% 0.047s 6.70e-03s 7 2034 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
0.9% 8.5% 0.047s 6.70e-03s 7 1979 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
0.9% 9.4% 0.047s 6.70e-03s 7 2117 GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0, GpuElemwise{sgn,no_inplace}.0, GpuDimShuffle{x,0,x,x}.0)
0.9% 10.3% 0.047s 6.70e-03s 7 618 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
0.9% 11.2% 0.047s 6.70e-03s 7 594 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
0.9% 12.1% 0.047s 6.70e-03s 7 2244 GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}(GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}.0, GpuElemwise{Composite{((((-i0) / i1) / i2) / i3)},no_inplace}.0, CudaNdarrayConstant{[[[[-1.]]]]}, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
0.9% 13.0% 0.047s 6.70e-03s 7 2145 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
0.9% 13.9% 0.047s 6.70e-03s 7 1779 GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}(GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}.0, GpuElemwise{Composite{((((-i0) / i1) / i2) / i3)},no_inplace}.0, CudaNdarrayConstant{[[[[-1.]]]]}, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
0.9% 14.8% 0.047s 6.70e-03s 7 906 GpuElemwise{sgn,no_inplace}(GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}.0)
0.9% 15.7% 0.047s 6.70e-03s 7 2138 GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}(GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}.0, GpuElemwise{Composite{((((-i0) / i1) / i2) / i3)},no_inplace}.0, CudaNdarrayConstant{[[[[-1.]]]]}, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)}}[(0, 0)].0)
0.9% 16.6% 0.047s 6.70e-03s 7 1025 GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}(GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x,0,x,x}.0, GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}.0)
0.9% 17.5% 0.047s 6.70e-03s 7 936 GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
0.9% 18.4% 0.047s 6.70e-03s 7 976 GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
0.9% 19.3% 0.047s 6.70e-03s 7 2062 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
... (remaining 2281 Apply instances account for 80.69%(4.19s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: Deep_Residual_Learning_CIFAR-10.py:258
Time in 40 calls to Function.__call__: 1.340600e+01s
Time in Function.fn.__call__: 1.340600e+01s (100.000%)
Time in thunks: 9.471744e+00s (70.653%)
Total compile time: 5.047000e+00s
Number of Apply nodes: 597
Theano Optimizer time: 4.375000e+00s
Theano validate time: 2.189999e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 5.779998e-01s
Import time 1.600003e-02s
Time in all call to theano.grad() 1.765000e+00s
Time since theano import 99.867s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.5% 71.5% 6.776s 5.46e-03s C 1240 31 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
12.5% 84.1% 1.188s 9.58e-04s C 1240 31 theano.sandbox.cuda.dnn.GpuDnnConv
9.6% 93.7% 0.907s 3.49e-04s C 2600 65 theano.sandbox.cuda.basic_ops.GpuElemwise
2.9% 96.5% 0.273s 3.41e-03s C 80 2 theano.sandbox.cuda.basic_ops.GpuAlloc
2.6% 99.2% 0.250s 1.56e-03s C 160 4 theano.sandbox.cuda.basic_ops.GpuFromHost
0.3% 99.5% 0.031s 3.91e-04s C 80 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.2% 99.7% 0.016s 1.09e-05s C 1440 36 theano.tensor.opt.MakeVector
0.2% 99.8% 0.016s 1.95e-04s C 80 2 theano.sandbox.cuda.basic_ops.HostFromGpu
0.2% 100.0% 0.016s 1.95e-04s C 80 2 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.0% 100.0% 0.000s 0.00e+00s C 5120 128 theano.compile.ops.Shape_i
0.0% 100.0% 0.000s 0.00e+00s C 5000 125 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.000s 0.00e+00s C 2720 68 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.000s 0.00e+00s C 2480 62 theano.sandbox.cuda.basic_ops.GpuContiguous
0.0% 100.0% 0.000s 0.00e+00s C 1240 31 theano.sandbox.cuda.dnn.GpuDnnConvDesc
0.0% 100.0% 0.000s 0.00e+00s C 80 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.000s 0.00e+00s C 40 1 theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
0.0% 100.0% 0.000s 0.00e+00s C 40 1 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 100.0% 0.000s 0.00e+00s C 40 1 theano.tensor.elemwise.DimShuffle
0.0% 100.0% 0.000s 0.00e+00s C 40 1 theano.tensor.basic.MaxAndArgmax
0.0% 100.0% 0.000s 0.00e+00s C 40 1 theano.sandbox.cuda.blas.GpuDot22
... (remaining 1 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.5% 71.5% 6.776s 5.46e-03s C 1240 31 GpuAllocEmpty
12.5% 84.1% 1.188s 9.58e-04s C 1240 31 GpuDnnConv{algo='small', inplace=True}
4.6% 88.7% 0.438s 6.84e-04s C 640 16 GpuElemwise{Composite{(i0 * (Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4) + Abs(Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4))))}}[(0, 1)]
4.5% 93.2% 0.422s 7.03e-04s C 600 15 GpuElemwise{Composite{(i0 * (Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5) + Abs(Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5))))}}[(0, 1)]
2.9% 96.0% 0.273s 3.41e-03s C 80 2 GpuAlloc{memset_0=True}
2.6% 98.7% 0.250s 1.56e-03s C 160 4 GpuFromHost
0.5% 99.2% 0.047s 3.78e-05s C 1240 31 GpuElemwise{mul,no_inplace}
0.3% 99.5% 0.031s 3.91e-04s C 80 2 GpuIncSubtensor{InplaceSet;::, int64:int64:, int64:int64:, int64:int64:}
0.2% 99.7% 0.016s 1.09e-05s C 1440 36 MakeVector{dtype='int64'}
0.2% 99.8% 0.016s 1.95e-04s C 80 2 HostFromGpu
0.2% 100.0% 0.016s 3.90e-04s C 40 1 GpuCAReduce{add}{0,0,1}
0.0% 100.0% 0.000s 0.00e+00s C 4960 124 GpuDimShuffle{x,0,x,x}
0.0% 100.0% 0.000s 0.00e+00s C 2480 62 GpuContiguous
0.0% 100.0% 0.000s 0.00e+00s C 2320 58 Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}
0.0% 100.0% 0.000s 0.00e+00s C 1280 32 Shape_i{2}
0.0% 100.0% 0.000s 0.00e+00s C 1280 32 Shape_i{0}
0.0% 100.0% 0.000s 0.00e+00s C 1280 32 Shape_i{3}
0.0% 100.0% 0.000s 0.00e+00s C 1280 32 Shape_i{1}
0.0% 100.0% 0.000s 0.00e+00s C 1160 29 GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='cross', precision='float32'}
0.0% 100.0% 0.000s 0.00e+00s C 160 4 Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}
... (remaining 19 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
4.3% 4.3% 0.408s 1.02e-02s 40 426 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
4.2% 8.5% 0.400s 1.00e-02s 40 430 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
4.2% 12.7% 0.397s 9.93e-03s 40 422 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
4.0% 16.8% 0.382s 9.55e-03s 40 424 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.9% 20.6% 0.366s 9.14e-03s 40 442 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.8% 24.4% 0.363s 9.07e-03s 40 436 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.8% 28.3% 0.362s 9.04e-03s 40 438 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.8% 32.1% 0.360s 8.99e-03s 40 432 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.7% 35.8% 0.351s 8.77e-03s 40 428 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.5% 39.3% 0.332s 8.31e-03s 40 434 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
3.4% 42.6% 0.318s 7.96e-03s 40 440 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.5% 45.1% 0.236s 5.89e-03s 40 462 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.5% 47.6% 0.235s 5.86e-03s 40 4 GpuFromHost(inputs)
2.2% 49.8% 0.206s 5.16e-03s 40 452 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.2% 51.9% 0.205s 5.12e-03s 40 458 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.0% 54.0% 0.192s 4.81e-03s 40 454 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.0% 55.9% 0.187s 4.69e-03s 40 450 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.9% 57.8% 0.179s 4.48e-03s 40 289 GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Shape_i{0}.0, TensorConstant{32}, TensorConstant{16}, TensorConstant{16})
1.9% 59.7% 0.177s 4.43e-03s 40 444 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.8% 61.5% 0.174s 4.34e-03s 40 446 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0)
... (remaining 577 Apply instances account for 38.46%(3.64s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 47 calls to Function.__call__: 2.086000e+01s
Time in Function.fn.__call__: 2.086000e+01s (100.000%)
Time in thunks: 1.466729e+01s (70.313%)
Total compile time: 6.165800e+01s
Number of Apply nodes: 2301
Theano Optimizer time: 5.476600e+01s
Theano validate time: 5.234000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 4.375000e+00s
Import time 3.120000e-01s
Time in all call to theano.grad() 1.765000e+00s
Time since theano import 99.945s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
49.5% 49.5% 7.255s 4.60e-03s C 1576 79 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
33.6% 83.1% 4.936s 5.00e-04s C 9866 1103 theano.sandbox.cuda.basic_ops.GpuElemwise
8.3% 91.4% 1.219s 8.37e-04s C 1457 62 theano.sandbox.cuda.dnn.GpuDnnConv
2.1% 93.6% 0.313s 1.69e-04s C 1844 254 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.1% 95.6% 0.304s 3.01e-03s C 101 5 theano.sandbox.cuda.basic_ops.GpuAlloc
2.0% 97.7% 0.297s 1.42e-03s C 209 11 theano.sandbox.cuda.basic_ops.GpuFromHost
0.9% 98.5% 0.125s 5.96e-04s C 210 30 theano.sandbox.cuda.dnn.GpuDnnConvGradI
0.7% 99.3% 0.109s 5.04e-04s C 217 31 theano.sandbox.cuda.dnn.GpuDnnConvGradW
0.3% 99.6% 0.047s 4.34e-04s C 108 6 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.2% 99.8% 0.031s 1.47e-05s C 2119 133 theano.tensor.opt.MakeVector
0.1% 99.9% 0.016s 2.19e-06s C 7128 429 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.1% 100.0% 0.016s 1.45e-04s C 108 6 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 100.0% 0.000s 0.00e+00s C 6016 256 theano.compile.ops.Shape_i
0.0% 100.0% 0.000s 0.00e+00s C 3439 199 theano.sandbox.cuda.basic_ops.GpuContiguous
0.0% 100.0% 0.000s 0.00e+00s C 3238 142 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.000s 0.00e+00s C 1884 123 theano.sandbox.cuda.dnn.GpuDnnConvDesc
0.0% 100.0% 0.000s 0.00e+00s C 171 15 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.000s 0.00e+00s C 54 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 100.0% 0.000s 0.00e+00s C 54 3 theano.sandbox.cuda.blas.GpuDot22
0.0% 100.0% 0.000s 0.00e+00s C 47 2 theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
... (remaining 6 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
49.5% 49.5% 7.255s 4.60e-03s C 1576 79 GpuAllocEmpty
8.3% 57.8% 1.219s 8.37e-04s C 1457 62 GpuDnnConv{algo='small', inplace=True}
5.5% 63.3% 0.810s 8.57e-04s C 945 135 GpuElemwise{sub,no_inplace}
3.1% 66.4% 0.457s 2.18e-03s C 210 30 GpuElemwise{sgn,no_inplace}
3.0% 69.4% 0.438s 6.84e-04s C 640 16 GpuElemwise{Composite{(i0 * (Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4) + Abs(Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4))))}}[(0, 1)]
2.9% 72.3% 0.422s 7.03e-04s C 600 15 GpuElemwise{Composite{(i0 * (Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5) + Abs(Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5))))}}[(0, 1)]
2.8% 75.1% 0.411s 1.95e-03s C 210 30 GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}
2.6% 77.7% 0.380s 1.75e-03s C 217 31 GpuElemwise{Composite{((i0 * i1 * i2) + (i0 * i1 * i2 * i3))},no_inplace}
2.3% 79.9% 0.333s 1.53e-03s C 217 31 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}
2.0% 82.0% 0.300s 1.38e-03s C 217 31 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
2.0% 84.0% 0.297s 1.42e-03s C 209 11 GpuFromHost
2.0% 86.0% 0.289s 3.07e-03s C 94 4 GpuAlloc{memset_0=True}
1.9% 87.9% 0.282s 1.30e-03s C 217 31 GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}
1.5% 89.4% 0.220s 1.01e-03s C 217 31 GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}
1.4% 90.8% 0.203s 1.81e-03s C 112 16 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
1.3% 92.1% 0.188s 1.73e-04s C 1085 155 GpuCAReduce{add}{1,0,1,1}
1.2% 93.2% 0.172s 7.42e-05s C 2318 185 GpuElemwise{mul,no_inplace}
1.1% 94.4% 0.165s 1.57e-03s C 105 15 GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}
0.9% 95.2% 0.125s 5.96e-04s C 210 30 GpuDnnConvGradI{algo='none', inplace=True}
0.7% 96.0% 0.109s 5.04e-04s C 217 31 GpuDnnConvGradW{algo='none', inplace=True}
... (remaining 73 Ops account for 4.05%(0.59s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
2.8% 2.8% 0.408s 1.02e-02s 40 426 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.7% 5.5% 0.400s 1.00e-02s 40 430 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.7% 8.2% 0.397s 9.93e-03s 40 422 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.6% 10.8% 0.382s 9.55e-03s 40 424 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.5% 13.3% 0.366s 9.14e-03s 40 442 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.5% 15.8% 0.363s 9.07e-03s 40 436 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.5% 18.3% 0.362s 9.04e-03s 40 438 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.5% 20.7% 0.360s 8.99e-03s 40 432 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.4% 23.1% 0.351s 8.77e-03s 40 428 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.3% 25.4% 0.332s 8.31e-03s 40 434 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
2.2% 27.5% 0.318s 7.96e-03s 40 440 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.6% 29.1% 0.236s 5.89e-03s 40 462 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.6% 30.7% 0.235s 5.86e-03s 40 4 GpuFromHost(inputs)
1.4% 32.1% 0.206s 5.16e-03s 40 452 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.4% 33.5% 0.205s 5.12e-03s 40 458 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.3% 34.9% 0.192s 4.81e-03s 40 454 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.3% 36.1% 0.187s 4.69e-03s 40 450 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.2% 37.4% 0.179s 4.48e-03s 40 289 GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Shape_i{0}.0, TensorConstant{32}, TensorConstant{16}, TensorConstant{16})
1.2% 38.6% 0.177s 4.43e-03s 40 444 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
1.2% 39.7% 0.174s 4.34e-03s 40 446 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0)
... (remaining 2878 Apply instances account for 60.26%(8.84s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Lasagne's CNN MNIST example on my machine takes 9 seconds per epoch with cuDNN, and 17 without. I could not find a reference time, so I can only guess that this is as expected.
As I said above (https://github.com/Lasagne/Recipes/issues/57#issuecomment-211312157), python examples/mnist.py cnn 10
should take under one second per epoch on a Titan X. On my desktop (GT 640), the Lasagne MNIST example with the CNN architecture takes 8.7s per epoch, and on a GTX 780 Ti it takes 1.1s per epoch. So 9 seconds is far too much for your hardware. Are you the only one running code on that GPU, or are there other experiments running on it concurrently?
I ran the profiling (1 epoch, using 1,000 instead of all 100,000 training examples), and looks like it only uses GPU operations:
There are no CPU Ops, right. But the runtime is dominated by GpuAllocEmpty -- this is not expected.
THEANO_FLAGS=lib.cnmem=0.3,profile=1 CUDA_LAUNCH_BLOCKING=1 python Deep_Residual_Learning_CIFAR-10.py
THEANO_FLAGS=profile=1 CUDA_LAUNCH_BLOCKING=1 python examples/mnist.py cnn 10
Wow, that was it: after adding the cnmem=0.3 flag, the time per epoch went down to 95 seconds for CIFAR-10 and to 0.4 seconds for MNIST. Thanks a lot! Is there any profiling you want me to do?
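As an aside, THEANO_FLAGS is a comma-separated list of key=value pairs (nested config sections use dots, as in lib.cnmem). A tiny parser sketch of my own for sanity-checking a flag string for typos before launching a long job; Theano itself of course does more (type coercion, validation):

```python
def parse_theano_flags(flags):
    """Split a THEANO_FLAGS string like 'lib.cnmem=0.3,profile=1' into a dict."""
    result = {}
    for item in flags.split(','):
        if not item:
            continue
        key, sep, value = item.partition('=')
        if not sep:
            raise ValueError('malformed flag (no "="): %r' % item)
        result[key.strip()] = value.strip()
    return result

print(parse_theano_flags('lib.cnmem=0.3,allow_gc=1,profile=1'))
# {'lib.cnmem': '0.3', 'allow_gc': '1', 'profile': '1'}
```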
Is there any profiling you want me to do?
I'd still like to see the profile for the MNIST example without cnMeM, if possible:
CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=lib.cnmem=0,allow_gc=1,profile=1 python examples/mnist.py cnn 10
For me, this increases the per-epoch time from 1.1s to 1.6s, but not to 9s as on your machine. There's still something fishy!
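When comparing runs like this, it helps to average the per-epoch times instead of eyeballing them. A small helper of my own that pulls them out of a training log in the format mnist.py prints ("Epoch N of M took Xs"):

```python
import re

def mean_epoch_time(log_text):
    """Average the per-epoch wall times from lines like 'Epoch 1 of 10 took 9.963s'."""
    times = [float(m.group(1))
             for m in re.finditer(r'Epoch \d+ of \d+ took ([\d.]+)s', log_text)]
    if not times:
        raise ValueError('no epoch lines found')
    return sum(times) / len(times)

log = """Epoch 1 of 3 took 9.963s
Epoch 2 of 3 took 9.747s
Epoch 3 of 3 took 9.661s"""
print(round(mean_epoch_time(log), 3))  # 9.79
```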
OMG, I didn't even notice the cnmem flag before... I am also using Ubuntu 14.04, the latest versions of Theano and Lasagne, a TITAN X, and cuDNN v5.
When cnmem is disabled, the time per epoch for Deep_Residual_Learning_CIFAR-10.py is 170 seconds.
When cnmem=0.3, the time per epoch goes down to 104 seconds, a huge speedup!
@PatrickBue, you get 95 seconds per epoch, faster than me...
It's better to enable cnmem when training CNNs, am I right? @f0k
@kli-nlpr: Try these flags in your .theanorc; they bring the 95 seconds per epoch down to 82 on my system:
[dnn.conv]
algo_bwd_filter = time_once
algo_bwd_data = time_once
algo_fwd = time_once
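Putting the thread's suggestions together (cnmem and allow_gc from the discussion above, plus the [dnn.conv] autotuning flags just shown), a complete ~/.theanorc could look like the sketch below; the values are the ones discussed in this thread, and the memory fraction should be adjusted to your own setup:

```ini
[global]
device = gpu
floatX = float32
# keep temporaries allocated between function calls (minor speedup)
allow_gc = False

[lib]
# let CNMeM pre-allocate 30% of GPU memory
cnmem = 0.3

[dnn.conv]
# benchmark cuDNN algorithms once per shape, then reuse the fastest
algo_bwd_filter = time_once
algo_bwd_data = time_once
algo_fwd = time_once
```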
OMG, I didn't even notice the cnmem flag before...
It should be enabled by default in the future, but the Theano developers haven't done that yet.
It's better to enable cnmem when training CNNs, am I right?
Yes, as we recommend here: https://github.com/Lasagne/Lasagne/wiki/From-Zero-to-Lasagne-on-Ubuntu-14.04#cudnn
If you can spare the memory, you can also set allow_gc=0 for some minor extra speedup. It will leave temporaries allocated between Ops and function calls instead of deallocating and reallocating them each time. With CNMeM, allocation and deallocation are a lot faster, so it doesn't make much of a difference, but it can still improve performance.
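The allocation pattern that allow_gc=0 targets can be illustrated outside Theano: reusing one preallocated buffer across iterations instead of allocating a fresh one each time. A plain-Python sketch of the idea only (Theano's actual mechanism is different; this just contrasts the two patterns):

```python
def run_realloc(n_iters, size):
    # allocate a fresh scratch buffer every iteration (allow_gc=1 behaviour)
    total = 0
    for i in range(n_iters):
        buf = bytearray(size)   # new allocation each "function call"
        buf[0] = i % 256
        total += buf[0]
    return total

def run_reuse(n_iters, size):
    # allocate once, reuse across iterations (allow_gc=0 behaviour)
    buf = bytearray(size)
    total = 0
    for i in range(n_iters):
        buf[0] = i % 256        # buffer stays alive between "calls"
        total += buf[0]
    return total

# both strategies compute the same result; only allocation traffic differs
print(run_realloc(200, 4096) == run_reuse(200, 4096))  # True
```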
@f0k, this is the profile output for mnist.py using the flags you suggested:
Loading data...
Building model and compiling functions...
Starting training...
Epoch 1 of 10 took 9.963s
training loss: 1.544144
validation loss: 0.337871
validation accuracy: 91.11 %
Epoch 2 of 10 took 9.747s
training loss: 0.496699
validation loss: 0.178585
validation accuracy: 95.02 %
Epoch 3 of 10 took 9.661s
training loss: 0.338563
validation loss: 0.131325
validation accuracy: 96.22 %
Epoch 4 of 10 took 9.877s
training loss: 0.277647
validation loss: 0.113841
validation accuracy: 96.78 %
Epoch 5 of 10 took 9.854s
training loss: 0.240854
validation loss: 0.098375
validation accuracy: 97.19 %
Epoch 6 of 10 took 9.697s
training loss: 0.215437
validation loss: 0.087286
validation accuracy: 97.51 %
Epoch 7 of 10 took 9.794s
training loss: 0.191024
validation loss: 0.079255
validation accuracy: 97.65 %
Epoch 8 of 10 took 9.742s
training loss: 0.179124
validation loss: 0.072846
validation accuracy: 97.88 %
Epoch 9 of 10 took 9.578s
training loss: 0.168340
validation loss: 0.068834
validation accuracy: 97.94 %
Epoch 10 of 10 took 9.868s
training loss: 0.156605
validation loss: 0.067654
validation accuracy: 98.06 %
Final results:
test loss: 0.061727
test accuracy: 97.99 %
DEBUG: nvcc STDOUT mod.cu
Creating library C:/Users/pabuehle/AppData/Local/Theano/compiledir_Windows-10-10.0.10586-Intel64_Family_6_Model_63_Stepping_2_GenuineIntel-2.7.11-64/tmpom4ivu/265abc51f7c376c224983485238ff1a5.lib and object C:/Users/pabuehle/AppData/Local/Theano/compiledir_Windows-10-10.0.10586-Intel64_Family_6_Model_63_Stepping_2_GenuineIntel-2.7.11-64/tmpom4ivu/265abc51f7c376c224983485238ff1a5.exp
Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5005)
Function profiling
==================
Message: mnist.py:283
Time in 1000 calls to Function.__call__: 9.057300e+01s
Time in Function.fn.__call__: 9.040100e+01s (99.810%)
Time in thunks: 6.183903e+01s (68.275%)
Total compile time: 1.594000e+00s
Number of Apply nodes: 124
Theano Optimizer time: 1.047000e+00s
Theano validate time: 7.800007e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 5.000000e-01s
Import time 3.580000e-01s
Time in all call to theano.grad() 3.100014e-02s
Time since theano import 113.055s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
29.0% 29.0% 17.947s 5.13e-04s C 35000 35 theano.sandbox.cuda.basic_ops.GpuElemwise
23.8% 52.8% 14.699s 2.94e-03s C 5000 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
23.2% 75.9% 14.317s 7.16e-03s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnPoolGrad
6.7% 82.6% 4.112s 2.06e-03s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnPool
3.5% 86.1% 2.142s 2.14e-03s C 1000 1 theano.sandbox.cuda.dnn.GpuDnnConvGradI
2.8% 88.9% 1.752s 8.76e-04s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnConvGradW
2.5% 91.4% 1.535s 3.84e-04s C 4000 4 theano.sandbox.cuda.blas.GpuDot22
2.5% 93.8% 1.531s 7.66e-04s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnConv
2.1% 95.9% 1.298s 2.60e-04s C 5000 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.3% 97.3% 0.815s 4.07e-04s C 2000 2 theano.sandbox.cuda.blas.GpuDot22Scalar
1.2% 98.5% 0.737s 2.46e-04s C 3000 3 theano.sandbox.cuda.basic_ops.GpuFromHost
0.9% 99.4% 0.564s 2.82e-04s C 2000 2 theano.sandbox.rng_mrg.GPU_mrg_uniform
0.2% 99.5% 0.109s 1.09e-04s C 1000 1 theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
0.2% 99.7% 0.094s 9.38e-05s C 1000 1 theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx
0.1% 99.8% 0.078s 7.81e-05s C 1000 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.1% 99.9% 0.047s 9.38e-06s C 5000 5 theano.sandbox.cuda.dnn.GpuDnnConvDesc
0.1% 100.0% 0.047s 4.69e-06s C 10000 10 theano.tensor.opt.MakeVector
0.0% 100.0% 0.016s 2.23e-06s C 7000 7 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.000s 0.00e+00s C 13000 13 theano.compile.ops.Shape_i
0.0% 100.0% 0.000s 0.00e+00s C 11000 11 theano.sandbox.cuda.basic_ops.GpuContiguous
... (remaining 2 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
23.8% 23.8% 14.699s 2.94e-03s C 5000 5 GpuAllocEmpty
23.2% 46.9% 14.317s 7.16e-03s C 2000 2 GpuDnnPoolGrad{mode='max'}
19.7% 66.6% 12.188s 6.09e-03s C 2000 2 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
6.7% 73.3% 4.112s 2.06e-03s C 2000 2 GpuDnnPool{mode='max'}
3.5% 76.7% 2.142s 2.14e-03s C 1000 1 GpuDnnConvGradI{algo='none', inplace=True}
2.8% 79.6% 1.762s 5.87e-04s C 3000 3 GpuElemwise{Add}[(0, 0)]
2.8% 82.4% 1.752s 8.76e-04s C 2000 2 GpuDnnConvGradW{algo='none', inplace=True}
2.5% 84.9% 1.535s 3.84e-04s C 4000 4 GpuDot22
2.5% 87.4% 1.531s 7.66e-04s C 2000 2 GpuDnnConv{algo='small', inplace=True}
2.2% 89.6% 1.391s 6.96e-04s C 2000 2 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)]
1.6% 91.3% 1.016s 5.08e-04s C 2000 2 GpuCAReduce{add}{1,0,1,1}
1.3% 92.6% 0.815s 4.07e-04s C 2000 2 GpuDot22Scalar
1.2% 93.8% 0.737s 2.46e-04s C 3000 3 GpuFromHost
1.0% 94.7% 0.595s 5.95e-04s C 1000 1 GpuElemwise{mul,no_inplace}
0.9% 95.6% 0.531s 6.64e-05s C 8000 8 GpuElemwise{Composite{((i0 * i1) + (i2 - i3))}}[(0, 2)]
0.9% 96.5% 0.531s 6.64e-05s C 8000 8 GpuElemwise{Composite{((i0 * i1) - i2)}}[(0, 1)]
0.7% 97.2% 0.423s 4.23e-04s C 1000 1 GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
0.5% 97.7% 0.330s 3.30e-04s C 1000 1 GpuElemwise{Composite{((i0 + Abs(i0)) * i1)},no_inplace}
0.5% 98.2% 0.306s 6.11e-05s C 5000 5 GpuElemwise{Mul}[(0, 1)]
0.4% 98.6% 0.235s 1.17e-04s C 2000 2 GpuCAReduce{add}{1,0}
... (remaining 27 Ops account for 1.44%(0.89s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
21.1% 21.1% 13.061s 1.31e-02s 1000 114 GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
17.2% 38.3% 10.613s 1.06e-02s 1000 46 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
16.9% 55.2% 10.458s 1.05e-02s 1000 56 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
5.8% 61.0% 3.596s 3.60e-03s 1000 60 GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
4.3% 65.3% 2.659s 2.66e-03s 1000 33 GpuAllocEmpty(Shape_i{0}.0, Shape_i{1}.0, TensorConstant{12}, TensorConstant{12})
3.5% 68.8% 2.142s 2.14e-03s 1000 108 GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
2.8% 71.6% 1.730s 1.73e-03s 1000 64 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
2.3% 73.9% 1.438s 1.44e-03s 1000 53 GpuElemwise{Add}[(0, 0)](GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
2.2% 76.1% 1.365s 1.36e-03s 1000 47 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
2.0% 78.1% 1.256s 1.26e-03s 1000 102 GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
1.9% 80.1% 1.188s 1.19e-03s 1000 115 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnPoolGrad{mode='max'}.0, GpuElemwise{Add}[(0, 0)].0)
1.5% 81.5% 0.907s 9.07e-04s 1000 119 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
1.5% 83.0% 0.906s 9.06e-04s 1000 62 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
1.4% 84.4% 0.891s 8.91e-04s 1000 116 GpuCAReduce{add}{1,0,1,1}(GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)].0)
1.4% 85.8% 0.845s 8.45e-04s 1000 107 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
1.2% 87.0% 0.722s 7.22e-04s 1000 8 GpuFromHost(inputs)
1.2% 88.1% 0.720s 7.20e-04s 1000 70 GpuDot22(GpuReshape{2}.0, W)
1.0% 89.1% 0.625s 6.25e-04s 1000 50 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
1.0% 90.1% 0.595s 5.95e-04s 1000 67 GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[[[[ 2.]]]]}, GpuDnnPool{mode='max'}.0, GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)].0)
0.9% 91.0% 0.565s 5.65e-04s 1000 91 GpuDot22Scalar(GpuDimShuffle{1,0}.0, GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 0)].0, TensorConstant{0.00999999977648})
... (remaining 104 Apply instances account for 8.99%(5.56s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: mnist.py:286
Time in 220 calls to Function.__call__: 6.842000e+00s
Time in Function.fn.__call__: 6.810000e+00s (99.532%)
Time in thunks: 4.580976e+00s (66.954%)
Total compile time: 4.219999e-01s
Number of Apply nodes: 60
Theano Optimizer time: 3.130000e-01s
Theano validate time: 1.600003e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 7.800007e-02s
Import time 0.000000e+00s
Time in all call to theano.grad() 3.100014e-02s
Time since theano import 113.055s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
57.5% 57.5% 2.635s 5.99e-03s C 440 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
19.8% 77.4% 0.909s 2.07e-03s C 440 2 theano.sandbox.cuda.dnn.GpuDnnPool
8.2% 85.5% 0.375s 3.41e-04s C 1100 5 theano.sandbox.cuda.basic_ops.GpuElemwise
4.6% 90.1% 0.209s 3.17e-04s C 660 3 theano.sandbox.cuda.basic_ops.GpuFromHost
4.1% 94.2% 0.188s 4.26e-04s C 440 2 theano.sandbox.cuda.dnn.GpuDnnConv
3.8% 98.0% 0.172s 3.91e-04s C 440 2 theano.sandbox.cuda.blas.GpuDot22
0.7% 98.6% 0.031s 1.42e-04s C 220 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.3% 99.0% 0.016s 7.10e-05s C 220 1 theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
0.3% 99.3% 0.016s 1.78e-05s C 880 4 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 99.7% 0.016s 3.55e-05s C 440 2 theano.sandbox.cuda.basic_ops.HostFromGpu
0.3% 100.0% 0.016s 7.10e-05s C 220 1 theano.tensor.basic.MaxAndArgmax
0.0% 100.0% 0.000s 0.00e+00s C 2640 12 theano.compile.ops.Shape_i
0.0% 100.0% 0.000s 0.00e+00s C 1760 8 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.000s 0.00e+00s C 1320 6 theano.sandbox.cuda.basic_ops.GpuContiguous
0.0% 100.0% 0.000s 0.00e+00s C 1100 5 theano.tensor.opt.MakeVector
0.0% 100.0% 0.000s 0.00e+00s C 440 2 theano.sandbox.cuda.dnn.GpuDnnConvDesc
0.0% 100.0% 0.000s 0.00e+00s C 220 1 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 100.0% 0.000s 0.00e+00s C 220 1 theano.tensor.elemwise.Sum
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
57.5% 57.5% 2.635s 5.99e-03s C 440 2 GpuAllocEmpty
19.8% 77.4% 0.909s 2.07e-03s C 440 2 GpuDnnPool{mode='max'}
7.5% 84.9% 0.344s 5.21e-04s C 660 3 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
4.6% 89.4% 0.209s 3.17e-04s C 660 3 GpuFromHost
4.1% 93.5% 0.188s 4.26e-04s C 440 2 GpuDnnConv{algo='small', inplace=True}
3.8% 97.3% 0.172s 3.91e-04s C 440 2 GpuDot22
0.7% 98.0% 0.031s 1.42e-04s C 220 1 GpuCAReduce{add}{1}
0.7% 98.6% 0.031s 1.42e-04s C 220 1 GpuElemwise{Add}[(0, 0)]
0.3% 99.0% 0.016s 7.10e-05s C 220 1 GpuCrossentropySoftmaxArgmax1HotWithBias
0.3% 99.3% 0.016s 3.55e-05s C 440 2 HostFromGpu
0.3% 99.7% 0.016s 3.55e-05s C 440 2 GpuDimShuffle{x,0}
0.3% 100.0% 0.016s 7.10e-05s C 220 1 MaxAndArgmax
0.0% 100.0% 0.000s 0.00e+00s C 1320 6 GpuContiguous
0.0% 100.0% 0.000s 0.00e+00s C 1100 5 MakeVector{dtype='int64'}
0.0% 100.0% 0.000s 0.00e+00s C 660 3 Shape_i{0}
0.0% 100.0% 0.000s 0.00e+00s C 660 3 Shape_i{3}
0.0% 100.0% 0.000s 0.00e+00s C 660 3 Shape_i{1}
0.0% 100.0% 0.000s 0.00e+00s C 660 3 Shape_i{2}
0.0% 100.0% 0.000s 0.00e+00s C 440 2 Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)]
0.0% 100.0% 0.000s 0.00e+00s C 440 2 GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}
... (remaining 9 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
51.7% 51.7% 2.367s 1.08e-02s 220 35 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
18.4% 70.1% 0.845s 3.84e-03s 220 40 GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
6.8% 76.9% 0.313s 1.42e-03s 220 38 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
5.8% 82.8% 0.267s 1.22e-03s 220 36 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
4.6% 87.3% 0.209s 9.50e-04s 220 4 GpuFromHost(inputs)
3.4% 90.8% 0.156s 7.10e-04s 220 47 GpuDot22(GpuReshape{2}.0, W)
2.4% 93.1% 0.109s 4.97e-04s 220 37 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
1.7% 94.9% 0.078s 3.55e-04s 220 42 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
1.4% 96.2% 0.064s 2.91e-04s 220 45 GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
0.7% 96.9% 0.031s 1.42e-04s 220 52 GpuCAReduce{add}{1}(GpuCrossentropySoftmaxArgmax1HotWithBias.0)
0.7% 97.6% 0.031s 1.42e-04s 220 51 GpuElemwise{Add}[(0, 0)](GpuDot22.0, GpuDimShuffle{x,0}.0)
0.3% 98.0% 0.016s 7.10e-05s 220 50 GpuCrossentropySoftmaxArgmax1HotWithBias(GpuDot22.0, b, GpuFromHost.0)
0.3% 98.3% 0.016s 7.10e-05s 220 56 HostFromGpu(GpuElemwise{TrueDiv}[(0, 0)].0)
0.3% 98.6% 0.016s 7.10e-05s 220 49 GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, W)
0.3% 99.0% 0.016s 7.10e-05s 220 48 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
0.3% 99.3% 0.016s 7.10e-05s 220 7 GpuDimShuffle{x,0}(b)
0.3% 99.7% 0.016s 7.10e-05s 220 43 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
0.3% 100.0% 0.016s 7.10e-05s 220 55 MaxAndArgmax(HostFromGpu.0, TensorConstant{(1L,) of 1})
0.0% 100.0% 0.000s 0.00e+00s 220 59 Elemwise{TrueDiv}[(0, 0)](Sum{acc_dtype=int64}.0, Elemwise{Composite{Cast{float32}(Switch(EQ(i0, i1), ((i2 * i0) // i0), i0))}}.0)
0.0% 100.0% 0.000s 0.00e+00s 220 58 Sum{acc_dtype=int64}(Elemwise{eq,no_inplace}.0)
... (remaining 40 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 1220 calls to Function.__call__: 9.741500e+01s
Time in Function.fn.__call__: 9.721100e+01s (99.791%)
Time in thunks: 6.642000e+01s (68.183%)
Total compile time: 2.016000e+00s
Number of Apply nodes: 124
Theano Optimizer time: 1.360000e+00s
Theano validate time: 9.400010e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 5.780001e-01s
Import time 3.580000e-01s
Time in all call to theano.grad() 3.100014e-02s
Time since theano import 113.070s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
27.6% 27.6% 18.322s 5.08e-04s C 36100 40 theano.sandbox.cuda.basic_ops.GpuElemwise
26.1% 53.7% 17.334s 3.19e-03s C 5440 7 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
21.6% 75.2% 14.317s 7.16e-03s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnPoolGrad
7.6% 82.8% 5.022s 2.06e-03s C 2440 4 theano.sandbox.cuda.dnn.GpuDnnPool
3.2% 86.0% 2.142s 2.14e-03s C 1000 1 theano.sandbox.cuda.dnn.GpuDnnConvGradI
2.6% 88.7% 1.752s 8.76e-04s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnConvGradW
2.6% 91.2% 1.719s 7.04e-04s C 2440 4 theano.sandbox.cuda.dnn.GpuDnnConv
2.6% 93.8% 1.706s 3.84e-04s C 4440 6 theano.sandbox.cuda.blas.GpuDot22
2.0% 95.8% 1.329s 2.55e-04s C 5220 6 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.4% 97.2% 0.946s 2.59e-04s C 3660 6 theano.sandbox.cuda.basic_ops.GpuFromHost
1.2% 98.5% 0.815s 4.07e-04s C 2000 2 theano.sandbox.cuda.blas.GpuDot22Scalar
0.8% 99.3% 0.564s 2.82e-04s C 2000 2 theano.sandbox.rng_mrg.GPU_mrg_uniform
0.2% 99.5% 0.125s 1.02e-04s C 1220 2 theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
0.1% 99.6% 0.094s 9.38e-05s C 1000 1 theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx
0.1% 99.8% 0.094s 6.51e-05s C 1440 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.1% 99.9% 0.047s 8.62e-06s C 5440 7 theano.sandbox.cuda.dnn.GpuDnnConvDesc
0.1% 99.9% 0.047s 4.22e-06s C 11100 15 theano.tensor.opt.MakeVector
0.0% 100.0% 0.016s 1.76e-06s C 8880 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.016s 1.78e-06s C 8760 15 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.016s 7.10e-05s C 220 1 theano.tensor.basic.MaxAndArgmax
... (remaining 4 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
26.1% 26.1% 17.334s 3.19e-03s C 5440 7 GpuAllocEmpty
21.6% 47.7% 14.317s 7.16e-03s C 2000 2 GpuDnnPoolGrad{mode='max'}
18.4% 66.0% 12.188s 6.09e-03s C 2000 2 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
7.6% 73.6% 5.022s 2.06e-03s C 2440 4 GpuDnnPool{mode='max'}
3.2% 76.8% 2.142s 2.14e-03s C 1000 1 GpuDnnConvGradI{algo='none', inplace=True}
2.7% 79.5% 1.793s 5.57e-04s C 3220 4 GpuElemwise{Add}[(0, 0)]
2.6% 82.1% 1.752s 8.76e-04s C 2000 2 GpuDnnConvGradW{algo='none', inplace=True}
2.6% 84.7% 1.719s 7.04e-04s C 2440 4 GpuDnnConv{algo='small', inplace=True}
2.6% 87.3% 1.706s 3.84e-04s C 4440 6 GpuDot22
2.1% 89.4% 1.391s 6.96e-04s C 2000 2 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)]
1.5% 90.9% 1.016s 5.08e-04s C 2000 2 GpuCAReduce{add}{1,0,1,1}
1.4% 92.3% 0.946s 2.59e-04s C 3660 6 GpuFromHost
1.2% 93.6% 0.815s 4.07e-04s C 2000 2 GpuDot22Scalar
0.9% 94.5% 0.595s 5.95e-04s C 1000 1 GpuElemwise{mul,no_inplace}
0.8% 95.3% 0.531s 6.64e-05s C 8000 8 GpuElemwise{Composite{((i0 * i1) + (i2 - i3))}}[(0, 2)]
0.8% 96.1% 0.531s 6.64e-05s C 8000 8 GpuElemwise{Composite{((i0 * i1) - i2)}}[(0, 1)]
0.6% 96.7% 0.423s 4.23e-04s C 1000 1 GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
0.5% 97.2% 0.344s 5.21e-04s C 660 3 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
0.5% 97.7% 0.330s 3.30e-04s C 1000 1 GpuElemwise{Composite{((i0 + Abs(i0)) * i1)},no_inplace}
0.5% 98.2% 0.306s 6.11e-05s C 5000 5 GpuElemwise{Mul}[(0, 1)]
... (remaining 33 Ops account for 1.84%(1.22s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
19.7% 19.7% 13.061s 1.31e-02s 1000 114 GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
16.0% 35.6% 10.613s 1.06e-02s 1000 46 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
15.7% 51.4% 10.458s 1.05e-02s 1000 56 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
5.4% 56.8% 3.596s 3.60e-03s 1000 60 GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
4.0% 60.8% 2.659s 2.66e-03s 1000 33 GpuAllocEmpty(Shape_i{0}.0, Shape_i{1}.0, TensorConstant{12}, TensorConstant{12})
3.6% 64.4% 2.367s 1.08e-02s 220 35 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
3.2% 67.6% 2.142s 2.14e-03s 1000 108 GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
2.6% 70.2% 1.730s 1.73e-03s 1000 64 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
2.2% 72.4% 1.438s 1.44e-03s 1000 53 GpuElemwise{Add}[(0, 0)](GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
2.1% 74.4% 1.365s 1.36e-03s 1000 47 GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
1.9% 76.3% 1.256s 1.26e-03s 1000 102 GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
1.8% 78.1% 1.188s 1.19e-03s 1000 115 GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnPoolGrad{mode='max'}.0, GpuElemwise{Add}[(0, 0)].0)
1.4% 79.5% 0.907s 9.07e-04s 1000 119 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
1.4% 80.8% 0.906s 9.06e-04s 1000 62 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
1.3% 82.2% 0.891s 8.91e-04s 1000 116 GpuCAReduce{add}{1,0,1,1}(GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)].0)
1.3% 83.4% 0.845s 8.45e-04s 1000 107 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
1.3% 84.7% 0.845s 3.84e-03s 220 40 GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
1.1% 85.8% 0.722s 7.22e-04s 1000 8 GpuFromHost(inputs)
1.1% 86.9% 0.720s 7.20e-04s 1000 70 GpuDot22(GpuReshape{2}.0, W)
0.9% 87.8% 0.625s 6.25e-04s 1000 50 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
... (remaining 164 Apply instances account for 12.17%(8.09s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Hey guys, I aggressively changed cnmem to 0.8:
[global]
floatX = float32
device = gpu2
[dnn.conv]
algo_fwd = time_once
algo_bwd_data = time_once
algo_bwd_filter = time_once
[lib]
cnmem = .8
The time per epoch went down to 76 seconds!
This is crazy!
this is the profile output for mnist.py using the flags you suggested:
Interesting. The first function profile is the one for the training function. It seems that GpuAllocEmpty
is not the only culprit: it accounts for only 24% of the time, while you see a speedup of 95%. It rather looks like all Ops that do some allocation are affected. I can't reproduce this on my machine, so maybe this is specific to Windows.
@nouiz: Are there still plans to make lib.cnmem=.45
the default? It seems this is really critical on Windows.
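As an aside, when comparing such profiles it helps to total a column mechanically rather than by eye. A small sketch that sums the apply-time column for allocation-related classes in a profiler "Class" table (the regex and the pasted rows are illustrative assumptions based on the output format above):

```python
import re

# Minimal sketch: sum the <apply time> column for allocation-related
# classes in a Theano profile "Class" table, pasted as plain text.
profile = """\
29.0% 29.0% 17.947s 5.13e-04s C 35000 35 theano.sandbox.cuda.basic_ops.GpuElemwise
23.8% 52.8% 14.699s 2.94e-03s C 5000 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
23.2% 75.9% 14.317s 7.16e-03s C 2000 2 theano.sandbox.cuda.dnn.GpuDnnPoolGrad
"""

# Columns: <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
row = re.compile(r'([\d.]+)%\s+[\d.]+%\s+([\d.]+)s\s+\S+\s+\w\s+\d+\s+\d+\s+(\S+)')
alloc_time = sum(float(m.group(2))
                 for m in map(row.match, profile.splitlines())
                 if m and 'Alloc' in m.group(3))
print(alloc_time)  # time spent in GpuAllocEmpty: 14.699
```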
Hey guys, I aggressively changed cnmem to 0.8. The time per epoch went down to 76 seconds!
This should only make a difference for the very first epoch. If you set it to a smaller fraction of memory than ultimately needed, it will have to do extra allocations when it hits the boundary of reserved memory. After the first epoch, the memory pool should be large enough for all subsequent epochs so there won't be any more allocations (from the perspective of the driver). If you set it to a larger value than ultimately needed, your process blocks some GPU memory that could have been used by other processes (to run multiple experiments at the same time, on the same device).
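To illustrate the point, here is a toy model of a pooling allocator (purely illustrative, not Theano's actual cnMeM code): expensive driver allocations only happen while the reservation is still growing, so only the first epoch pays for them, regardless of the starting fraction.

```python
# Toy sketch of a cnMeM-style memory pool. The pool reserves an initial
# fraction up front and only goes back to the "driver" when a request
# exceeds what it has already reserved.
class ToyPool:
    def __init__(self, initial):
        self.reserved = initial   # bytes reserved from the driver up front
        self.used = 0
        self.driver_allocs = 0    # count of (slow) driver calls

    def alloc(self, nbytes):
        if self.used + nbytes > self.reserved:
            # pool exhausted: grow the reservation via the driver
            self.reserved = self.used + nbytes
            self.driver_allocs += 1
        self.used += nbytes

    def reset(self):
        # end of an epoch: memory returns to the pool, but the
        # reservation is kept, so later epochs reuse it for free
        self.used = 0

pool = ToyPool(initial=300)
for epoch in range(3):
    for nbytes in (100, 200, 150):  # same allocation pattern every epoch
        pool.alloc(nbytes)
    pool.reset()
print(pool.driver_allocs)  # -> 1: only the first epoch hits the driver
```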
You can try to add allow_gc = 0
to the [global]
section to see if it makes things a little bit faster still.
The new back-end has a different allocator that is enabled by default; it preallocates nothing at start-up, but it doesn't release memory back to the driver by default.
So the transition should fix this. There is a flag to make it reserve more memory at the start.
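For reference, with the new libgpuarray back-end (selected with device=cuda), the flag in question is, if I remember correctly, gpuarray.preallocate. A .theanorc sketch:

```
[global]
device = cuda
floatX = float32

[gpuarray]
preallocate = 0.45
```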
Adding allow_gc = 0 to the [global] section can still speed things up a bit.
[global]
allow_gc = 0
floatX = float32
device = gpu2
[dnn.conv]
algo_fwd = time_once
algo_bwd_data = time_once
algo_bwd_filter = time_once
[lib]
cnmem = .45
Now the time per epoch is down to 74 seconds! Cheers!
I am using Ubuntu 14.04 with a Titan X GPU and cuDNN v4. When I run the Deep_Residual_Learning_CIFAR-10.py script, the results are as follows.
Is this running speed normal?