karan6181 opened 6 years ago
We need to also investigate why CPU inference is fast while training is slow in CNNs.

Also, I found that CNN performance on CPU is slower with the `channels_first` data format than with `channels_last` when using Keras+MXNet. Below are the details:
- Example: `keras/example/cifar10_cnn.py`
- Python: 2.7.12
- MXNet: 1.2.0 (MXNet-mkl)
- Keras: 2.1.5
- Instance: Amazon AWS c5.xlarge
- OS: Ubuntu 16.04.4 LTS
Results:

| Data format | Avg. training time/epoch |
|---|---|
| MXNet `channels_last` | 191 s (122 ms/step) |
| MXNet `channels_first` | 214 s (137 ms/step) |
I profiled both runs and found that the execution time of the `broadcast_add()` operator differs between the `channels_last` and `channels_first` data formats.

`broadcast_add()`:

- Forward propagation: `channels_first` data is ~1.5x slower than `channels_last` data
- Backpropagation: `channels_first` data is ~2x slower than `channels_last` data
Filename: `keras/example/cifar10_cnn.py`

Note: for the `channels_last` data format, your `keras.json` file should have `"image_data_format"` set to `"channels_last"` and `"backend"` set to `"mxnet"`.
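For reference, a minimal `~/.keras/keras.json` along those lines (the `epsilon` and `floatx` fields are shown with their common defaults and may differ in your setup):

```json
{
    "image_data_format": "channels_last",
    "backend": "mxnet",
    "epsilon": 1e-07,
    "floatx": "float32"
}
```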
Execution time during forward propagation of the `broadcast_add()` operator:

| `channels_last` | `channels_first` |
|---|---|
| 0.819 ms | 1.253 ms |
| 0.825 ms | 1.261 ms |
| 0.822 ms | 1.260 ms |
Execution time during backpropagation of the `broadcast_add()` operator:

| `channels_last` | `channels_first` |
|---|---|
| 4.058 ms | 8.368 ms |
| 4.056 ms | 8.367 ms |
| 6.562 ms | 8.450 ms |
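As a rough way to reproduce the effect outside Keras, one can time a bias-style broadcast add over the two layouts directly. This is a minimal sketch using NumPy as a stand-in (the profiling above used MXNet's `broadcast_add`, so absolute numbers will differ; the shapes mirror a per-channel bias add in NCHW vs. NHWC):

```python
import time
import numpy as np

def time_broadcast_add(x, bias, repeats=50):
    """Time x + bias (with broadcasting) and return the mean per-call time in ms."""
    start = time.perf_counter()
    for _ in range(repeats):
        _ = x + bias
    return (time.perf_counter() - start) / repeats * 1e3

n, c, h, w = 32, 64, 32, 32  # illustrative batch/channel/spatial sizes

# channels_first (NCHW): the bias broadcasts over the trailing spatial axes
x_nchw = np.random.rand(n, c, h, w).astype(np.float32)
bias_nchw = np.random.rand(c, 1, 1).astype(np.float32)

# channels_last (NHWC): the bias broadcasts over the contiguous trailing axis
x_nhwc = np.random.rand(n, h, w, c).astype(np.float32)
bias_nhwc = np.random.rand(c).astype(np.float32)

t_first = time_broadcast_add(x_nchw, bias_nchw)
t_last = time_broadcast_add(x_nhwc, bias_nhwc)
print("channels_first: %.3f ms, channels_last: %.3f ms" % (t_first, t_last))
```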
Ideally, CNN performance with the MXNet `channels_first` data format should be better than with `channels_last`.

Related: the MXNet `broadcast_add` operator is significantly slower on CPU - https://github.com/apache/incubator-mxnet/issues/8219
Also, I did a rough calculation of `channels_first` vs. `channels_last` performance, and of how much the performance would improve if the `broadcast_add` operator performed as well for `channels_first` as it does for `channels_last`:
- MXNet `channels_first` takes 214 s/epoch (from the analysis above).
- From profiling, I got a rough estimate that one loop (one training step) takes 350 ms (see the diagram below).
- So in one epoch the model loop runs 214 s / 350 ms ≈ 611 ≈ 600 times, if we ignore initialization and termination time.
In one loop, the time taken by the `broadcast_add()` operator when the data is `channels_first`:

- Forward prop: 3.8 ms
- Backward prop: 47.9 ms
- Total: 51.7 ms
In one loop, the time taken by the `broadcast_add()` operator when the data is `channels_last`:

- Forward prop: 2.5 ms
- Backward prop: 23.1 ms
- Total: 25.6 ms
If we assume that `broadcast_add()` performs the same for `channels_first` as for `channels_last`, then the time to execute one loop would be 350 - 51.7 + 25.6 = 323.9 ms.
So the time taken by the model in one epoch is:

- OLD: 600 * 350 ms = 210 s
- NEW: 600 * 323.9 ms ≈ 194 s

We would get roughly a 16 s/epoch performance boost if `broadcast_add()` performed the same for `channels_first` as for `channels_last`.
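The back-of-the-envelope estimate above can be written out in a few lines of arithmetic (all numbers are the measurements quoted in this post):

```python
# Per-loop broadcast_add() cost (ms), forward + backward, from the profiling above
ba_channels_first = 3.8 + 47.9   # 51.7 ms
ba_channels_last = 2.5 + 23.1    # 25.6 ms

loop_ms = 350.0        # rough per-loop time with channels_first
loops_per_epoch = 600  # ~214 s / 350 ms, ignoring setup/teardown

# Hypothetical per-loop time if broadcast_add matched the channels_last cost
new_loop_ms = loop_ms - ba_channels_first + ba_channels_last

old_epoch_s = loops_per_epoch * loop_ms / 1000.0
new_epoch_s = loops_per_epoch * new_loop_ms / 1000.0
print(old_epoch_s, new_epoch_s, old_epoch_s - new_epoch_s)
# old ~210 s, new ~194 s: roughly a 16 s/epoch saving
```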
Note: this analysis was done on `keras/example/cifar10_cnn.py`, and the configuration is the same as in my post above.
This may be related to the `mx.sym.dot()` operator performance issue - https://github.com/apache/incubator-mxnet/issues/10881
Hi,

I am running `keras/examples/mnist_cnn.py` and `keras/examples/cifar10_cnn.py`, and I can see that the training time per epoch on Keras with the MXNet backend is higher than on Keras with the TensorFlow backend when run on CPU. The image data is already `channels_first` when using the MXNet backend and `channels_last` when using the TensorFlow backend, which means there is no transpose overhead. Below are the pieces of information:
- Machine: MacBook Pro (2.5 GHz Intel Core i7, 16 GB 2133 MHz RAM)
- Python version: 2.7.14
- MXNet version: 1.1.0
- TensorFlow version: 1.5.0
- Keras version: 2.1.4
Results:
Thank you!