awslabs / keras-apache-mxnet

[DEPRECATED] Amazon Deep Learning's Keras with Apache MXNet support
https://github.com/awslabs/keras-apache-mxnet/wiki

Keras-MXNet CNN for training on CPU is relatively slower #56

Open karan6181 opened 6 years ago

karan6181 commented 6 years ago

Hi,

I am running keras/examples/mnist_cnn.py and keras/examples/cifar10_cnn.py, and the training time per epoch with Keras on the MXNet backend is higher than with Keras on the TensorFlow backend when run on CPU.
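For reference, per-epoch timing can be measured with a generic harness like the one below (a sketch only; the figures in this issue come from Keras's own progress-bar output, not from this helper):

```python
import time

def avg_epoch_time(train_one_epoch, n_epochs):
    """Run train_one_epoch() n_epochs times and return the mean wall time in seconds."""
    times = []
    for _ in range(n_epochs):
        start = time.time()
        train_one_epoch()
        times.append(time.time() - start)
    return sum(times) / len(times)
```

Usage would be e.g. `avg_epoch_time(lambda: model.fit(x, y, epochs=1, verbose=0), 3)` with either backend selected in keras.json.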

The image data is already channels_first when using the MXNet backend and channels_last when using the TensorFlow backend, which means there is no transpose overhead in either case.
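To illustrate what the two layouts mean for a CIFAR-10-sized batch (shapes only; the variable names are ours, not from the scripts):

```python
import numpy as np

channels_last = np.zeros((32, 32, 32, 3))    # NHWC: (batch, height, width, channels), TensorFlow's default
channels_first = np.zeros((32, 3, 32, 32))   # NCHW: (batch, channels, height, width), MXNet's default

# Converting between layouts is a transpose; the point above is that each
# backend already receives data in its preferred layout, so no such
# transpose cost is included in the timings.
as_first = np.transpose(channels_last, (0, 3, 1, 2))
assert as_first.shape == channels_first.shape
```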

Below are the pieces of information:

Machine: MacBook Pro (2.5 GHz Intel Core i7, 16 GB 2133 MHz RAM)

Python version: 2.7.14

MXNet version: 1.1.0

Tensorflow version: 1.5.0

Keras version: 2.1.4

Results:

| Backend | mnist_cnn.py | cifar10_cnn.py |
| --- | --- | --- |
| Keras + MXNet | 272 sec/epoch | 388 sec/epoch |
| Keras + TensorFlow | 150 sec/epoch | 239 sec/epoch |

Thank-You!

roywei commented 6 years ago

We also need to investigate why CPU inference is fast while CNN training is slow.

karan6181 commented 6 years ago

Also,

I found that with Keras+MXNet, CNN performance on CPU is slower with the channels_first data format than with channels_last. Details below:

Example: keras/examples/cifar10_cnn.py

Python: 2.7.12

MXNet: 1.2.0 (mxnet-mkl)

Keras: 2.1.5

Instance: Amazon AWS c5.xlarge

OS: Ubuntu 16.04.4 LTS

Results:

| Data format | Avg. training time/epoch |
| --- | --- |
| MXNet channels_last | 191 s (122 ms/step) |
| MXNet channels_first | 214 s (137 ms/step) |

I profiled the runs and found that the execution time of the broadcast_add() operator differs between the channels_last and channels_first data formats.

broadcast_add():

Forward propagation: channels_first is about 1.5x slower than channels_last

Backpropagation: channels_first is about 2x slower than channels_last


Filename: keras/examples/cifar10_cnn.py

Note: For the channels_last data format, your keras.json file should set "image_data_format" to "channels_last" and "backend" to "mxnet"
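For reference, a keras.json (typically at ~/.keras/keras.json) for the channels_last + MXNet run might look like the following (the "epsilon" and "floatx" values shown are the usual Keras defaults, not something specified in this issue):

```json
{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "mxnet"
}
```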

Execution time during Forward propagation of broadcast_add() operator:

| channels_last | channels_first |
| --- | --- |
| 0.819 ms | 1.253 ms |
| 0.825 ms | 1.261 ms |
| 0.822 ms | 1.260 ms |

Execution time during backpropagation of broadcast_add() operator:

| channels_last | channels_first |
| --- | --- |
| 4.058 ms | 8.368 ms |
| 4.056 ms | 8.367 ms |
| 6.562 ms | 8.450 ms |

Ideally, CNN performance with MXNet should be better with the channels_first data format than with channels_last.
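One plausible mechanism (our illustration and conjecture, not something established by the profile above): the bias add after a conv layer is a broadcast_add whose broadcast axis depends on layout. With channels_last the bias broadcasts along the innermost, contiguous axis, which vectorizes well; with channels_first it broadcasts across the strided H and W axes. Using hypothetical shapes for a 32-filter conv layer:

```python
import numpy as np

feat_nhwc = np.ones((32, 32, 32, 32))   # channels_last: (N, H, W, C)
bias_nhwc = np.ones((32,))              # broadcasts over the contiguous last axis

feat_nchw = np.ones((32, 32, 32, 32))   # channels_first: (N, C, H, W)
bias_nchw = np.ones((32, 1, 1))         # broadcasts over the strided H, W axes

# Both produce the same result shape; only the memory-access pattern differs.
out_nhwc = feat_nhwc + bias_nhwc
out_nchw = feat_nchw + bias_nchw
assert out_nhwc.shape == out_nchw.shape == (32, 32, 32, 32)
```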

sandeep-krishnamurthy commented 6 years ago

MXNet broadcast_add operator is significantly slower on CPU - https://github.com/apache/incubator-mxnet/issues/8219

karan6181 commented 6 years ago

Also, I did a rough calculation of how much the channels_first performance would improve if the broadcast_add operator performed as well with channels_first as it does with channels_last.

MXNet channels_first takes 214 sec/epoch (from the analysis above).

From profiling, I got a rough estimate that one loop takes 350 ms (see the diagram below).

[profiler screenshot: one training-loop iteration takes roughly 350 ms]

So in one epoch this loop runs 214 sec / 350 ms ≈ 611, call it 600 times, if we ignore initialization and termination time.

In one loop, the time taken by the broadcast_add() operator when the data is channels_first is:

Forward prop: 3.8 ms

Backward prop: 47.9 ms

Total: 51.7 ms

In one loop, the time taken by the broadcast_add() operator when the data is channels_last is:

Forward prop: 2.5 ms

Backward prop: 23.1 ms

Total: 25.6 ms

If we assume that broadcast_add() performs the same for channels_first as for channels_last, then one loop takes 350 - 51.7 + 25.6 = 323.9 ms.

So the time taken by the model in one epoch is:

OLD: 600 * 350 ms = 210 sec

NEW: 600 * 323.9 ms ≈ 194 sec

That is roughly a 16 sec/epoch speedup if broadcast_add() performed the same for channels_first as for channels_last.
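The back-of-envelope arithmetic above can be checked directly (all numbers are taken from the profiling results in this thread):

```python
# Estimate the epoch-time saving if broadcast_add were layout-independent.
loops_per_epoch = 600              # ~214 s / 350 ms per loop, rounded
loop_ms = 350.0                    # measured loop time, channels_first
bcast_first = 3.8 + 47.9           # broadcast_add fwd + bwd, channels_first (ms)
bcast_last = 2.5 + 23.1            # broadcast_add fwd + bwd, channels_last (ms)

fixed_loop_ms = loop_ms - bcast_first + bcast_last       # 323.9 ms
old_epoch_s = loops_per_epoch * loop_ms / 1000.0         # 210.0 s
new_epoch_s = loops_per_epoch * fixed_loop_ms / 1000.0   # ~194.3 s
saving_s = old_epoch_s - new_epoch_s                     # ~15.7 s per epoch
```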

Note: This analysis was done on keras/examples/cifar10_cnn.py with the same configuration as in my post above.

sandeep-krishnamurthy commented 6 years ago

This may be related to the mx.sym.dot() operator performance issue - https://github.com/apache/incubator-mxnet/issues/10881