Performance regression on training resnet152 with CIFAR10 on CPU

roywei commented 5 years ago

Follow up on dev list discussion:

https://lists.apache.org/thread.html/154ef1e4010671e7375c7a7cbedb413d5a4a3677321488440fb32a3a@%3Cdev.mxnet.apache.org%3E

We have found that resnet152 shows a regression when training on the CIFAR10 dataset on CPU (c5.18xlarge).

To summarize the findings:

Scripts/Model: https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py

Total of 20 epochs, with the first 10 epochs used for warm-up.

With MXNet 1.4.1, average time is 164.23 s.

With MXNet 1.5.0, average time is 174.59 s (~6.3% regression).

(1.5.0 version: pip install mxnet-mkl==1.5.0b20190619, which corresponds to commit ccbbf6b4b76ea536a6583c99497c83b65a20817b, 4 commits behind the 1.5.x branch.)

With a total of 50 epochs, the first 10 epochs as warm-up, and a fixed seed:

1.4.1: 164.95 s

1.5.0: 170.44 s

This is about a 3% regression. Detailed data at [1]. (1.5.0 version: 1.5.0rc2 release candidate, built from source with MKLDNN.)
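The regression percentages quoted above are the relative increase in average time. A minimal sketch of that arithmetic, assuming the reported averages are per-epoch times taken over the epochs after warm-up (the helper names `mean_after_warmup` and `slowdown_pct` are hypothetical and not part of the benchmark script):

```python
from statistics import mean


def mean_after_warmup(epoch_times, warmup=10):
    """Average epoch time in seconds, ignoring the first `warmup` epochs."""
    measured = epoch_times[warmup:]
    if not measured:
        raise ValueError("no epochs left after warm-up")
    return mean(measured)


def slowdown_pct(baseline_s, candidate_s):
    """Relative increase in time of candidate over baseline, in percent."""
    return (candidate_s - baseline_s) / baseline_s * 100.0


# Averages reported in this issue (assumed to be seconds per epoch).
print(f"20-epoch run: {slowdown_pct(164.23, 174.59):.1f}%")
print(f"50-epoch run: {slowdown_pct(164.95, 170.44):.1f}%")
```

With per-epoch timings from a real run, `mean_after_warmup` would be applied to each version's list of epoch times before the two averages are passed to `slowdown_pct`.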

Gluon ResNet model: Gluon speed test benchmark script (https://github.com/apache/incubator-mxnet/blob/master/benchmark/python/gluon/benchmark_gluon.py), run with the following command:

python3 benchmark_gluon.py --model 'resnet152_v2' --batch-size 128 --num-batches 200 --type 'training'

I got the following speeds:

With MXNet 1.4.1, average speed is 25.677534 img/s.

With MXNet 1.5.0, average speed is 25.082130 img/s (~2.3% regression).
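The Gluon figure is a throughput drop rather than a time increase, so it is not computed quite the same way as the epoch-time numbers. A small sketch of both views, using only the two speeds reported here (the function names are hypothetical):

```python
def throughput_drop_pct(baseline_img_s, candidate_img_s):
    """Relative decrease in images/second of candidate vs. baseline, in percent."""
    return (baseline_img_s - candidate_img_s) / baseline_img_s * 100.0


def time_increase_pct(baseline_img_s, candidate_img_s):
    """Equivalent relative increase in time per image, in percent."""
    return (baseline_img_s / candidate_img_s - 1.0) * 100.0


# Speeds reported in this issue for resnet152_v2 training (img/s).
v141, v150 = 25.677534, 25.082130
print(f"throughput drop: {throughput_drop_pct(v141, v150):.1f}%")
print(f"time increase:   {time_increase_pct(v141, v150):.1f}%")
```

The two percentages differ slightly because one is measured relative to the old speed and the other relative to the new one. This matters when comparing an img/s result against a seconds-per-epoch result.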

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Performance

pengzhao-intel commented 5 years ago

Thanks for summarizing the issue. @ciyongch is working on this task.

Performance regression on training resnet152 with CIFAR10 on CPU #15430