apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

MXNet 1.5.0 is slower than 1.3.0 when inputs are variable #13928

Closed wkcn closed 5 years ago

wkcn commented 5 years ago

Description

Hi! I have an experiment on object counting, which needs variable-sized inputs. I wrote the code with Gluon and hybridized the model with static_alloc=True. I found an obvious speed difference between MXNet 1.5.0 and MXNet 1.3.0, and I checked it on two servers.

I think the method of memory allocation for Gluon may have changed after MXNet 1.3.0.

Thanks!

Update: when there are dilated convolutional layers in the model and the input size varies, the performance drops. I think it may be related to one of these two PRs: #11742 #12722

Environment info (Required)

OS: Ubuntu 14.04
GPU: Tesla M40 x 4

Minimum reproducible example

I wrote a minimal reproducible example that doesn't need a dataset. Code

The performance is the same when the input shape is fixed.

Input shape: (9, 3, 300-512, 300-512) in NCHW order
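Since the linked Code gist isn't inlined above, here is a minimal sketch of what such a benchmark looks like; the layer widths, iteration count, and variable names are my own placeholders, not the actual script:

import random
import time
import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn

ctx = mx.gpu(0)

# A small fully convolutional net with one dilated conv, hybridized
# with static_alloc=True as described in the issue.
net = nn.HybridSequential()
net.add(nn.Conv2D(32, kernel_size=3, padding=1, activation='relu'),
        nn.Conv2D(32, kernel_size=3, padding=2, dilation=2, activation='relu'),
        nn.Conv2D(1, kernel_size=1))
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True)

tic, n_images = time.time(), 0
for _ in range(50):
    # Variable spatial sizes in NCHW order, matching the shapes above.
    h, w = random.randint(300, 512), random.randint(300, 512)
    x = nd.random.uniform(shape=(9, 3, h, w), ctx=ctx)
    net(x).wait_to_read()  # block until the forward pass finishes
    n_images += 9
print('%.1f images / sec' % (n_images / (time.time() - tic)))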

Package used (Python/R/Scala/Julia): Python 2.7.12, 3.7.1

MXNet is installed by pip:

# MXNet 1.5.0
pip install mxnet-cu80 --pre
# MXNet 1.3.0
pip install mxnet-cu80==1.3.0

Steps to reproduce

Download the test code. Run the test code under each version of MXNet (1.3.0 and 1.5.0).

Performance

I tested several versions of MXNet.

version performance
1.4.0b20181207 slow
1.3.1b20181101 slow
1.3.1b20181010 slow
1.3.1b20181004 fast
1.3.1b20181001 fast

Some pre-built versions don't support CUDA 9.0, so I couldn't test them. The performance drops between the 20181004 and 20181010 builds.

If the dilation of the dilated convolutions is changed to 1, the performance returns to normal. It seems the problem occurs in dilated convolution.
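To make the dilation comparison concrete, the two variants could be built like this in Gluon (a sketch; the channel count is a placeholder):

from mxnet.gluon import nn

# Dilated 3x3 conv; padding equal to the dilation preserves the spatial size.
dilated = nn.Conv2D(channels=64, kernel_size=3, padding=2, dilation=2)

# The dilation-1 control, i.e. an ordinary 3x3 conv.
plain = nn.Conv2D(channels=64, kernel_size=3, padding=1, dilation=1)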

piyushghai commented 5 years ago

@wkcn Thanks for raising this issue. The performance degradation is indeed concerning. I'm labelling it so that other community members can take a look.

@mxnet-label-bot Add [Gluon, Performance]

@szha Any thoughts here ?

zhreshold commented 5 years ago

@wkcn You reported:

Performance:
MXNet 1.5.0: 20 images / sec
MXNet 1.3.0: 70+ images / sec

What are these numbers specifically? Training speed for Faster-RCNN? If so, what is the network?

adaaaaaa commented 5 years ago

What is the difference between 1.3.0 and 1.5.0 in memory allocation?

wkcn commented 5 years ago

@piyushghai Thanks. @zhreshold In my experiment, it's a fully convolutional network model (VGG16 without FC layers) whose inputs vary in size. The performance I reported is for this fully convolutional network, not a Faster R-CNN model. I guess that the performance of Faster R-CNN also drops in MXNet 1.5.0. I will check the performance of Faster R-CNN, or write a minimal reproducible example.

wkcn commented 5 years ago

@adaaaaaa I don't know. I found that the speeds of the two versions are the same when the input shapes are fixed. In my code, I call hybridize() first, then call hybridize(static_alloc=True).
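For clarity, that call sequence looks like this (net stands for the actual model; the second call re-hybridizes with the new flag):

net.hybridize()                    # first call: default hybridization
net.hybridize(static_alloc=True)   # re-hybridize with static memory allocation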

szha commented 5 years ago

what are the typical input sizes?

wkcn commented 5 years ago

@szha In my experiment, the input size is (9, 3, 300-512, 300-512), where 9 is the batch size and 3 is the number of channels. I will write a minimal reproducible example later.

wkcn commented 5 years ago

@zhreshold @szha Hello! I have written a minimal reproducible example which doesn't need a dataset. Code

I tested it on a machine with Tesla M40 (22945 MiB) x 4. Here is the result:

MXNet 1.5.0: 10 images / sec
MXNet 1.3.0: 40+ images / sec

MXNet is installed by pip install mxnet-cu90 --pre or pip install mxnet-cu90==1.3.0

I tested several versions of MXNet.

version performance
1.4.0b20181207 slow
1.3.1b20181101 slow
1.3.1b20181010 slow
1.3.1b20181004 fast
1.3.1b20181001 fast

Some pre-built versions don't support CUDA 9.0, so I couldn't test them. The performance drops between the 20181004 and 20181010 builds.

zhreshold commented 5 years ago

@wkcn I've tested it using V100 x 4. There's no visible difference between the 1.3.1 release, 1.4.0b20181207, and the 1.5.0b20190122 nightly; all are around 140 (±20) images/sec.

I also tested 1.3.1b20181001, and it is actually slower (120 ± 20 images/sec on average) than any of the previous three builds. In summary, my experimental results are the reverse of @wkcn's.

wkcn commented 5 years ago

@zhreshold Thank you!

It's flaky. I tested it on the server with Ubuntu 14.04, Tesla M40 (24G) x 4, and CUDA 9.0. When I remove all dilated convolutions (those whose dilation is greater than 1), there is no obvious difference between MXNet 1.3 and 1.5.

wkcn commented 5 years ago

@zhreshold I tested it on the server with Ubuntu 14.04, Tesla M40 (24G) x 4, and CUDA 8.0 just now. The training speed is 40+ samples/sec.

I think the performance drops because of the driver rather than MXNet. The CUDA 9.0 driver installed on the server does not match the latest MXNet.

zhreshold commented 5 years ago

@wkcn

Thanks for the update. Can we resolve this issue?

wkcn commented 5 years ago

@zhreshold Solved. Thank you!

mikeobr commented 5 years ago

@wkcn

You wrote: "The CUDA 9.0 driver installed on the server is not matched with latest MXNet." What exactly did you check to diagnose this?

I'm currently seeing some of my inference workloads slow down a lot on MXNet versions above 1.3.1 with CUDA 9.2 (run in a Docker container), but I don't know how to check whether it is the same thing you ran into.

wkcn commented 5 years ago

@mikeobr You can run this code: https://gist.githubusercontent.com/wkcn/69f0f6d2ca467816dc481a00c225104f/raw/2899896f42a920ff0fde5ff93b9a16d16aec507f/test_fcn_for_mxnet.py

It seems that the performance of dilated convolutional layers drops under CUDA 9.
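One way to check whether the dilated layer alone is the bottleneck is to time it in isolation; a sketch with placeholder sizes, not code from the gist:

import time
import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn

ctx = mx.gpu(0)
conv = nn.Conv2D(64, kernel_size=3, padding=2, dilation=2)
conv.initialize(ctx=ctx)
x = nd.random.uniform(shape=(9, 3, 512, 512), ctx=ctx)
conv(x).wait_to_read()  # warm-up: allocation and cuDNN algorithm selection

tic = time.time()
for _ in range(100):
    conv(x).wait_to_read()
print('%.4f s per forward pass' % ((time.time() - tic) / 100))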

PapaMadeleine2022 commented 5 years ago

Hello, I have a problem: I run batch-image inference with libmxnet.so compiled from MXNet v0.7 and compare it against v1.0 (as well as v1.3 and v1.4), and I find that inference with the higher MXNet versions is slower than with v0.7. What causes this problem? How can I fix it? Can anyone give some advice?

Environment: P40 / CUDA 8 / cuDNN 5.1.10 / NVIDIA driver 384.81

wkcn commented 5 years ago

@IvyGongoogle Are there any dilated convolutional layers in your model?

vc384 commented 5 years ago

I met the same problem. I have a project with dilated convolution (ResNet backbone). If I use mxnet-cu80 1.3.1 (pip install), the speed is 0.18-0.19 s per iteration. However, when I switch to mxnet-cu80 1.4.0 (pip install), the speed drops to 0.19-0.20 s per iteration. The drop is slight, but it confuses me.

OS: Ubuntu 16.04
Driver: 384.130
CUDA: 8.0
cuDNN: maybe 7.4.1 or 6.0.21

wkcn commented 5 years ago

Could anyone try MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=1 python test.py? There are some PRs which may be related to the issue:

#11742 #12722
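If editing the command line is inconvenient, the same flag can also be set from Python; setting it before importing mxnet is the conservative choice, since the backend reads it from the environment (a sketch; test.py stands for the benchmark script above):

import os

# Set before importing mxnet so the backend is guaranteed to see it.
os.environ['MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION'] = '1'

import mxnet as mx  # import after setting the variable
# ... then run the benchmark from test.py as usual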

PapaMadeleine2022 commented 5 years ago

@wkcn There are no dilated convolutional layers in my model, which is an OCR recognition model with a simple CNN and RNN.

chinakook commented 5 years ago

Based on experience, you should use newer versions of CUDA and cuDNN to get better performance. In my opinion, CUDA 8.0 is obsolete. P.S.: Dilated convolution is not optimized in old cuDNN versions (< 6.0, or maybe 6.5).

wkcn commented 5 years ago

Closing this, since dilated convolution is not optimized in old versions of cuDNN, as @chinakook said.