@wkcn Thanks for raising this issue. The performance degradation is indeed concerning. I'm labelling it so that the other community members can have a look at it.
@mxnet-label-bot Add [Gluon, Performance]
@szha Any thoughts here?
@wkcn

> Performance: MXNet 1.5.0: 20 images/sec, MXNet 1.3.0: 70+ images/sec

What are these numbers specifically? Training speed for Faster R-CNN? If so, what is the network?
What is the difference between 1.3.0 and 1.5.0 in memory allocation?
@piyushghai Thanks. @zhreshold In my experiment, it's a fully convolutional network model (VGG16 without FC layers) whose input shapes vary. The performance I reported is for the fully convolutional network, not the Faster R-CNN model. I suspect the performance of Faster R-CNN also drops in MXNet 1.5.0. I will check the performance of Faster R-CNN, or write a minimum reproducible example.
@adaaaaaa I don't know. I found the speeds are the same between the two versions when the input shape is fixed. In my code, I call `hybridize()` first, then call `hybridize(static_alloc=True)`.
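In case it helps others reproduce the call pattern, here is a minimal sketch of what I mean (the tiny dilated network is only illustrative, not my actual model):

```python
import mxnet as mx
from mxnet import gluon, nd

# Illustrative stand-in for the real model: a small fully convolutional net
# containing one dilated convolution (dilation > 1).
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(64, kernel_size=3, padding=1),
        gluon.nn.Activation('relu'),
        gluon.nn.Conv2D(64, kernel_size=3, padding=2, dilation=2))
net.initialize(ctx=mx.cpu())

# The call pattern described above: hybridize() first,
# then re-hybridize with static_alloc=True.
net.hybridize()
net.hybridize(static_alloc=True)

# With a fixed input shape like this one, both MXNet versions run at the same speed.
out = net(nd.random.uniform(shape=(9, 3, 300, 300)))
out.wait_to_read()
```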
what are the typical input sizes?
@szha In my experiment, the input size is (9, 3, 300~512, 300~512), where 9 is the batch size, 3 is the number of channels, and the height and width vary between 300 and 512. I will write a minimum reproducible example later.
@zhreshold @szha Hello! I have written a minimum reproducible example which doesn't need a dataset: Code
I tested it on a machine with 4 Tesla M40 GPUs (22945 MiB each). Here is the result:
MXNet 1.5.0: 10 images/sec
MXNet 1.3.0: 40+ images/sec
MXNet is installed by `pip install mxnet-cu90 --pre` or `pip install mxnet-cu90==1.3.0`.
I tested several versions of MXNet:
version | performance |
---|---|
1.4.0b20181207 | slow |
1.3.1b20181101 | slow |
1.3.1b20181010 | slow |
1.3.1b20181004 | fast |
1.3.1b20181001 | fast |
Some pre-built versions don't support CUDA 9.0, so I couldn't test them. The performance drop occurred between the 20181004 and 20181010 builds.
@wkcn I've tested it using V100 x 4; there's no visible difference between the 1.3.1 release, 1.4.0b20181207, and the 1.5.0b20190122 nightly, all around 140 (±20) images/sec.
Actually, I also tested 1.3.1b20181001; it is slower (120±20 images/sec on average) than any of the previous three builds. In summary, my experimental results are the reverse of @wkcn's results.
@zhreshold Thank you!
It's flaky. I tested it on a server with Ubuntu 14.04, Tesla M40 (24 GB) x 4, and CUDA 9.0. When I remove all dilated convolutions (convolutions whose dilation is greater than 1), there is no obvious difference between MXNet 1.3 and 1.5.
@zhreshold I just tested it on the server with Ubuntu 14.04, Tesla M40 (24 GB) x 4, and CUDA 8.0. The training speed is 40+ samples/sec.
I think the performance drops because of the driver rather than MXNet. The CUDA 9.0 driver installed on the server does not match the latest MXNet.
@wkcn
Thanks for the update. Can we resolve this issue?
@zhreshold solved. Thank you!
@wkcn
> The CUDA 9.0 driver installed on the server does not match the latest MXNet.

What exactly did you check to diagnose this?
I'm currently seeing some of my inference workloads slow down a lot on MXNet versions above 1.3.1 with CUDA 9.2 (run in a Docker container), but I do not know how to check whether it is the same thing you ran into.
@mikeobr You can run this code: https://gist.githubusercontent.com/wkcn/69f0f6d2ca467816dc481a00c225104f/raw/2899896f42a920ff0fde5ff93b9a16d16aec507f/test_fcn_for_mxnet.py
It seems that the performance of dilated convolutional layers drops with CUDA 9.
Hello, I have a problem: comparing libmxnet.so compiled from MXNet v0.7 against MXNet v1.0 (or v1.3 and v1.4), when I run my code to infer a batch of images, I find that inference with the higher MXNet versions is slower than with v0.7. What causes this problem? How can I fix it? Can anyone give some advice?
Environment: P40 / CUDA 8 / cuDNN 5.1.10 / NVIDIA driver 384.81
@IvyGongoogle Are there any dilated convolutional layers in your model?
I met the same problem. I have a project with dilated convolutions (ResNet backbone). If I use mxnet-cu80 1.3.1 (pip install), the speed is 0.18-0.19 s per iteration. However, when I switch to mxnet-cu80 1.4.0 (pip install), the speed drops to 0.19-0.20 s per iteration. The drop is slight, but it confuses me.
OS: Ubuntu 16.04
Driver: 384.130
CUDA: 8.0
cuDNN: maybe 7.4.1 or 6.0.21
There are some PRs which may be related to the issue. Could anyone try `MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=1 python test.py`?
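If it is easier than changing the launch command, here is a minimal sketch of setting that flag from inside the test script instead (assuming, as with most MXNet environment variables, that it should be set before `mxnet` is imported):

```python
import os

# Equivalent to running `MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=1 python test.py`:
# set the variable before importing mxnet so it is visible when the engine starts.
os.environ['MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION'] = '1'

import mxnet as mx
print(mx.__version__)
```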
@wkcn There are no dilated convolutional layers in my model, which is an OCR recognition model with a simple CNN and RNN.
Based on experience, you should use a newer version of CUDA and cuDNN to get better performance. In my opinion, CUDA 8.0 is obsolete. P.S.: Dilated convolution is not optimized in old cuDNN versions (< 6.0, or maybe 6.5).
Closing this, since dilated convolution is not optimized in old versions of cuDNN, as chinakook said.
Description
Hi! I have an experiment on object counting, which needs variable-sized inputs. I write the code with Gluon and hybridize the model with `static_alloc=True`. I found there is an obvious performance difference between MXNet 1.5.0 and MXNet 1.3.0, and I checked it on two servers. I think the method of memory allocation for Gluon may have changed after MXNet 1.3.0. Thanks!
Update: when there are dilated convolutional layers in the model and the input size is variable, the performance drops. I think it may be related to one of these two PRs: #11742, #12722.
Environment info (Required)
OS: Ubuntu 14.04
GPU: Tesla M40 x 4
Minimum reproducible example
I wrote a minimum reproducible example which doesn't need a dataset: Code
The performance is the same in both versions when the input shape is fixed.
Input shape: (9, 3, 300~512, 300~512) in NCHW order
Package used (Python/R/Scala/Julia): Python 2.7.12, 3.7.1
MXNet is installed by pip: `pip install mxnet-cu90 --pre` or `pip install mxnet-cu90==1.3.0`.
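For reference, a rough sketch of how inputs in that shape range can be generated; the linked test script is the authoritative version, and this only illustrates the variable-shape batches:

```python
import random
import mxnet as mx
from mxnet import nd

ctx = mx.gpu(0)  # the benchmark runs on GPU (Tesla M40)
random.seed(0)

for _ in range(50):
    # Batch of 9, 3 channels, height and width drawn from [300, 512] (NCHW).
    h = random.randint(300, 512)
    w = random.randint(300, 512)
    x = nd.random.uniform(shape=(9, 3, h, w), ctx=ctx)
    # The forward/backward pass of the hybridized model from the test script goes here.
```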
Steps to reproduce
1. Download the test code.
2. Run it under different versions of MXNet (1.3.0 and 1.5.0).
Performance
I tested several versions of MXNet (see the table in my comment above).
Some pre-built versions don't support CUDA 9.0, so I couldn't test them. The performance drop occurred between the 20181004 and 20181010 builds.
If I change the dilation of the dilated convolutions to 1, the performance is normal. It seems the problem occurs in dilated convolution.
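For anyone who wants to check this in isolation, below is a hedged sketch (not the linked test script) that times the same tiny network with dilation=2 versus dilation=1 over variable input sizes. It runs forward passes only, whereas the full script trains, but any gap between the two runs should isolate the dilated-convolution path:

```python
import random
import time

import mxnet as mx
from mxnet import gluon, nd

ctx = mx.gpu(0)

def make_net(dilation):
    # Tiny stand-in model; only the dilation (and matching padding) differs.
    net = gluon.nn.HybridSequential()
    net.add(gluon.nn.Conv2D(64, kernel_size=3, padding=dilation, dilation=dilation),
            gluon.nn.Activation('relu'))
    net.initialize(ctx=ctx)
    net.hybridize(static_alloc=True)
    return net

def bench(net, shapes):
    # One warm-up pass over all shapes, then a timed pass.
    for shape in shapes:
        net(nd.random.uniform(shape=shape, ctx=ctx)).wait_to_read()
    mx.nd.waitall()
    start = time.time()
    for shape in shapes:
        net(nd.random.uniform(shape=shape, ctx=ctx)).wait_to_read()
    mx.nd.waitall()
    return len(shapes) * shapes[0][0] / (time.time() - start)  # images/sec

random.seed(0)
shapes = [(9, 3, random.randint(300, 512), random.randint(300, 512)) for _ in range(20)]
print('dilation=2: %.1f images/sec' % bench(make_net(2), shapes))
print('dilation=1: %.1f images/sec' % bench(make_net(1), shapes))
```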