apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

The AMP performance of MXNet is worse than expected #19052

Open wzzju opened 4 years ago

wzzju commented 4 years ago

Description

Following the official MXNet AMP doc, I ran the same code on a Tesla V100 (16 GB, single card). However, I do not get the documented ~60% speedup, only about a 30% improvement, as shown below. I'm not sure whether my experiment configuration is incorrect. Could you please give me some suggestions? (A sketch of the AMP setup follows the screenshots.)

FP32

[screenshot: FP32 training speed]

AMP

[screenshot: AMP training speed]
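For reference, here is a minimal sketch of the AMP setup from the tutorial. The ResNet-50 model, optimizer settings, and synthetic `train_data` are illustrative placeholders, not the exact workload:

```python
# Minimal MXNet AMP pattern per the official tutorial. The model and data
# below are placeholders standing in for the actual workload.
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

amp.init()  # must run before the network is created

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v1()
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True, static_shape=True)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
amp.init_trainer(trainer)  # turns on dynamic loss scaling

# Placeholder data: a few random batches in lieu of a real DataLoader.
train_data = [(mx.nd.random.uniform(shape=(32, 3, 224, 224)),
               mx.nd.zeros((32,))) for _ in range(4)]

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
for data, label in train_data:
    data, label = data.as_in_context(ctx), label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
        # Scale the loss so float16 gradients do not underflow.
        with amp.scale_loss(loss, trainer) as scaled_loss:
            autograd.backward(scaled_loss)
    trainer.step(data.shape[0])
```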

github-actions[bot] commented 4 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

szha commented 4 years ago

@wzzju thanks for reporting. could you also share your workload?

sxjscience commented 4 years ago

I believe there is another issue that is related: https://github.com/apache/incubator-mxnet/issues/17665

wzzju commented 4 years ago

> I believe there is another issue that is related: #17665

[screenshot: F.BatchNormAddRelu usage in the NGC 20.06 container build]

I find that the MXNet build in the NGC 20.06 container is customized, because there is no F.BatchNormAddRelu in the incubator-mxnet repo. Besides, nn.BatchNorm in the NGC 20.06 container's MXNet is also different from this repo, as shown below.

[screenshot: side-by-side nn.BatchNorm source. The left is the NGC 20.06 container MXNet version; the right is the official MXNet 1.5 version.]

Could you please tell me why this happens? @sxjscience @szha Thank you in advance.
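For anyone comparing the two builds: judging by its name, F.BatchNormAddRelu looks like an NVIDIA fusion of BatchNorm, residual add, and ReLU into one kernel. Below is a sketch of the unfused equivalent in stock Gluon; the relu(BN(x) + residual) semantics are inferred from the operator's name, not confirmed anywhere in this thread:

```python
import mxnet as mx
from mxnet.gluon import nn

class BNAddRelu(nn.HybridBlock):
    """Unfused stand-in for the NGC container's fused BatchNormAddRelu.
    Assumes the fused op computes relu(BN(x) + residual)."""
    def __init__(self, **kwargs):
        super(BNAddRelu, self).__init__(**kwargs)
        with self.name_scope():
            self.bn = nn.BatchNorm()

    def hybrid_forward(self, F, x, residual):
        # In the NGC build this is presumably one kernel; upstream MXNet 1.5
        # executes BatchNorm, the add, and relu as three separate ops.
        return F.relu(self.bn(x) + residual)

# Quick usage check (runs on CPU):
blk = BNAddRelu()
blk.initialize()
out = blk(mx.nd.ones((1, 8, 4, 4)), mx.nd.ones((1, 8, 4, 4)))

# And a quick probe for whether the current build ships the fused operator:
print(hasattr(mx.nd, 'BatchNormAddRelu'))
```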

wzzju commented 4 years ago

> @wzzju thanks for reporting. could you also share your workload?

Thanks. A single Tesla V100 (16 GB) card is used.

[screenshots of the workload]
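A quick way to check whether AMP is actually running ops in float16 is to inspect dtypes after a forward pass. A diagnostic sketch, reusing the placeholder `net` from the AMP snippet above:

```python
import mxnet as mx

# After amp.init(), whitelist ops (Convolution, FullyConnected, ...) run in
# float16 while the master weights stay float32.
x = mx.nd.ones((1, 3, 224, 224), ctx=mx.gpu(0))
y = net(x)  # `net` is the AMP-initialized model from the sketch above
print('output dtype:', y.dtype)  # expect float16 if the last op is whitelisted
for name, param in list(net.collect_params().items())[:3]:
    print(name, param.dtype)     # master weights should remain float32
```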