apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more

mx.sym.dot() performance on CPU #10881

Closed: sandeep-krishnamurthy closed this issue 4 years ago

sandeep-krishnamurthy commented 6 years ago

We use the mx.sym.dot() operator heavily in Keras, and we observe suspiciously slow CPU performance. Profiling an RNN LSTM example gives the breakdown shown below.

The dot() operator accounts for roughly 90% of the computation time. Are there known performance implications of mx.sym.dot() on CPU?

We are using the mxnet-mkl (MKL-DNN) build; does the operator use GEMM operations under the hood?

[screenshot: profiler output showing dot() at ~90% of computation time]
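For reference, a minimal standalone micro-benchmark of the symbolic dot operator (a sketch; the matrix sizes and iteration counts are illustrative, not taken from the Keras workload):

```python
import time
import mxnet as mx
import numpy as np

# Symbolic dot of two square matrices (sizes are illustrative).
a = mx.sym.Variable('a')
b = mx.sym.Variable('b')
out = mx.sym.dot(a, b)

exe = out.simple_bind(ctx=mx.cpu(), a=(1024, 1024), b=(1024, 1024))
exe.arg_dict['a'][:] = np.random.uniform(size=(1024, 1024))
exe.arg_dict['b'][:] = np.random.uniform(size=(1024, 1024))

exe.forward()          # warm-up
mx.nd.waitall()

start = time.time()
for _ in range(100):
    exe.forward()
mx.nd.waitall()        # block until all async work finishes before stopping the clock
print('avg dot() time: %.3f ms' % ((time.time() - start) / 100 * 1000))
```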

@anirudh2290 @zheng-da - Any suggestions / comments?

anirudh2290 commented 6 years ago

I see that we are still using the mshadow dot here: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/dot-inl.h#L119 . @DickJC123 @zheng-da changed many operators to use linalg_gemm instead of mshadow::expr::dot. Can you provide more insight into whether that change brought a performance gain on CPU?
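A rough way to compare the two code paths from the Python front end (a sketch; mx.nd.linalg.gemm2 is the user-facing operator for the linalg gemm path, and the sizes are arbitrary):

```python
import time
import mxnet as mx

def bench(fn, n=50):
    fn()               # warm-up
    mx.nd.waitall()
    start = time.time()
    for _ in range(n):
        fn()
    mx.nd.waitall()
    return (time.time() - start) / n * 1000

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))

print('dot:          %.3f ms' % bench(lambda: mx.nd.dot(a, b)))
print('linalg.gemm2: %.3f ms' % bench(lambda: mx.nd.linalg.gemm2(a, b)))
```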

anirudh2290 commented 6 years ago

Adding @piiswrong for comment.

pengzhao-intel commented 6 years ago

@anirudh2290 Does mshadow::expr::dot call GEMM as well? If so, MKL GEMM will be much faster than the other implementations. Note that the pre-built version is not linked against the MKL library; we're working on statically linking MKL into the pre-built binaries.

anirudh2290 commented 6 years ago

Yes, AFAIK mshadow::expr::dot uses GEMM. dot_engine-inl.h has standalone implementations and also supports dispatching to other BLAS implementations: https://github.com/dmlc/mshadow/blob/master/mshadow/dot_engine-inl.h#L123 and https://github.com/dmlc/mshadow/blob/master/mshadow/dot_engine-inl.h#L280

pengzhao-intel commented 6 years ago

Thanks. So if we build from source with USE_BLAS=MKL, it will be faster. @sandeep-krishnamurthy could you give it a try? FYI, you can set MKL_VERBOSE=1 to get detailed information about each MKL GEMM call at runtime. We can then do further analysis and optimization for the different GEMM sizes.
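A sketch of how to check the verbose output from Python (exporting the variable in the shell before launching the process works equally well; MKL must be the linked BLAS for any output to appear):

```python
import os
# Must be set before MKL is loaded; exporting MKL_VERBOSE=1 in the shell
# before starting Python is the safer route.
os.environ['MKL_VERBOSE'] = '1'

import mxnet as mx

a = mx.nd.random.uniform(shape=(512, 512))
b = mx.nd.random.uniform(shape=(512, 512))
mx.nd.dot(a, b).wait_to_read()
# With an MKL-linked build, each GEMM prints a line such as
# "MKL_VERBOSE SGEMM(...)" including matrix sizes and elapsed time.
```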

sandeep-krishnamurthy commented 6 years ago

@pengzhao-intel @anirudh2290 - Thanks for your comments. As a next step, I will build from source with USE_BLAS set to MKL and report back on any performance gains.

@anirudh2290 - To summarize your comment: are you saying mx.sym.dot() does not use an efficient MKL GEMM implementation?

@pengzhao-intel - If I do pip install mxnet-mkl, are you saying we don't get MKL linked? If I use mxnet-mkl on an AWS C5 instance with MKL-DNN, will it use MKL?

TaoLv commented 6 years ago

I'm afraid the mxnet-mkl package is built with USE_BLAS=openblas. You can build from source with USE_BLAS=mkl if you have the MKL library installed. Also, do you know how much of the computation time is consumed by the LSTM layer versus other non-RNN fully connected layers? We are working on a fused LSTM operator for MXNet on CPU; hopefully that will help you a lot.
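One way to get that split is the built-in profiler (a sketch assuming the MXNet >= 1.2 profiler API; the dummy loop stands in for the actual Keras training step):

```python
import mxnet as mx

# Dump a per-operator breakdown so time spent in dot/FullyConnected
# can be separated from time spent in the RNN ops.
mx.profiler.set_config(profile_all=True, filename='profile.json')
mx.profiler.set_state('run')

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
for _ in range(10):          # stand-in for the real training loop
    mx.nd.dot(a, b)
mx.nd.waitall()

mx.profiler.set_state('stop')
print(mx.profiler.dumps())   # aggregate stats; profile.json also opens in chrome://tracing
```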

lupesko commented 6 years ago

We're updating the labels to better indicate MXNet Backend issues. @sandeep-krishnamurthy can you please update the label from "C++" to "Backend"? Thanks!

lupesko commented 6 years ago

@pengzhao-intel is this something you guys can help with?

pengzhao-intel commented 6 years ago

@lupesko Sure

pengzhao-intel commented 6 years ago

Regarding dot: it is essentially a library operation (GEMM), so there is not much that can be optimized at the framework level. Simply switching to Intel MKL will achieve better performance.
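For anyone landing here later, a sketch for checking which BLAS the installed binary is actually linked against (Linux-only; assumes libmxnet.so sits next to the Python package, as in the pip wheels):

```python
import os
import subprocess
import mxnet as mx

# List the shared-library dependencies of libmxnet.so and filter for BLAS.
libpath = os.path.join(os.path.dirname(mx.__file__), 'libmxnet.so')
deps = subprocess.check_output(['ldd', libpath]).decode()
print('\n'.join(line for line in deps.splitlines()
                if 'mkl' in line or 'blas' in line))
```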