apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Performance of MXNet on Windows is lower than that on Linux by 15%-20% #8335

Open chinakook opened 6 years ago

chinakook commented 6 years ago

I've been testing MXNet on both Win10 and Ubuntu 16.04 for a long time, and I've found that performance on Windows is 15%-20% lower than on Linux, in both GPU and CPU contexts. I wonder what causes this gap, and I hope someone can solve the problem. My configurations are as follows:

- MXNet on Ubuntu: MKL BLAS / CUDA 8 / cuDNN 6 / OpenCV 2.4.13 / no jemalloc, no gperftools / built with Makefile
- MXNet on Windows: MKL BLAS / CUDA 8 / cuDNN 6 / OpenCV 2.4.13 / no jemalloc, no gperftools / built with CMake

wzhang1 commented 6 years ago

I've seen some cuDNN examples run slower on Windows than on Linux, so I gave up on Windows. Is this an MXNet problem?

chinakook commented 6 years ago

But CPU performance on Windows is also lower, and our customers use Windows, so I can't give it up.

zhreshold commented 6 years ago

What version of VC++ are you compiling with?

chinakook commented 6 years ago

@zhreshold VS2015 Update 3

zhreshold commented 6 years ago

Well, I don't think this is a problem specific to MXNet, but rather a general performance difference between Windows and Linux.

KellenSunderland commented 6 years ago

@zhreshold Can you elaborate on what would make Windows slower in general than Linux? Intel claims MKL performance should be roughly similar, and the critical path when running MXNet on CPU should be MKL calls.
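
One way to sanity-check that claim on the CPU side (a hedged sketch, and not MXNet-specific: `MKL_VERBOSE` is Intel MKL's own logging switch, which has to be set before the library loads):

```python
import os
os.environ["MKL_VERBOSE"] = "1"  # must be set before MKL is loaded

import mxnet as mx

# A single large GEMM; if MXNet's CPU critical path really is MKL,
# the call shows up as an MKL_VERBOSE line on stdout.
a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
c = mx.nd.dot(a, b)
c.wait_to_read()
```

If the verbose lines appear on Linux but not on Windows, or show different kernels, that alone could explain a gap.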

@chinakook Are you seeing this during both training and inference? Does it happen with a public dataset/model that could be used as a reference? I think the next step would be profiling the performance on both platforms and looking for relative differences, for example operators that take much longer on Windows than on Linux.
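
For the profiling step, MXNet ships a built-in profiler. A rough sketch of how one could use it (API names as in recent 1.x releases, where `mx.profiler.set_config` replaced the older `profiler_set_config`; the workload below is a stand-in):

```python
import mxnet as mx

mx.profiler.set_config(profile_all=True, filename='profile_output.json')
mx.profiler.set_state('run')

# Stand-in workload; replace with real training/inference iterations.
a = mx.nd.random.uniform(shape=(1024, 1024))
for _ in range(10):
    b = mx.nd.dot(a, a)

mx.nd.waitall()                 # flush the async engine so all ops are recorded
mx.profiler.set_state('stop')
mx.profiler.dump()              # writes profile_output.json
```

Loading the JSON from each platform into chrome://tracing and diffing per-operator timings should show where the Windows run loses time.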

chinakook commented 6 years ago

@KellenSunderland Yes, it happens during both training and inference. I'll try to put together some test cases.
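
Something along these lines, for example (a hypothetical sketch using a stock Gluon model-zoo network, so anyone can reproduce it; switch `ctx` for the GPU comparison):

```python
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()                     # or mx.gpu(0) for the GPU comparison
net = vision.resnet18_v1()
net.initialize(ctx=ctx)

x = mx.nd.random.uniform(shape=(32, 3, 224, 224), ctx=ctx)
net(x).wait_to_read()              # warm-up (lazy init, cuDNN autotuning)

iters = 100
start = time.time()
for _ in range(iters):
    y = net(x)
mx.nd.waitall()                    # MXNet is asynchronous; sync before reading the clock
print("avg forward time: %.2f ms" % ((time.time() - start) / iters * 1000))
```

Running the identical script on both OSes with the same build options would make the numbers directly comparable.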

chinakook commented 6 years ago

After installing MKL 2018 and building the latest MXNet version, CPU performance on Windows is nearly the same as on Ubuntu. However, there is still a performance gap between Windows and Linux on GPU.

KellenSunderland commented 6 years ago

Good to hear that the performance is similar with the new MKL version. Thanks for the update @chinakook . It would still be interesting to see where the perf gaps are when using GPU.

szha commented 6 years ago

@apache/mxnet-committers: This issue has been inactive for the past 90 days. It has no label and needs triage.

For general "how-to" questions, our user forum (and Chinese version) is a good place to get help.

lanking520 commented 6 years ago

@nswamy please add 'Performance', 'Windows' to this topic

paoloviviani commented 5 years ago

This is really weird, as in my case Windows is four times faster than Linux when training a fully connected network on CPU. The training time is exactly the same whether or not MKL-DNN and jemalloc are used. BLAS is OpenBLAS. This holds for versions 1.2.0 and 1.3.1.

chinakook commented 5 years ago

@paoloviviani The MNIST dataset is too small for benchmarking; the bottleneck will be IO.
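
To make that concrete, a compute-only microbenchmark on a single in-memory batch takes the input pipeline out of the measurement entirely (a sketch; the shapes are arbitrary):

```python
import time
import mxnet as mx

x = mx.nd.random.uniform(shape=(256, 784))
w = mx.nd.random.uniform(shape=(784, 512))
mx.nd.waitall()                    # finish initialization before timing

iters = 1000
start = time.time()
for _ in range(iters):
    y = mx.nd.dot(x, w)
mx.nd.waitall()                    # flush MXNet's async engine
print("per-matmul: %.3f ms" % ((time.time() - start) / iters * 1000))
```

If this number is similar across platforms while end-to-end training time differs, the gap is in IO rather than compute.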

paoloviviani commented 5 years ago

@chinakook In my case it's not MNIST; the iterator is built from an std::vector of NDArrays that is already loaded in memory. But I see your point, and I might look elsewhere for the bottleneck.

paoloviviani commented 5 years ago

@chinakook Just for the record: my issue was caused by OpenBLAS having been compiled for a different Intel architecture. Compiling it with DYNAMIC_ARCH improved performance by a factor of 3.
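
For anyone debugging a similar slowdown: OpenBLAS's `DYNAMIC_ARCH=1` build flag compiles kernels for multiple CPU families and selects one at load time, which avoids exactly this mismatch. As a first sanity check, newer MXNet releases (around 1.5+) can also report which math backends the binary was built against; a sketch, noting that `mxnet.runtime` does not exist on older builds:

```python
# Check which math backends this MXNet binary was compiled with.
# mxnet.runtime appeared around MXNet 1.5; on older builds this import fails.
from mxnet.runtime import Features

features = Features()
for name in ("BLAS_OPEN", "BLAS_MKL", "MKLDNN", "CUDA"):
    print(name, features.is_enabled(name))
```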

chinakook commented 5 years ago

Thanks, I'll try it later.