chinakook opened this issue 6 years ago
I've seen some cuDNN examples run slower on Windows than on Linux, so I gave up on Windows. Is this an MXNet problem?
But CPU performance on Windows is also lower. Our customers use Windows, so I cannot give it up.
What version of VC++ are you compiling with?
@zhreshold VS2015 Update 3
Well, I don't think it's a problem specific to MXNet, but rather a general performance difference between Windows and Linux.
@zhreshold Can you elaborate on what would make Windows slower in general than Linux? Intel claims MKL performance should be roughly similar, and the critical path when running MXNet on CPU should be MKL calls.
@chinakook Are you seeing this during both training and inference? Does it happen with a public dataset / model that someone could use as a reference? I think the next step would be profiling the performance on both platforms and looking for relative differences, for example operators that take much longer on Windows than on Linux.
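To compare per-operator timings on the two platforms, the built-in MXNet profiler can be used. A minimal sketch (assuming MXNet 1.x; the workload here is a placeholder matrix multiply, not the reporter's actual model):

```python
import mxnet as mx

# Configure the profiler to record all events and aggregate per-operator stats.
mx.profiler.set_config(profile_all=True,
                       aggregate_stats=True,
                       filename='mxnet_profile.json')
mx.profiler.set_state('run')

# Placeholder workload: run this on both Windows and Linux and diff the output.
a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
c = mx.nd.dot(a, b)
mx.nd.waitall()  # make sure all async operations have finished

mx.profiler.set_state('stop')
print(mx.profiler.dumps())  # per-operator timing table
```

Running the same script on both platforms and comparing the dumped tables should show which operators account for the gap.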
@KellenSunderland Yes, it happens during both training and inference. I'll try to provide some test cases.
After I installed MKL 2018 and built the latest MXNet version, CPU performance on Windows became nearly the same as on Ubuntu. However, there is still a performance gap between Windows and Linux on GPU.
Good to hear that the performance is similar with the new MKL version. Thanks for the update @chinakook . It would still be interesting to see where the perf gaps are when using GPU.
@apache/mxnet-committers: This issue has been inactive for the past 90 days. It has no label and needs triage.
For general "how-to" questions, our user forum (and Chinese version) is a good place to get help.
@nswamy please add 'Performance', 'Windows' to this topic
This is really weird, as in my case Windows is 4 times faster than Linux when training a fully connected network on CPU. The training time is exactly the same regardless of whether MKL-DNN and jemalloc are used. BLAS is OpenBLAS. This is true for versions 1.2.0 and 1.3.1.
@paoloviviani the MNIST dataset is too small for benchmarking; the bottleneck will be I/O.
@chinakook In my case it's not MNIST; the iterator is made of an std::vector of NDArrays, which is already loaded in memory. But I see your point, and I might look elsewhere for a bottleneck.
@chinakook Just for the record: my issue was related to OpenBLAS being compiled for a different Intel architecture. Compiling it with DYNAMIC_ARCH improved the performance by a factor of 3.
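For reference, a rebuild with runtime kernel dispatch looks roughly like this (a sketch assuming a Unix-like environment with git, make, and gcc; the install prefix is arbitrary):

```shell
# DYNAMIC_ARCH=1 builds kernels for many CPU families and selects
# the right one at runtime, instead of hard-coding the TARGET of
# the machine the library was compiled on.
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make DYNAMIC_ARCH=1
make PREFIX=/opt/openblas install
```

This makes the binary larger but avoids silently falling back to a generic (or wrong) kernel when the library is deployed on a different CPU than the build machine.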
Thanks, I will try it later.
I've been testing MXNet on both Win10 and Ubuntu 16.04 for a long time. I found that Windows performance is 15%-20% lower than Linux in both GPU and CPU contexts. So I wonder what causes the performance gap, and I hope someone can solve this problem. My configuration is as follows:
MXNet on Ubuntu: MKL BLAS / CUDA 8 / cuDNN 6 / OpenCV 2.4.13 / no jemalloc, no gperftools / built with Makefile
MXNet on Windows: MKL BLAS / CUDA 8 / cuDNN 6 / OpenCV 2.4.13 / no jemalloc, no gperftools / built with CMake
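For reproducibility, the Windows side of that configuration might be expressed as a CMake invocation along these lines (a sketch only: the option names follow MXNet 1.x CMake conventions and may differ between versions, and the generator string matches VS2015 from earlier in the thread):

```shell
:: Hypothetical MXNet Windows build mirroring the reported configuration
cmake .. -G "Visual Studio 14 2015 Win64" ^
    -DBLAS=MKL ^
    -DUSE_CUDA=1 ^
    -DUSE_CUDNN=1 ^
    -DUSE_OPENCV=1 ^
    -DUSE_JEMALLOC=0
cmake --build . --config Release
```

One thing worth checking when comparing against the Makefile build on Linux is that both builds end up with the same compiler optimization flags (e.g. /O2 vs -O3) and the same feature set enabled, since the two build systems do not necessarily share defaults.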