[RFC] MXNet: Improve the build with DNNL, MKL, and OpenMP.

akarbown commented 3 years ago

[Author of the RFC: @TaoLv ]

Problem statement

This RFC targets discussing the strategies for building MXNet with Intel DNNL, MKL, and different OpenMP runtimes for different platforms. It will help to address (or mitigate if not fully addressed) the issues [1][2][3][4][5][6] and pave the way towards the CMake build system for the project. After all of these are in place, we can expect a better build experience across different platforms and keeping the promising performance on Linux.

The content can be divided into the following parts:

Clear the build flags for DNNL, MKL BLAS, OpenMP;
The build logic and decisions for DNNL;
The build logic and decisions for MKL BLAS;
Other considerations, eg. performance and interoperability.

Proposed solutions

Build Flags We propose to keep/promote the flags below in the future CMake build system:

Flags	Options	Description
USE_MKLDNN (or USE_DNNL)	ON, OFF	Whether to build MXNet with DNNL library to accelerate some operators.
USE_BLAS	openblas, mkl, apple, atlas	Choose BLAS library.
USE_OPENMP	ON, OFF	Whether to use OpenMP threading model.
MKL_USE_STATIC_LIBS	ON, OFF	Whether to link static libraries of MKL.
MKL_USE_ILP64	ON, OFF	Turn it ON when INT64 tensor is enabled.

And deprecate the flags as follows:

Flags	Justifications
USE_MKLML	MKLML library has been EOL and removed since MKL-DNN v1.0.
USE_MKL2017	MKL2017 integration has been removed from MXNet since MKL-DNN integrated.
USE_STATIC_MKL	Replaced by MKL_USE_STATIC_LIBS
USE_MKL_IF_AVAILABLE	Duplicate with USE_BLAS. Confusing when both of them are set.
MKL_USE_SINGLE_DYNAMIC_LIBRARY	We don’t want users to link libmkl_rt.so which depends on more explicit control during runtime. Removing it can make the code behavior and library linkage more deterministic.

Build with OpenMP As we all know, linking multiple OpenMP runtimes in a single application is error prone. To mitigate the long standing issue of OpenMP conflicts in MXNet, we suggest to adopt the same default linking behavior from DNNL library. That is to dynamically link the OpenMP runtime library which is provided by the compiler/system. It will help us on:

Easing the build logic for choosing different OpenMP runtimes;
Mitigating the interoperability of different OpenMP runtimes in different compiler ecosystems.

Users can decide whether to enable OpenMP threading by the USE_OPENMP flag. Once it’s set to OFF, the backend libraries like DNNL or MKL BLAS should also disable OpenMP threading and run in a sequential mode. With that being said, there is no need to distribute the source code of LLVM OpenMP or build it from scratch. We can rely on compilers to pull in different OpenMP runtimes. Please refer to the OpenMP.cmake module of DNNL for more implementation details. (A more radical approach is to provide an option for users to choose a different OpenMP runtime. That can lead to better performance, eg. link Intel OpenMP explicitly, but is more risky.)

Build with MKL-DNN (or DNNL) Intel MKL-DNN was renamed with DNNL in its v1.1 release. Since then, the MXNet community has been working on the transition to DNNL to leverage the latest features and optimizations from the library. That includes using the string “DNNL” or “dnnl” for future development and communication. We propose to promote the flag “USE_DNNL” since MXNet 2.0 and start deprecating “USE_MKLDNN” at the same time. DNNL source code resides in the 3rdparty/mkldnn folder of the MXNet repository and is released and distributed along with MXNet source code. If one wants to build MXNet with DNNL to accelerate the execution on Intel CPU, she/he needs to enable -DUSE_DNNL=ON in CMake. However, this flag has been set to ON by default for all platforms except edge devices. On the contrary, to disable the DNNL acceleration, one needs to set -DUSE_DNNL=OFF explicitly in the CMake command line or the CMake configuration file. As both MXNet and DNNL are under quick development with different release cadence, we decide to link the DNNL library into MXNet statically to avoid mis-linking in the user's environment. Given this, we need to set DNNL_LIBRARY_TYPE to STATIC when building DNNL. Some additional flags to build DNNL:

DNNl_CPU_RUNTIME: Need set it to SEQ explicitly when USE_OPENMP=OFF;
DNNL_ARCH_OPT_FLAGS: Need pass compiler options to this build flag in string. Eg. -march or -mtune for GCC.
MKLDNN_BUILD_TESTS and MKLDNN_BUILD_EXAMPLES: We set these two flags to OFF to speed up the compilation.

One thing that needs to be taken care of is that the header dnnl_config.h and dnnl_version.h will be generated dynamically during compilation and will be copied to the installation destination when calling make install. That means these two headers are not distributed with DNNL source code. For downstream projects which are including these headers need to find them in the installation path rather than the source code path.

Build with MKL BLAS MXNet users can choose a BLAS library through the flag USE_BLAS which supports openblas, mkl, atlas, and apple for MacOS. To use Intel MKL BLAS, one can install it through apt or yum following the instructions from Intel: Installing Intel® Performance Libraries and Intel® Distribution for Python* Using APT Repository. MXNet also provides a tool script for Ubuntu, please refer to the ubuntu_mkl.sh under ci/docker/install. For linking MKL BLAS to MXNet, we suggest following the advice from Intel® Math Kernel Library Link Line Advisor. As we can see from the advisor tool, we need note that:

Clang is not supported on Linux;
LLVM OpenMP runtime is not supported (thus the following configuration: MXNet + clang + MKL + linux should not be supported);
When using GCC on Linux, it’s suggested to link the GNU counterpart libraries and OpenMP runtime, eg. libmkl_gnu_thread.a/so and libgomp.so;
On Windows/MacOS, MKL BLAS can only be used with Intel OpenMP; (???)
When OpenMP is disabled, we need to link libmkl_sequential.a/so instead of libmkl_gnu/intel_thread.a/so;
For large tensors with one dimension size > INT32_MAX, to make MKL BLAS work, we need to link libmkl_intel_ilp64.a/so rather than libmkl_intel_lp64.a/so.

Given the constraints above, a typical CMake logic for MKL should look as follows. We still need to add more fine-grained checks for different platform and compiler combinations. In this proposal, we suggest linking MKL BLAS libraries dynamically. The are mainly two reasons to do so:

Statically linking MKL libraries will considerably increase the size of MXNet binary.
Statically linking MKL libraries needs recompilation in case of any changes in the MKL libraries.

Same as other third party dependencies, static linking will help us to better distribute MXNet to different systems and environments, without any unexpected functionality or performance issues. We provide the flag MKL_USE_STATIC_LIBS to enable static linking when it’s needed. Anyway, disabling static linking is also supported by setting MKL_USE_STATIC_LIBS to OFF.

Performance and Interoperability

Performance

Although DNNL has provided many performance primitives to accelerate those NN related operators in MXNet, we still depend on MKL to improve the performance of other linear algebra, random number generation, and vector operations through it’s BLAS, VML, VSL libraries. We strongly encourage users to build MXNet with MKL BLAS and the community to release convenient binaries with MKL enabled. It's common sense that Intel OpenMP outperforms other OpenMP runtimes on Intel CPUs. We also hope users will link to Intel OpenMP for better performance. But because we are suggesting to link the OpenMP runtime provided by the compiler in the above section, now the problem is that we need to enable the Intel compiler build process which seems to be broken at this moment [6]. Given this, the recommended build on Linux so far should be: GCC + MKL BLAS + DNNL + GNU OpenMP.

Interoperability of OpenMP

Though we can address the dual linkage of OpenMP in MXNet and hence remove the conflicts reported in [1], we still need to be aware of the risks in downstream projects. One possible scenario is described in [7]. When MXNet is linking with one OpenMP runtime and the user assembles it with another tool which is linking with another OpenMP runtime (or another version of the same OpenMP runtime), it’s still problematic. That’s why we’re suggesting to link OpenMP according to compiler and link it in dynamical way:

We assume that the OpenMP runtime has better interoperability in its own compiler ecosystem. Eg. GNU OpenMP in the GCC community.

Dynamically linking makes it possible for users to create symbolic links to workaround the conflict issue or pre-load another runtime.

References

[1] #17641 [2] #17366 [3] #10856 [4] #9205 [5] #11417 [6] #14086 [7] #8532

github-actions[bot] commented 3 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

akarbown commented 3 years ago

CC: @leezu, @TaoLv, @anko-intel

leezu commented 3 years ago

Thank you @akarbown for opening this RFC. It LGTM overall. Some minor comments / questions below:

I think MKL_USE_ILP64 does not need to be exposed to the user and we may re-use the MXNet USE_INT64_TENSOR_SIZE flag. We should switch to USE_DNNL sooner rather than later.

If ILP64 is used, we may be forced to static link MKL, to avoid symbol conflicts with LP64 Blas libraries that could be loaded into the same process if users import MXNet and NumPy in the same Python process. cc @access2rohit @Zha0q1

To use Intel MKL BLAS, one can install it through apt or yum following the instructions from Intel: Installing Intel® Performance Libraries and Intel® Distribution for Python* Using APT Repository.

We also need to support the upstream Ubuntu MKL packages which are available from Ubuntu 20.04 https://packages.ubuntu.com/focal/intel-mkl In my preliminary trial, the following patch works to find Ubuntu intel-mkl package (and may allow us to delete MXNet's FindMKL.cmake):

modified   cmake/ChooseBlas.cmake
@@ -47,16 +47,15 @@ elseif(BLAS STREQUAL "Open" OR BLAS STREQUAL "open")
   add_definitions(-DMXNET_USE_BLAS_OPEN=1)
 elseif(BLAS STREQUAL "MKL" OR BLAS STREQUAL "mkl")
   if (USE_INT64_TENSOR_SIZE)
-    set(MKL_USE_ILP64 ON CACHE BOOL "enable using ILP64 in MKL" FORCE)
+    set(BLA_VENDOR Intel10_64ilp)
+    find_package(BLAS)
   else()
-    if(MKL_USE_ILP64)
-      message(FATAL_ERROR "MKL_USE_ILP64 cannot be set without USE_INT64_TENSOR_SIZE; "
-              "Please set USE_INT64_TENSOR_SIZE instead of MKL_USE_ILP64.")
-    endif()
+    set(BLA_VENDOR Intel10_64lp)
+    find_package(BLAS)
   endif()
-  find_package(MKL REQUIRED)
+  find_path(MKL_INCLUDE_DIR mkl.h PATHS ENV MKLROOT PATH_SUFFIXES include mkl REQUIRED)
   include_directories(SYSTEM ${MKL_INCLUDE_DIR})
-  list(APPEND mshadow_LINKER_LIBS ${MKL_LIBRARIES})
+  list(APPEND mshadow_LINKER_LIBS ${BLAS_LIBRARIES})
   add_definitions(-DMSHADOW_USE_CBLAS=0)
   add_definitions(-DMSHADOW_USE_MKL=1)
   add_definitions(-DMXNET_USE_BLAS_MKL=1)

need to enable the Intel compiler build process which seems to be broken at this moment [6].

Is there any update on the ICC upstream compiler fix?

Zha0q1 commented 3 years ago

After https://github.com/apache/incubator-mxnet/commit/6fa208b4aedf069ddec0de7bd8d8e5cf7b62e97b
USE_INT64_TENSOR_SIZE controls lp64/ilp64 MKL and I think we are already defaulting MKL_USE_STATIC_LIBS to true

Zha0q1 commented 3 years ago

If ILP64 is used, we may be forced to static link MKL, to avoid symbol conflicts with LP64 Blas libraries that could be loaded into the same process if users import MXNet and NumPy in the same Python process. cc @access2rohit @Zha0q1

I think we should force static link at all time because of the performance boost of mkl over say openblas. Some cblas functions are a few times faster and lapack functions haven been seen to have 25x boost

leezu commented 3 years ago

I think we should force static link at all time because of the performance boost of mkl over say openblas

Static link and performance differences between mkl and openblas are two separate topics. Could you elaborate your reasoning?

Zha0q1 commented 3 years ago

I think we should force static link at all time because of the performance boost of mkl over say openblas

Static link and performance differences between mkl and openblas are two separate topics. Could you elaborate your reasoning?

Yes, my point is that if numpy uses openblas and mxnet uses mkl then we certainly do not want openblas to mask mkl

TaoLv commented 3 years ago

@pengzhao-intel @ciyongch @yinghu5 Please help to review.

We should switch to USE_DNNL sooner rather than later.

The proposal was drafted quite a while ago. Since then, DNNL library has been renamed to oneDNN to be consistent with the rest of oneAPI libraries [1]. Maybe we need also consider to use flag USE_ONEDNN? @mgouicem

[1] https://github.com/oneapi-src/oneDNN#oneapi-deep-neural-network-library-onednn

akarbown commented 3 years ago

Thank you @akarbown for opening this RFC. It LGTM overall. Some minor comments / questions below:

I think MKL_USE_ILP64 does not need to be exposed to the user and we may re-use the MXNet USE_INT64_TENSOR_SIZE flag. We should switch to USE_DNNL sooner rather than later.

If ILP64 is used, we may be forced to static link MKL, to avoid symbol conflicts with LP64 Blas libraries that could be loaded into the same process if users import MXNet and NumPy in the same Python process. cc @access2rohit @Zha0q1

To use Intel MKL BLAS, one can install it through apt or yum following the instructions from Intel: Installing Intel® Performance Libraries and Intel® Distribution for Python* Using APT Repository.

We also need to support the upstream Ubuntu MKL packages which are available from Ubuntu 20.04 https://packages.ubuntu.com/focal/intel-mkl In my preliminary trial, the following patch works to find Ubuntu intel-mkl package (and may allow us to delete MXNet's FindMKL.cmake):
modified   cmake/ChooseBlas.cmake
@@ -47,16 +47,15 @@ elseif(BLAS STREQUAL "Open" OR BLAS STREQUAL "open")
   add_definitions(-DMXNET_USE_BLAS_OPEN=1)
 elseif(BLAS STREQUAL "MKL" OR BLAS STREQUAL "mkl")
   if (USE_INT64_TENSOR_SIZE)
-    set(MKL_USE_ILP64 ON CACHE BOOL "enable using ILP64 in MKL" FORCE)
+    set(BLA_VENDOR Intel10_64ilp)
+    find_package(BLAS)
   else()
-    if(MKL_USE_ILP64)
-      message(FATAL_ERROR "MKL_USE_ILP64 cannot be set without USE_INT64_TENSOR_SIZE; "
-              "Please set USE_INT64_TENSOR_SIZE instead of MKL_USE_ILP64.")
-    endif()
+    set(BLA_VENDOR Intel10_64lp)
+    find_package(BLAS)
   endif()
-  find_package(MKL REQUIRED)
+  find_path(MKL_INCLUDE_DIR mkl.h PATHS ENV MKLROOT PATH_SUFFIXES include mkl REQUIRED)
   include_directories(SYSTEM ${MKL_INCLUDE_DIR})
-  list(APPEND mshadow_LINKER_LIBS ${MKL_LIBRARIES})
+  list(APPEND mshadow_LINKER_LIBS ${BLAS_LIBRARIES})
   add_definitions(-DMSHADOW_USE_CBLAS=0)
   add_definitions(-DMSHADOW_USE_MKL=1)
   add_definitions(-DMXNET_USE_BLAS_MKL=1)
I'll take a look into it.

need to enable the Intel compiler build process which seems to be broken at this moment [6].

Is there any update on the ICC upstream compiler fix?

I don't know yet. I'll keep you posted as I'll get more knowledge on this issue.

mgouicem commented 3 years ago

The proposal was drafted quite a while ago. Since then, DNNL library has been renamed to oneDNN to be consistent with the rest of oneAPI libraries [1]. Maybe we need also consider to use flag USE_ONEDNN? @mgouicem

Yes, the library was renamed to oneDNN. It makes sense to rename the flag to USE_ONEDNN since this is the latest official name for the library.

akarbown commented 3 years ago

And deprecate the flags as follows: USE_STATIC_MKL Replaced by MKL_USE_STATIC_LIBS

Do you think this replacement is necessary or just leave it with BLA_STATIC (as it's done for now and how it's provided by upstream cmake FindBLAS)?

leezu commented 3 years ago

Just using the upstream BLA_STATIC should be fine.

akarbown commented 3 years ago

I'd like to reignite the discussion connected with the following areas and give some update: • Removing MKL_USE_SINGLE_DYNAMIC_LIBRARY (mentioned in the 'Build Flags' RFC part) • Dynamically linking the OpenMP runtime library which is provided by the compiler/system (mentioned in the 'Build with OpenMP' & 'Interoperability OpenMP' part) • Having two runtime OpenMPs in the same process what resulted in the hang ('Interoperability OpenMP' part).

I think that knowing the root-cause of the hang (described here: #18255) it's worth to reconsider leaving the MKL_USE_SINGLE_DYNAMIC_LIBRARY and using that SDL as the solution of the problem connected with local/global symbol lookup namespaces appeared (RTLD_GLOBAL/RTLD_LOCAL) that appeared in the runtime while linking all the dynamic libraries separately. Providing that the issue reported internally will be resolved.

I've removed LLVM OpenMP as the start of the process of enabling compiler based OpenMP. Thus, compiling with GCC resulted in linking libgomp (GNU OpenMP). However, it caused some performance drop that needs to be investigated. (#20092).

akarbown commented 3 years ago

I'd like to reignite the discussion connected with the following areas and give some update: • Removing MKL_USE_SINGLE_DYNAMIC_LIBRARY (mentioned in the 'Build Flags' RFC part) • Dynamically linking the OpenMP runtime library which is provided by the compiler/system (mentioned in the 'Build with OpenMP' & 'Interoperability OpenMP' part) • Having two runtime OpenMPs in the same process what resulted in the hang ('Interoperability OpenMP' part).

I think that knowing the root-cause of the hang (described here: #18255) it's worth to reconsider leaving the MKL_USE_SINGLE_DYNAMIC_LIBRARY and using that SDL as the solution of the problem connected with local/global symbol lookup namespaces appeared (RTLD_GLOBAL/RTLD_LOCAL) that appeared in the runtime while linking all the dynamic libraries separately. Providing that the issue reported internally will be resolved.

I've removed LLVM OpenMP as the start of the process of enabling compiler based OpenMP. Thus, compiling with GCC resulted in linking libgomp (GNU OpenMP). However, it caused some performance drop that needs to be investigated. (#20092).

The bug will be fixed in the next oneAPI release - 2021.3. Just after the release I'll open the PR addressing this problem.

akarbown commented 2 years ago

@szha, @leezu - can we close that RFC? Or there is still something unresolved?

szha commented 2 years ago

@akarbown thanks!

apache / mxnet