Closed mjsML closed 3 years ago
Did you update all submodules?
@leezu Sorry, I don't understand what you mean by submodules ... I cloned master recursively and built tensorrt-onnx and onnx before I tried to build the root package.
Ok, that means you have up-to-date submodules. Otherwise `git submodule update --init --recursive` will update them.
How did you build and install tensorrt-onnx and onnx?
I used the steps mentioned below from the second script (the Docker CI script): How to build mxnet with tensorrt support?
I deleted everything now and started from scratch with the submodule update and now building...
That script installs packages for Ubuntu 16.04, for example:

```shell
wget -qO tensorrt.deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
```

That may be causing your issues.
Sorry, I wasn't clear. I didn't use the entire script ... I used only the script steps below, because I got another error while building and these steps fixed it:
```shell
cmake \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_FLAGS=-I/usr/include/python${PYVER} \
    -DBUILD_SHARED_LIBS=ON \
    -G Ninja \
    ..
ninja -j 1 -v onnx/onnx.proto
ninja -j 1 -v
export LIBRARY_PATH=`pwd`:`pwd`/onnx/:$LIBRARY_PATH
export CPLUS_INCLUDE_PATH=`pwd`:$CPLUS_INCLUDE_PATH

# Build ONNX-TensorRT
cd 3rdparty/onnx-tensorrt/
mkdir -p build
cd build
cmake \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    ..
make -j$(nproc)
export LIBRARY_PATH=`pwd`:$LIBRARY_PATH
```
@leezu even with the submodule update the build still failed with the pthread and fopen64 errors. I'll try to build it with make instead of ninja and see what happens.
It seems the compilation instructions and the process for building with TensorRT are lacking. How was your experience with the make build?
For cmake, I suggest we improve the process to build onnx and onnx-tensorrt automatically as part of the MXNet build, similar to openmp. @mjsML are you familiar with cmake and would like to contribute?
@leezu The pthread errors had to do with some missing libs... Specifically, installing these packages with apt:

```shell
apt-get install --no-install-recommends \
    software-properties-common apt-transport-https \
    build-essential cmake libjemalloc-dev \
    libatlas-base-dev liblapack-dev liblapacke-dev libopenblas-dev libopencv-dev \
    libcurl4-openssl-dev libzmq3-dev ninja-build libhdf5-dev libomp-dev
```

fixed the errors related to those, but the assert.h errors inside MKLDNN didn't go away.
After a few hours banging my head against the wall, I got to the root cause ... MKLDNN has a few "header leaks", which means I had to manually go into the MKLDNN src and add #include <assert.h>.
I ended up ditching MKLDNN (by setting -DUSE_MKLDNN=0) because my main target is to get a fast GPU build that utilizes the Nvidia packages (mainly NCCL and TensorRT on x64 with my desktop training machine, and TensorRT on aarch64 with the Jetson Nano) ... When I have more time I'll pull on the Intel repo to fix the header issues; however, I'm not sure how you sync the 3rd party folder with the source?
Now the build passes but the tests fail with linking errors ... I did specify that I wanted to use MKL for BLAS but I'm getting this when linking the examples / tests:
```
/usr/bin/ld: libmxnet.a(la_op.cc.o): undefined reference to symbol 'cblas_dtrsm'
//usr/lib/x86_64-linux-gnu/libblas.so.3: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
```
I then built the latest OpenBLAS (0.3.8) and updated the symbols via update-alternatives, per the instructions:

```shell
sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 /opt/OpenBLAS/lib/libopenblas.so.0 41 \
    --slave /usr/lib/liblapack.so.3 liblapack.so.3 /opt/OpenBLAS/lib/libopenblas.so.0
```
This was futile as well, as I'm still stuck with the same linking error ... I'm at a loss as to why we need to link OpenBLAS in the first place if I built with MKL as my BLAS of choice.
Also, sure, I'd love to contribute to the build process or otherwise :) ... imho we actually need a few new build types ... my suggestion is Training and Inference by accelerator type. In this example, I'm building a CUDA training build (CUDA, cuDNN, NCCL, and TensorRT), while an Inference build would have CUDA, cuDNN, and TensorRT only (in my case on an aarch64 edge accelerator too :/ ) ... [Insert accelerator type here] needs the same ... in my mind this is like the "Debug" and "Release" configs in the ML world. Food for thought.
@pengzhao-intel are you aware of the header leaks mentioned by @mjsML?
@mjsML I suggest you don't link with MKL for now to fix your problem given your "main target is to get a fast GPU build that utilizes the Nvidia packages". There may be a bug in the build setup with MKL + the nvidia libs you mention.
however I'm not sure how you sync the 3rd party folder with the source?
The 3rdparty folder is updated from time to time when new versions of MKL-DNN are released. Such a 3rdparty update may also go together with an update to the mxnet source code to implement necessary changes due to API changes in MKL-DNN.
Also sure I'd love to contribute to the build process or otherwise :) ... imho we need a few new build types actually ... my suggestions are Training and Inference by accelerator type
Great. The first step would be to integrate the tensorrt build into the mxnet cmake build (assuming tensorrt has sufficiently good cmake support). Providing recommended "Training and Inference by accelerator type" build configurations and making them easy to build would be great.
@pengzhao-intel as a concrete example /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_threading.hpp
and /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_pack_storage.hpp
are missing the assert.h includes while using assert ... I suspect the other internal header leaks (e.g. includes of utils.hpp) are masked by build / compiler arguments / include directories that neither my compiler nor I were able to locate ...
@leezu I'll try to build without MKL and see what happens then.
@leezu Success!
```python
>>> from mxnet.runtime import feature_list
>>> feature_list()
[✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✔ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✔ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]
```
So it seems that MKL and MKLDNN with NVIDIA are the culprits here ... I'll pull on the build guides later tomorrow then ... Cheers and thanks for the help :) falls asleep
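For reference, the configure line this corresponds to is roughly the one from the issue description with the MKL flags dropped and MKLDNN disabled explicitly (a sketch only; the CUDA/NCCL paths are machine-specific):

```shell
cmake -DUSE_CUDA=1 -DUSE_CUDA_PATH=/usr/local/cuda -DUSE_CUDNN=1 \
      -DUSE_MKLDNN=0 -DCMAKE_BUILD_TYPE=Release \
      -DUSE_TENSORRT=1 -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local \
      -GNinja ..
```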
cc @wuxun-zhang for looking into the issue
@pengzhao-intel as a concrete example
/mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_threading.hpp
and /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_pack_storage.hpp
are missing the assert.h includes while using assert ... I suspect the other internal header leaks (e.g. includes of utils.hpp) are masked by build / compiler arguments / include directories that neither my compiler nor I were able to locate ...
Thanks for the information. We will investigate and fix the problem.
@mxnet-label-bot add [MKL]
Hey, I can't reproduce the issue with the assert.h headers on the 1.x branch; however, I was able to reproduce the issue with linking the tests. It can be solved by passing the full path to libmkl_rt.so in -DMKL_RT_LIBRARY, e.g. when I passed -DMKL_RT_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_rt.so everything worked fine.
@szha can we close this issue? I think the issue is solved in newer oneDNN versions.
Description
Trying to build mxnet with the following config:

```shell
cmake -DUSE_CUDA=1 -DUSE_CUDA_PATH=/usr/local/cuda -DUSE_CUDNN=1 -DUSE_MKLDNN=1 \
      -DCMAKE_BUILD_TYPE=Release -DUSE_TENSORRT=1 -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local \
      -DMKL_INCLUDE_DIR=/home/mj/intel/mkl/include -DMKL_RT_LIBRARY=/home/mj/intel/mkl/lib \
      -GNinja ..
```
cmake generates the build files successfully. Then the build fails at the very end. The CMake error log is below:
Error Message
To Reproduce
Simply run the cmake build on Ubuntu 18.04 with CUDA 10.0, TensorRT 7, NCCL 2.5.6, and Intel XE 2020 (latest MKL).
Steps to reproduce
What have you tried to solve it?
Reinstalled libboost; same error.
Environment