apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Error building on ubuntu 18.04 on x64 with intel XEP, CUDA10.0, NCCL and TensorRT #17238

Closed mjsML closed 3 years ago

mjsML commented 4 years ago

Description

Trying to build MXNet with the following config:

cmake -DUSE_CUDA=1 -DUSE_CUDA_PATH=/usr/local/cuda -DUSE_CUDNN=1 -DUSE_MKLDNN=1 -DCMAKE_BUILD_TYPE=Release -DUSE_TENSORRT=1 -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local -DMKL_INCLUDE_DIR=/home/mj/intel/mkl/include -DMKL_RT_LIBRARY=/home/mj/intel/mkl/lib -GNinja ..

CMake generates the build files successfully, but the build then fails at the very end. The CMakeError.log is below:

Error Message

Determining if the pthread_create exist failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_35eea 
[1/2] Building C object CMakeFiles/cmTC_35eea.dir/CheckSymbolExists.c.o
[2/2] Linking C executable cmTC_35eea
FAILED: cmTC_35eea 
: && /usr/lib/ccache/cc -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3   CMakeFiles/cmTC_35eea.dir/CheckSymbolExists.c.o  -o cmTC_35eea   && :
CMakeFiles/cmTC_35eea.dir/CheckSymbolExists.c.o: In function `main':
CheckSymbolExists.c:(.text.startup+0x3): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

File /home/mj/mxnet/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:
/* */
#include <pthread.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef pthread_create
  return ((int*)(&pthread_create))[argc];
#else
  (void)argc;
  return 0;
#endif
}

Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_e1686 
[1/2] Building C object CMakeFiles/cmTC_e1686.dir/CheckFunctionExists.c.o
[2/2] Linking C executable cmTC_e1686
FAILED: cmTC_e1686 
: && /usr/lib/ccache/cc -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DCHECK_FUNCTION_EXISTS=pthread_create   CMakeFiles/cmTC_e1686.dir/CheckFunctionExists.c.o  -o cmTC_e1686  -lpthreads && :
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_UNUSED_LOCAL_TYPEDEF_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_01d40 
[1/2] Building C object CMakeFiles/cmTC_01d40.dir/src.c.o
FAILED: CMakeFiles/cmTC_01d40.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_UNUSED_LOCAL_TYPEDEF_FLAG -fPIE   -Wunused-local-typedef -o CMakeFiles/cmTC_01d40.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wunused-local-typedef'; did you mean '-Wunused-local-typedefs'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_COVERED_SWITCH_DEFAULT_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_9c906 
[1/2] Building C object CMakeFiles/cmTC_9c906.dir/src.c.o
FAILED: CMakeFiles/cmTC_9c906.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_COVERED_SWITCH_DEFAULT_FLAG -fPIE   -Wcovered-switch-default -o CMakeFiles/cmTC_9c906.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wcovered-switch-default'; did you mean '-Wno-switch-default'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_DEPRECATED_REGISTER_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_d798e 
[1/2] Building C object CMakeFiles/cmTC_d798e.dir/src.c.o
FAILED: CMakeFiles/cmTC_d798e.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_DEPRECATED_REGISTER_FLAG -fPIE   -Wdeprecated-register -o CMakeFiles/cmTC_d798e.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wdeprecated-register'; did you mean '-frename-registers'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_GNU_ANONYMOUS_STRUCT_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_f6031 
[1/2] Building C object CMakeFiles/cmTC_f6031.dir/src.c.o
FAILED: CMakeFiles/cmTC_f6031.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_GNU_ANONYMOUS_STRUCT_FLAG -fPIE   -Wgnu-anonymous-struct -o CMakeFiles/cmTC_f6031.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wgnu-anonymous-struct'
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_SELF_ASSIGN_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_69435 
[1/2] Building C object CMakeFiles/cmTC_69435.dir/src.c.o
FAILED: CMakeFiles/cmTC_69435.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_SELF_ASSIGN_FLAG -fPIE   -Wself-assign -o CMakeFiles/cmTC_69435.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wself-assign'; did you mean '-Wcast-align'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_VLA_EXTENSION_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_664fc 
[1/2] Building C object CMakeFiles/cmTC_664fc.dir/src.c.o
FAILED: CMakeFiles/cmTC_664fc.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_VLA_EXTENSION_FLAG -fPIE   -Wvla-extension -o CMakeFiles/cmTC_664fc.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wvla-extension'; did you mean '-fms-extensions'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_WNO_FORMAT_PEDANTIC_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_be40b 
[1/2] Building C object CMakeFiles/cmTC_be40b.dir/src.c.o
FAILED: CMakeFiles/cmTC_be40b.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_WNO_FORMAT_PEDANTIC_FLAG -fPIE   -Wformat-pedantic -o CMakeFiles/cmTC_be40b.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-Wformat-pedantic'; did you mean '-Wno-pedantic'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_MMIC_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_aab2a 
[1/2] Building C object CMakeFiles/cmTC_aab2a.dir/src.c.o
FAILED: CMakeFiles/cmTC_aab2a.dir/src.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_MMIC_FLAG -mmic -fPIE   -mmic -o CMakeFiles/cmTC_aab2a.dir/src.c.o   -c src.c
cc: error: unrecognized command line option '-mmic'; did you mean '-fpic'?
cc: error: unrecognized command line option '-mmic'; did you mean '-fpic'?
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Performing C SOURCE FILE Test LIBOMP_HAVE_M32_FLAG failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_5861e 
[1/2] Building C object CMakeFiles/cmTC_5861e.dir/src.c.o
[2/2] Linking C executable cmTC_5861e
FAILED: cmTC_5861e 
: && /usr/lib/ccache/cc -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DLIBOMP_HAVE_M32_FLAG -m32  -rdynamic CMakeFiles/cmTC_5861e.dir/src.c.o  -o cmTC_5861e   && :
/usr/bin/ld: cannot find Scrt1.o: No such file or directory
/usr/bin/ld: cannot find crti.o: No such file or directory
/usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/7/libgcc.a when searching for -lgcc
/usr/bin/ld: cannot find -lgcc
/usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/7/libgcc_s.so.1 when searching for libgcc_s.so.1
/usr/bin/ld: cannot find libgcc_s.so.1
/usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/7/libgcc.a when searching for -lgcc
/usr/bin/ld: cannot find -lgcc
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Source file was:
int main(void) { return 0; }
Determining if files windows.h;psapi.h exist failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_df20f 
[1/2] Building C object CMakeFiles/cmTC_df20f.dir/LIBOMP_HAVE_PSAPI_H.c.o
FAILED: CMakeFiles/cmTC_df20f.dir/LIBOMP_HAVE_PSAPI_H.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3  -fPIE -o CMakeFiles/cmTC_df20f.dir/LIBOMP_HAVE_PSAPI_H.c.o   -c /home/mj/mxnet/build/CMakeFiles/CheckIncludeFiles/LIBOMP_HAVE_PSAPI_H.c
/home/mj/mxnet/build/CMakeFiles/CheckIncludeFiles/LIBOMP_HAVE_PSAPI_H.c:2:10: fatal error: windows.h: No such file or directory
 #include <windows.h>
          ^~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.

Source:
/* */
#include <windows.h>
#include <psapi.h>

int main(void){return 0;}

Determining if the function EnumProcessModules exists in the psapi failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_935d0 
[1/2] Building C object CMakeFiles/cmTC_935d0.dir/CheckFunctionExists.c.o
[2/2] Linking C executable cmTC_935d0
FAILED: cmTC_935d0 
: && /usr/lib/ccache/cc -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DCHECK_FUNCTION_EXISTS=EnumProcessModules   CMakeFiles/cmTC_935d0.dir/CheckFunctionExists.c.o  -o cmTC_935d0  -lpsapi && :
/usr/bin/ld: cannot find -lpsapi
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Determining if the function __atomic_load_1 exists failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_bf21b 
[1/2] Building C object CMakeFiles/cmTC_bf21b.dir/CheckFunctionExists.c.o
<command-line>:0:23: warning: conflicting types for built-in function ‘__atomic_load_1’ [-Wbuiltin-declaration-mismatch]
/usr/local/share/cmake-3.14/Modules/CheckFunctionExists.c:7:3: note: in expansion of macro ‘CHECK_FUNCTION_EXISTS’
   CHECK_FUNCTION_EXISTS(void);
   ^~~~~~~~~~~~~~~~~~~~~
[2/2] Linking C executable cmTC_bf21b
FAILED: cmTC_bf21b 
: && /usr/lib/ccache/cc -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -DCHECK_FUNCTION_EXISTS=__atomic_load_1  -rdynamic CMakeFiles/cmTC_bf21b.dir/CheckFunctionExists.c.o  -o cmTC_bf21b   && :
CMakeFiles/cmTC_bf21b.dir/CheckFunctionExists.c.o: In function `main':
CheckFunctionExists.c:(.text.startup+0xc): undefined reference to `__atomic_load_1'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Determining if the fopen64 exist failed with the following output:
Change Dir: /home/mj/mxnet/build/CMakeFiles/CMakeTmp

Run Build Command(s):/home/mj/anaconda3/bin/ninja cmTC_ee96b 
[1/2] Building C object CMakeFiles/cmTC_ee96b.dir/CheckSymbolExists.c.o
FAILED: CMakeFiles/cmTC_ee96b.dir/CheckSymbolExists.c.o 
/usr/lib/ccache/cc   -mf16c -Wall -Wno-unknown-pragmas -Wno-sign-compare -O3 -msse3 -fopenmp  -fPIE -o CMakeFiles/cmTC_ee96b.dir/CheckSymbolExists.c.o   -c CheckSymbolExists.c
CheckSymbolExists.c: In function ‘main’:
CheckSymbolExists.c:8:19: error: ‘fopen64’ undeclared (first use in this function); did you mean ‘fopen’?
   return ((int*)(&fopen64))[argc];
                   ^~~~~~~
                   fopen
CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in
ninja: build stopped: subcommand failed.

File /home/mj/mxnet/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:
/* */
#include <stdio.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef fopen64
  return ((int*)(&fopen64))[argc];
#else
  (void)argc;
  return 0;
#endif
}

To Reproduce

Run cmake and then the build on Ubuntu 18.04 with CUDA 10.0, TensorRT 7, NCCL 2.5.6, and Intel XE 2020 (latest MKL).

Steps to reproduce

cmake -DUSE_CUDA=1 -DUSE_CUDA_PATH=/usr/local/cuda -DUSE_CUDNN=1 -DUSE_MKLDNN=1 -DCMAKE_BUILD_TYPE=Release -DUSE_TENSORRT=1 -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local -DMKL_INCLUDE_DIR=/home/mj/intel/mkl/include -DMKL_RT_LIBRARY=/home/mj/intel/mkl/lib -GNinja ..

ninja -v

What have you tried to solve it?

Reinstalled libboost; same error.

Environment

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.7.4
Compiler     : GCC 7.3.0
Build        : ('default', 'Aug 13 2019 20:35:49')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.2
Directory    : /home/mj/anaconda3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.1
Directory    : /home/mj/anaconda3/lib/python3.7/site-packages/mxnet
Num GPUs     : 3
Commit Hash   : c9818480680f84daa6e281a974ab263691302ba8
----------System Info----------
Platform     : Linux-4.15.0-43-generic-x86_64-with-debian-buster-sid
system       : Linux
node         : prometheusu
release      : 4.15.0-43-generic
version      : #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Stepping:            3
CPU MHz:             4181.322
CPU max MHz:         4400.0000
CPU min MHz:         800.0000
BogoMIPS:            8000.51
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0034 sec, LOAD: 0.8277 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0006 sec, LOAD: 0.8006 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1911 sec, LOAD: 0.9508 sec.
Timing for D2L: http://d2l.ai, DNS: 0.1454 sec, LOAD: 0.2055 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0622 sec, LOAD: 0.4024 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2145 sec, LOAD: 1.0244 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.3087 sec, LOAD: 1.7032 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.1470 sec, LOAD: 0.5086 sec.

leezu commented 4 years ago

Did you update all submodules?

mjsML commented 4 years ago

@leezu Sorry, I don't understand what you mean by submodules ... I did clone master recursively and built tensorrt-onnx and onnx before I tried to build the root package.

leezu commented 4 years ago

Ok, that means you have up-to-date submodules. Otherwise, git submodule update --init --recursive will update them.
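
One way to double-check this is to look at the first character of each line printed by git submodule status --recursive: a `-` means the submodule is not initialized, a `+` means the checked-out commit differs from the one recorded in the superproject, and `U` marks merge conflicts. A small sketch of that check, assuming the output has already been captured as text; the function name `stale_submodules` is made up for illustration.

```python
def stale_submodules(status_output):
    """Return (flag, path) pairs for submodules that need attention.

    `status_output` is the text printed by
    `git submodule status --recursive`.
    """
    stale = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        # First character is the state flag, then "<sha> <path> [(describe)]".
        flag, fields = line[0], line[1:].split()
        if flag in ("-", "+", "U") and len(fields) >= 2:
            stale.append((flag, fields[1]))
    return stale


# Example use from the repository root:
#   import subprocess
#   out = subprocess.run(["git", "submodule", "status", "--recursive"],
#                        capture_output=True, text=True, check=True).stdout
#   print(stale_submodules(out))
```

An empty result suggests that git submodule update --init --recursive has nothing left to do.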

How did you build and install tensorrt-onnx and onnx?

mjsML commented 4 years ago

I used the steps mentioned below in the second script (the CI for docker): How to build mxnet with tensorrt support?

I have now deleted everything and started from scratch with the submodule update, and it is building now...

leezu commented 4 years ago

That script installs packages for Ubuntu 16.04, for example

wget -qO tensorrt.deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

That may cause your issues

mjsML commented 4 years ago

Sorry I wasn't clear. I didn't use the entire script ... I used the script steps below, because I got another error while building it and that fixed the error:


```
cmake \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_FLAGS=-I/usr/include/python${PYVER} \
    -DBUILD_SHARED_LIBS=ON .. \
    -G Ninja
ninja -j 1 -v onnx/onnx.proto
ninja -j 1 -v
export LIBRARY_PATH=`pwd`:`pwd`/onnx/:$LIBRARY_PATH
export CPLUS_INCLUDE_PATH=`pwd`:$CPLUS_INCLUDE_PATH

# Build ONNX-TensorRT
cd 3rdparty/onnx-tensorrt/
mkdir -p build
cd build
cmake \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    ..
make -j$(nproc)
export LIBRARY_PATH=`pwd`:$LIBRARY_PATH
```

mjsML commented 4 years ago

@leezu even with the submodule update the build still failed with the pthread and fopen64 errors. I'll try to build it with make instead of ninja and see what happens.

leezu commented 4 years ago

It seems the instructions and the process for building with TensorRT are lacking. How is your experience with the make build?

For cmake, I suggest we improve the process to build onnx and onnx-tensorrt automatically as part of the MXNet build, similar to openmp. @mjsML are you familiar with cmake and would like to contribute?

mjsML commented 4 years ago

@leezu The pthread errors had to do with some missing libs. Specifically, running apt with the packages from this page:

apt-get install --no-install-recommends \
    software-properties-common apt-transport-https \
    build-essential cmake libjemalloc-dev \
    libatlas-base-dev liblapack-dev liblapacke-dev libopenblas-dev libopencv-dev \
    libcurl4-openssl-dev libzmq3-dev ninja-build libhdf5-dev libomp-dev

fixed the errors related to those, but then the assert.h error inside MKLDNN didn't go away.

After a few hours of banging my head against the wall, I got to the root cause: MKLDNN has a few "header leaks", which means I had to manually go into the MKLDNN src and add #include <assert.h> and the like. That fixed the assert.h error, but then a whole bunch of other leaks sprang up because of improper referencing of MKLDNN's own internal headers all across the library.

I ended up ditching MKLDNN (by setting -DUSE_MKLDNN=0) because my main target is to get a fast GPU build that utilizes the Nvidia packages (mainly NCCL and TensorRT on x64 with my desktop training machine and TensorRT on aarch64 with Jetson Nano). When I have more time I'll pull on the Intel repo to fix the header issues, however I'm not sure how you sync the 3rd party folder with the source?

Now the build passes but the tests fail with linking errors. I did specify that I wanted to use MKL for BLAS, but I'm getting this when linking the examples / tests:

/usr/bin/ld: libmxnet.a(la_op.cc.o): undefined reference to symbol 'cblas_dtrsm'
//usr/lib/x86_64-linux-gnu/libblas.so.3: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

I then built the latest OpenBLAS (0.3.8) and updated the symbols by updating alternatives as per the instructions:

sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 /opt/OpenBLAS/lib/libopenblas.so.0 41 \
    --slave /usr/lib/liblapack.so.3 liblapack.so.3 /opt/OpenBLAS/lib/libopenblas.so.0

This was futile as well, as I'm still stuck with the same linking error. I'm at a loss as to why we need to link OpenBLAS in the first place if I built with MKL as my BLAS of choice.

Also sure I'd love to contribute to the build process or otherwise :) ... imho we need a few new build types actually ... my suggestions are Training and Inference by accelerator type. In this example, I'm building a CUDA training build (CUDA, cuDNN, NCCL, and TensorRT), while an Inference build would have CUDA, cuDNN and TensorRT only (in my case an edge accelerator on aarch64 too :/ ). [Insert accelerator type here] would need the same. In my mind this is like the "Debug" and "Release" configs in the ML world. Food for thought.
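
To make the suggestion concrete, here is a rough sketch of what such presets could look like when expressed as sets of the existing -DUSE_* options from the cmake command earlier in this thread. This is purely illustrative: the preset names, the `PRESETS` table and the `cmake_args` helper are hypothetical and do not exist in the MXNet build system.

```python
# Hypothetical presets built only from the USE_* options used in this thread.
PRESETS = {
    "training": {"USE_CUDA": 1, "USE_CUDNN": 1, "USE_NCCL": 1, "USE_TENSORRT": 1},
    "inference": {"USE_CUDA": 1, "USE_CUDNN": 1, "USE_NCCL": 0, "USE_TENSORRT": 1},
}


def cmake_args(preset, extra=None):
    """Expand a preset name into a list of -DNAME=VALUE cmake arguments.

    `extra` may add options or override the preset's values.
    """
    options = dict(PRESETS[preset])
    options.update(extra or {})
    return ["-D{}={}".format(name, value) for name, value in options.items()]


# Example: the GPU-only training build with MKLDNN switched off.
#   print(" ".join(["cmake"] + cmake_args("training", {"USE_MKLDNN": 0}) + ["-GNinja", ".."]))
```

Keeping the presets as plain data makes it easy to add another accelerator type later without touching the helper.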

leezu commented 4 years ago

@pengzhao-intel are you aware of the header leaks mentioned by @mjsML?

@mjsML I suggest you don't link with MKL for now to fix your problem given your "main target is to get a fast GPU build that utilizes the Nvidia packages". There may be a bug in the build setup with MKL + the nvidia libs you mention.

> however I'm not sure how you sync the 3rd party folder with the source?

The 3rdparty folder is updated from time to time when new versions of MKL-DNN are released. Such a 3rdparty update may also go together with an update to the mxnet source code to implement changes required by API changes in MKL-DNN.

> Also sure I'd love to contribute to the build process or otherwise :) ... imho we need a few new build types actually ... my suggestions are Training and Inference by accelerator type

Great. The first step would be to integrate the tensorrt build into the mxnet cmake build (assuming tensorrt has sufficiently good cmake support). Providing recommended "Training and Inference by accelerator type" build configurations and making them easy to build would be great.

mjsML commented 4 years ago

@pengzhao-intel as a concrete example, /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_threading.hpp and /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_pack_storage.hpp are missing the assert.h includes while using them ... I suspect the other internal header leaks (for example, includes of utils.hpp) are veneered by build / compiler arguments / include directories that neither my compiler nor I were able to locate ...

mjsML commented 4 years ago

@leezu I'll try to build without MKL and see what happens then.

mjsML commented 4 years ago

@leezu Success!

>>> from mxnet.runtime import feature_list
>>> feature_list()
[✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✔ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✔ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]

So it seems that MKL and MKLDNN with NVIDIA is the culprit here ... I'll pull on the build guides later tomorrow then ... Cheers and thanks for the help :) *falls asleep*
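
For anyone who wants to check a build like this in a script rather than by eye, the printed list can be turned into a mapping and compared against the features a build is supposed to have. The sketch below only parses the text shown above; the helper names `parse_feature_list` and `missing_features` are hypothetical. As far as I know, mxnet.runtime also offers Features().is_enabled(name), which avoids parsing the printed form altogether.

```python
def parse_feature_list(printed):
    """Turn the printed form '[✔ CUDA, ✖ MKLDNN, ...]' into {name: bool}."""
    features = {}
    for item in printed.strip().strip("[]").split(","):
        mark, _, name = item.strip().partition(" ")
        if name:
            features[name] = (mark == "✔")
    return features


def missing_features(features, required):
    """Return the required feature names that are absent or disabled."""
    return [name for name in required if not features.get(name, False)]


# Example with an excerpt of the output above:
#   feats = parse_feature_list("[✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ TENSORRT, ✖ MKLDNN]")
#   print(missing_features(feats, ["CUDA", "CUDNN", "NCCL", "TENSORRT"]))
```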

pengzhao-intel commented 4 years ago

@wuxun-zhang for looking into the issue

pengzhao-intel commented 4 years ago

> @pengzhao-intel as a concrete example, /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_threading.hpp and /mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm_pack_storage.hpp are missing the assert.h includes while using them ... I suspect the other internal header leaks (for example, includes of utils.hpp) are veneered by build / compiler arguments / include directories that neither my compiler nor I were able to locate ...

Thanks for the information. We will investigate and fix the problem.

mseth10 commented 4 years ago

@mxnet-label-bot add [MKL]

bgawrych commented 3 years ago

Hey, I can't reproduce the issue with the assert.h headers on the 1.x branch. However, I was able to reproduce the issue with linking the tests. It can be solved by passing the full path to libmkl_rt.so in MKL_RT_LIBRARY: when I passed -DMKL_RT_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_rt.so, everything worked fine.
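
Note that the original cmake command passed a directory (-DMKL_RT_LIBRARY=/home/mj/intel/mkl/lib) rather than a library file. A small helper along these lines could turn either form into a full file path before calling cmake; it is only a sketch, the name `resolve_mkl_rt` is made up, and it assumes the library is called libmkl_rt.so somewhere under the given directory.

```python
from pathlib import Path


def resolve_mkl_rt(path, lib_name="libmkl_rt.so"):
    """Return the full path to the MKL runtime library, or None if not found.

    `path` may already point at the library file, or at a directory that
    contains it (possibly in a subdirectory such as intel64/).
    """
    candidate = Path(path)
    if candidate.is_file():
        return candidate
    if candidate.is_dir():
        # rglob searches the directory and everything below it.
        matches = sorted(candidate.rglob(lib_name))
        if matches:
            return matches[0]
    return None


# Example use with the directory from the original report:
#   lib = resolve_mkl_rt("/home/mj/intel/mkl/lib")
#   if lib is not None:
#       print("-DMKL_RT_LIBRARY={}".format(lib))
```

If several copies exist, the first one in sorted order is returned, so it is worth printing the result and confirming it is the intended architecture.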

@szha can we close this issue? I think the issue is solved in newer oneDNN versions.