apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.77k stars 6.79k forks source link

Fail to build mxnet from source #17181

Closed apeforest closed 4 years ago

apeforest commented 4 years ago

Description

Following the steps at https://mxnet.apache.org/get_started/ubuntu_setup I got the following error:

Error Message

OSError: /usr/lib/liblapack.so.3: undefined symbol: gotoblas

To Reproduce

    rm -rf build
    mkdir -p build && cd build
    cmake -GNinja \
        -DUSE_CUDA=OFF \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
        -DCMAKE_C_COMPILER_LAUNCHER=ccache \
        -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
        -DCMAKE_BUILD_TYPE=Release \
    ..
    ninja

then

python -c "import mxnet"

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/ubuntu/.virtualenvs/mxnet/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-1094-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-15-220
release      : 4.4.0-1094-aws
version      : #105-Ubuntu SMP Mon Sep 16 13:08:01 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.09
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0131 sec, LOAD: 0.4585 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0005 sec, LOAD: 0.4164 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.2340 sec, LOAD: 0.3821 sec.
Timing for D2L: http://d2l.ai, DNS: 0.1774 sec, LOAD: 0.1042 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0059 sec, LOAD: 0.2229 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0982 sec, LOAD: 0.1264 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0173 sec, LOAD: 0.3984 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0106 sec, LOAD: 0.0637 sec.
samskalicky commented 4 years ago

@apeforest assign [@apeforest ]

mjsML commented 4 years ago

This fixed it for me. You basically need to update the OpenBLAS libs to latest with a the dynamic architecture flag ...

apeforest commented 4 years ago

@mjsML Thanks. Issue fixed.

leezu commented 4 years ago

@apeforest what platform are you on? Ideally the installation instructions should be updated based on this issue

apeforest commented 4 years ago

@leezu I was using the latest AWS DLAMI. I will verify once more.

leezu commented 4 years ago

There are both 16.04 and 18.04 versions of DLAMI. Based on your kernel version, you are using 16.04. 18.04 works fine for me.

If 16.04 ships with problematic openblas, let's add a Note to the install guide.

mjsML commented 4 years ago

This issue happens on 18.04 too (this is my OS at the moment) ... I think it depends which subversion of ubuntu you have ... This is a very old install of 18.04 (probably 18.04.00 or the likes) ... and given I have not updated OpenBLAS explicitly it happened on 18.04 too .. according to the build guide from OpenBLAS the core issue is some symbols don't make it if you don't compile with the dynamicarch flag (so even if you have the most recent build but you didn't use this flag you'd get the same error.)

larroy commented 4 years ago

Which branch is this? master or 1.6.x?

mjsML commented 4 years ago

@larroy master ... a similar issue also occurs when you chose MKL as BLAS too (the linker complains on another symbol tho) ...

larroy commented 4 years ago

what's the root cause of this? This is something we should add to CI if users are facing these difficulties.

mjsML commented 4 years ago

agreed ... as I mentioned above it's a well documented issue with OpenBLAS and debian based distros.

Debian and Ubuntu LTS versions provide OpenBLAS package which is not updated after initial release, and under circumstances one might want to use more recent version of OpenBLAS e.g. to get support for newer CPUs

Ubuntu and Debian provides 'alternatives' mechanism to comfortably replace BLAS and LAPACK libraries systemwide.

After successful build of OpenBLAS (with DYNAMIC_ARCH set to 1)

$ make clean
$ make DYNAMIC_ARCH=1
$ sudo make DYNAMIC_ARCH=1 install

Basically the symbol mentioned above is missing from the default binaries that ships with debian based distros, so OpenBLAS needs to be rebuilt from latest source with this flag to force the compiler to output the symbol that the linker here is looking for(in this instance the old "gotoblas" symbol).

leezu commented 4 years ago

@apeforest @mjsML please provide the output of your cmake commands. It's unclear whether MKL is found during your compilation attempt based on the information provided.

I can't reproduce this issue both when linking against MKL as well as linking against OpenBLAS on a blank Ubuntu 18.04 system.

Are you compiling inside conda?

@mjsML we don't require updating OpenBLAS. This is unrelated.

apeforest commented 4 years ago

Okay if I follow all the instructions on the web to apt update all required packages, I no longer face the issue. I guess it's probably I messed up cmake in my old enviroment. @mjsML if you also don't see this issue any more, I guess we can close it. Thanks @leezu for investigation.

leezu commented 4 years ago

I'll close this issue for now, please reopen @mjsML if you can still reproduce this.