apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars 6.79k forks source link

Can't run horovod with latest nightly wheel #17292

Closed apeforest closed 4 years ago

apeforest commented 4 years ago

Description

Cannot run horovod with latest nightly wheel. It could mean the 1.6 release have same problem too. The last working nightly wheel was 12/30/2019

Error Message

OSError: /home/ubuntu/.virtualenvs/build/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _Z20MXAPIHandleExceptionRKSt9exception

To Reproduce\

Steps to reproduce

pip install https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-30/dist/mxnet_cu92-1.6.0b20191230-py2.py3-none-manylinux1_x86_64.whl
pip install horovod
python -c "import horovod.mxnet as hvd"

What have you tried to solve it?

  1. Tried mxnet-cu100 nightly wheel on 12/30/2019. It worked
  2. Tried mxnet-cu92 nightly wheel on 12/30/2019. It worked

Environment

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/ubuntu/.virtualenvs/build/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/.virtualenvs/build/lib/python3.6/site-packages/mxnet
Num GPUs     : 8
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-1096-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-20-50
release      : 4.4.0-1096-aws
version      : #107-Ubuntu SMP Thu Oct 3 01:51:58 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               1200.132
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0028 sec, LOAD: 0.4811 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0005 sec, LOAD: 0.6839 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1973 sec, LOAD: 0.4174 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0132 sec, LOAD: 0.1008 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.1881 sec, LOAD: 0.0662 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0567 sec, LOAD: 0.1567 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0027 sec, LOAD: 0.3755 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0016 sec, LOAD: 0.0658 sec.
stephenrawls commented 4 years ago

Saw this on email list and got curious...

Looks like problem is probably this commit: https://github.com/apache/incubator-mxnet/commit/4ed14e2b749743a014121f57b265675fa7b4c06d#diff-875aa4c013dbd73b044531e439e8afdd

Basically MXAPIHandleException used to be defined inline in the header file, so all consumers had to do was

#include <mxnet/c_api_error.h>

But now as of 3 days ago it is not an inline function anymore. Meaning that consumers need to make sure to link against c_api_error.o to get the symbol.

I don't know enough about the build system that produces these nightly builds (does it use the CMake one or the Makefile one?) ... but my hunch would be that either c_api_error.o is not getting built into libmxnet.so. Or somehow it is, but the order it is presented to the linker is before MXAPIHandleException is used, so that symbol isn't included in libmxnet.so.

leezu commented 4 years ago

@stephenrawls currently a Makefile build is used. You can find it at https://github.com/apache/incubator-mxnet/tree/master/tools/staticbuild We are working on migrating to the cmake build in the future though.

eric-haibin-lin commented 4 years ago

@szha

apeforest commented 4 years ago

Thanks @stephenrawls for the analysis. Here are the causes of the problem:

1) Horovod uses MX_API_BEGIN() and MX_API_END() from mxnet/c_api_error.h to catch and throw errors in horovod APIs: https://github.com/horovod/horovod/blob/master/horovod/mxnet/mpi_ops.cc#L224 2) MX_API_BEGIN() is a macro that calls MXAPIHandleException https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/c_api_error.h#L36 3) Before #17128, MXAPIHandleException is an inline function. And therefore when #17128 introduced a new function call NormalizeError() inside MXAPIHandleException it broke Horovod integration because the symbol of NormalizeError is not whitelist by MXNet distribution. 4) #17298 removed NormalizeError() from MXAPIHandleException and made it not inline. https://github.com/apache/incubator-mxnet/pull/17208/files#diff-875aa4c013dbd73b044531e439e8afddR67. This time the error becomes undefined symbol of MXAPIHandleException.

So to summarize, the problem is not that Horovod requires MXAPIHandleException function to be inline. The rootcause is that MXNet did not export the symbol *MXAPIHandleException* in its whitelist, but only the symbols that are being used inside MXAPIHandleException function. It was okay when the function MXAPIHandleException is inline, but became a problem when it's not. A good practice is to whitelist symbol *MXAPIHandleException* instead of its internals.

@szha I will create a PR to fix this.