apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.77k stars 6.79k forks source link

mxnet-cu101mkl==1.6.0b20191006 cannot be used with horovod via pip install #17119

Closed feevos closed 4 years ago

feevos commented 4 years ago

Description

Installed mxnet-cu101mkl==1.6.0b20191006 via pip, then installed (via pip) horovod 0.18.2

HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1  HOROVOD_WITH_MXNET=1 HOROVOD_CUDA_HOME=$CUDA_HOME HOROVOD_NCCL_HOME=$NCCL_HOME  pip install -t . --no-cache-dir horovod==0.18.2

After that, I cannot use:

import horovod.mxnet as hvd

Error Message

I am getting the error:

In [1]: import horovod.mxnet as hvd                                                                                                                                                                                                                                                         
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-10f17278b226> in <module>
----> 1 import horovod.mxnet as hvd

/scratch1/dia021/Software/horovod/mxnet/__init__.py in <module>
     23                 __file__, 'mpi_lib')
     24 
---> 25 from horovod.mxnet.mpi_ops import allgather
     26 from horovod.mxnet.mpi_ops import allreduce, allreduce_
     27 from horovod.mxnet.mpi_ops import broadcast, broadcast_

/scratch1/dia021/Software/horovod/mxnet/mpi_ops.py in <module>
     27 from horovod.common.util import get_ext_suffix
     28 from horovod.common.basics import HorovodBasics as _HorovodBasics
---> 29 _basics = _HorovodBasics(__file__, 'mpi_lib')
     30 
     31 # import basic methods

/scratch1/dia021/Software/horovod/common/basics.py in __init__(self, pkg_path, *args)
     25     def __init__(self, pkg_path, *args):
     26         full_path = util.get_extension_full_path(pkg_path, *args)
---> 27         self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
     28 
     29     def init(self, comm=None):

/scratch1/dia021/Software/anaconda3/lib/python3.7/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    362 
    363         if handle is None:
--> 364             self._handle = _dlopen(self._name, mode)
    365         else:
    366             self._handle = handle

OSError: /scratch1/dia021/Software/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet12on_enter_apiEPKc

To Reproduce

import horovod.mxnet as hvd

What have you tried to solve it?

I've installed a previous version of mxnet, namely: mxnet-cu101mkl 1.6.0b20190811 and now horovod is working. I don't know in which version after this one horovod integration breaks.

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
----------Python Info----------
Version      : 3.7.4
Compiler     : GCC 7.3.0
Build        : ('default', 'Aug 13 2019 20:35:49')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.2.3
Directory    : /scratch1/dia021/Software/anaconda3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /scratch1/dia021/Software/mxnet
Num GPUs     : 4
Commit Hash   : 614cba3042567fd8de84b9cd8122feabb9e85b5c
----------System Info----------
Platform     : Linux-4.12.14-95.32-default-x86_64-with-SuSE-12-x86_64
system       : Linux
node         : bracewell-i1
release      : 4.12.14-95.32-default
version      : #1 SMP Wed Sep 18 05:00:42 UTC 2019 (5fbcb00)
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-27
Off-line CPU(s) list:  28-55
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:              1
CPU MHz:               2599.858
BogoMIPS:              5199.71
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0034 sec, LOAD: 0.6963 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0008 sec, LOAD: 0.5527 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.2734 sec, LOAD: 0.9155 sec.
Timing for D2L: http://d2l.ai, DNS: 0.3147 sec, LOAD: 1.7114 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0076 sec, LOAD: 0.1963 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1486 sec, LOAD: 0.7043 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0075 sec, LOAD: 0.8764 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0087 sec, LOAD: 0.3575 sec.
leezu commented 4 years ago

Duplicate https://github.com/apache/incubator-mxnet/issues/16193# You can use the builds from https://repo.mxnet.io/dist/2019-12-18/dist/mxnet_cu101mkl-1.6.0b20191218-py2.py3-none-manylinux1_x86_64.whl

Also see https://lists.apache.org/thread.html/cba5e34a59af5e3ce99bbad912341efef32fb409c022d9239aa2994a@%3Cdev.mxnet.apache.org%3E why nightly builds moved.