apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

/usr/lib64/libc.so.6 causes an mxnet segmentation fault. #18295

Closed TriLoo closed 4 years ago

TriLoo commented 4 years ago

Description

When using the cpp-package to run inference, a dmlc::Error was thrown.

Error Message

Output of bt in gdb:

#0  0x00007fe0287f65f7 in raise () from /usr/lib64/libc.so.6
#1  0x00007fe0287f7ce8 in abort () from /usr/lib64/libc.so.6
#2  0x00007fe0290fb9d5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00007fe0290f9946 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007fe0290f8909 in ?? () from /usr/lib64/libstdc++.so.6
#5  0x00007fe0290f9574 in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6
#6  0x00007fe028b92903 in ?? () from /usr/lib64/libgcc_s.so.1
#7  0x00007fe028b92c9b in _Unwind_RaiseException () from /usr/lib64/libgcc_s.so.1
#8  0x00007fe0290f9b86 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9  0x00007fe035a5c148 in void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*) () from /search/odin/songminghui/githubs/incubator-mxnet/lib/libmxnet.so
#10 0x00007fe035a5d163 in mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int) () from /search/odin/songminghui/githubs/incubator-mxnet/lib/libmxnet.so
#11 0x00007fe035a73bbf in void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) ()
   from /search/odin/songminghui/githubs/incubator-mxnet/lib/libmxnet.so
#12 0x00007fe035a73e0e in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () from /search/odin/songminghui/githubs/incubator-mxnet/lib/libmxnet.so
#13 0x00007fe035a6f0ca in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () from /search/odin/songminghui/githubs/incubator-mxnet/lib/libmxnet.so
#14 0x00007fe04279649f in execute_native_thread_routine () from /search/odin/songminghui/githubs/incubator-mxnet/lib/libmxnet.so
#15 0x00007fe0293aadc5 in start_thread () from /usr/lib64/libpthread.so.0
#16 0x00007fe0288b7ced in clone () from /usr/lib64/libc.so.6 

To Reproduce

Steps to reproduce


  1. Download mxnet at commit head de510582438ad5fad576eba1b85c845b0ba9989c.
  2. Build it with cmake, with mkl and mkldnn enabled.
  3. Run my own C++ inference code; the error then occurs.

What have you tried to solve it?

  1. Changed gcc/g++ to the same version used to build mxnet (gcc 7.3.1); this did not help.

Environment

  1. CentOS 7.2.1511
  2. gcc/g++: 7.3.1
  3. CUDA: 10.0
----------Python Info----------
Version      : 3.6.8
Compiler     : GCC 7.3.0
Build        : ('default', 'Dec 30 2018 01:22:34')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /search/odin/songminghui/githubs/incubator-mxnet/python/mxnet
Num GPUs     : 8
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.2.1511-Core
system       : Linux
node         : nmyjs_176_61
release      : 3.10.0-327.el7.x86_64
version      : #1 SMP Thu Nov 19 22:10:57 UTC 2015
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2195.104
BogoMIPS:              4398.47
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47
----------Network Test----------
Setting timeout: 10
Error open MXNet: https://github.com/apache/incubator-mxnet, <urlopen error timed out>, DNS finished in 1.2967002391815186 sec.
Error open GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, <urlopen error timed out>, DNS finished in 4.1484832763671875e-05 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 1.1775 sec, LOAD: 3.5363 sec.
Timing for D2L: http://d2l.ai, DNS: 0.4321 sec, LOAD: 0.8394 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.3948 sec, LOAD: 1.0763 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.4420 sec, LOAD: 2.9380 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.1293 sec, LOAD: 13.5263 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.47435927391052246 sec.

leezu commented 4 years ago

Did you call MXNotifyShutdown() before your program exits? See https://github.com/apache/incubator-mxnet/commit/afb750570e71545b7b58d7374f586b21156d36d8

TriLoo commented 4 years ago

A wrong input caused this error, and it is fixed now. Thanks, @leezu.