apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Finetune Error: Segment Fault (core dumped) #6361

Closed ysh329 closed 6 years ago

ysh329 commented 7 years ago

I am fine-tuning a pretrained resnet-200 on my own data, following the fine-tuning tutorial code in the docs.

yuanshuai@linux-W580-G20:~/code/mxnet_inference/ccs/finetune-models$ python run_finetune.py './resnet-200/resnet-200' '0' './resnt-200/finetune-resnet-200-train-add-seg-224' '30'
[11:44:01] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[11:44:01] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
[11:44:01] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2: /home/yuanshuai/code/mxnet/example/image-classification/data/ccs-train-add-seg-224_train.rec, use 4 threads for decoding..
[11:44:02] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2: /home/yuanshuai/code/mxnet/example/image-classification/data/ccs-train-add-seg-224_val.rec, use 4 threads for decoding..
[11:44:11] src/kvstore/././comm.h:304: only 2 out of 6 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[11:44:11] src/kvstore/././comm.h:313: ...
[11:44:11] src/kvstore/././comm.h:313: ..v
[11:44:11] src/kvstore/././comm.h:313: .v.
[11:44:11] /home/yuanshuai/code/mxnet/dmlc-core/include/dmlc/logging.h:303: [11:44:11] src/operator/./convolution-inl.h:109: Check failed: req[conv::kOut] == kWriteTo (0 vs. 1) 

Stack trace returned 8 entries:
[bt] (0) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f213b2aac6c]
[bt] (1) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet2op13ConvolutionOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD_+0xcf) [0x7f213c8b6bef]
[bt] (2) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(+0xf26186) [0x7f213bc1f186]
[bt] (3) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x8c) [0x7f213bc09c6c]
[bt] (4) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f213bc0c9f0]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f2179481a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7f218388d184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f21835babed]

[11:44:11] /home/yuanshuai/code/mxnet/dmlc-core/include/dmlc/logging.h:303: [11:44:11] src/engine/./threaded_engine.h:329: [11:44:11] src/operator/./convolution-inl.h:109: Check failed: req[conv::kOut] == kWriteTo (0 vs. 1) 

Stack trace returned 8 entries:
[bt] (0) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f213b2aac6c]
[bt] (1) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet2op13ConvolutionOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD_+0xcf) [0x7f213c8b6bef]
[bt] (2) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(+0xf26186) [0x7f213bc1f186]
[bt] (3) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x8c) [0x7f213bc09c6c]
[bt] (4) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f213bc0c9f0]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f2179481a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7f218388d184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f21835babed]

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 6 entries:
[bt] (0) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f213b2aac6c]
[bt] (1) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x376) [0x7f213bc09f56]
[bt] (2) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f213bc0c9f0]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f2179481a60]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7f218388d184]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f21835babed]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [11:44:11] src/engine/./threaded_engine.h:329: [11:44:11] src/operator/./convolution-inl.h:109: Check failed: req[conv::kOut] == kWriteTo (0 vs. 1) 

Stack trace returned 8 entries:
[bt] (0) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f213b2aac6c]
[bt] (1) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet2op13ConvolutionOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD_+0xcf) [0x7f213c8b6bef]
[bt] (2) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(+0xf26186) [0x7f213bc1f186]
[bt] (3) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x8c) [0x7f213bc09c6c]
[bt] (4) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f213bc0c9f0]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f2179481a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7f218388d184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f21835babed]

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 6 entries:
[bt] (0) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f213b2aac6c]
[bt] (1) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x376) [0x7f213bc09f56]
[bt] (2) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f213bc0c9f0]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f2179481a60]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7f218388d184]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f21835babed]

Aborted (core dumped)
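
The engine message in the log above suggests setting MXNET_ENGINE_TYPE to NaiveEngine so that operations run synchronously and the backtrace points at the failing call. A minimal sketch of doing that from Python follows. The helper names are hypothetical, and setting the variable before mxnet is imported is an assumption made here to be safe, not something the log states.

```python
import os


def enable_naive_engine(env=os.environ):
    """Ask MXNet for its synchronous engine, as the log message suggests.

    Hypothetical helper: call it before `import mxnet` (assumed to be the
    safe point, since the engine is chosen when the library starts up).
    """
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env


def restore_default_engine(env=os.environ):
    """Unset the variable again after debugging, as the log message advises."""
    env.pop("MXNET_ENGINE_TYPE", None)
    return env


# Typical use in a debugging session:
#   enable_naive_engine()
#   import mxnet as mx      # import only after the variable is set
#   ... run the failing fine-tune step, ideally under gdb ...
#   restore_default_engine()
```

The same effect can be had by exporting the variable in the shell before starting python; the helpers only make the order explicit.
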
piiswrong commented 7 years ago

@tqchen

ysh329 commented 7 years ago

@piiswrong In addition, I tested my other code (for inference). The segmentation fault (core dumped) also appears there when using cv2, for example cv2.imread or cv2.resize. After I replaced cv2.imread and cv2.resize with skimage.io.imread and skimage's resize, it works.

ysh329 commented 7 years ago

@piiswrong It seems to be related to libcudart: OSError: libcudart.so.8.0: cannot open shared object file: No such file or directory. However, when I run ldd on my libmxnet.so file, it finds the libcudart file without problems.

yuanshuai@linux-W580-G20:~/code/mxnet_inference/ccs/finetune-models$ sudo ./run_finetune_script.sh 
[resnet-200-train-add-seg-224-lr-0.01]
Traceback (most recent call last):
  File "run_finetune.py", line 16, in <module>
    import mxnet as mx
  File "/usr/local/lib/python2.7/dist-packages/mxnet/__init__.py", line 7, in <module>
    from .base import MXNetError
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 43, in <module>
    _LIB = _load_lib()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 35, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
  File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.8.0: cannot open shared object file: No such file or directory
yuanshuai@linux-W580-G20:~/code/mxnet_inference/ccs/finetune-models$ ldd ~/code/mxnet/lib/libmxnet.so | grep cudart
        libcudart.so.8.0 => /usr/local/cuda-8.0/lib64/libcudart.so.8.0 (0x00007f043efd7000)
yuanshuai@linux-W580-G20:~/code/mxnet_inference/ccs/finetune-models$ 

But in fact, I don't think libcudart is the cause. I think this problem is directly related to OpenCV.
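
The two commands above are not quite comparable: the failing run goes through sudo and loads mxnet from /usr/local/lib/python2.7/dist-packages, while ldd inspects ~/code/mxnet/lib/libmxnet.so as the normal user. One possible explanation, which is an assumption and not confirmed in this thread, is that sudo does not pass LD_LIBRARY_PATH through, so the loader cannot find the CUDA libraries. A rough sketch for checking this from Python, using the same ctypes.CDLL call that appears in the traceback (the function names are hypothetical):

```python
import ctypes
import os


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name in this process."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Same exception type as in the traceback above.
        return False


def report(lib_name="libcudart.so.8.0"):
    """Collect what this process sees: loadability and LD_LIBRARY_PATH."""
    return {
        "library": lib_name,
        "loadable": can_load(lib_name),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
    }


if __name__ == "__main__":
    print(report())
```

Running this once as the normal user and once with sudo would show whether the two environments differ.
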

fighting-liu commented 7 years ago

The same problem happens to me!

fighting-liu commented 7 years ago

Importing cv2 before importing mxnet solves this issue.
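
A small sketch of that workaround: the GOOD string shows the import order described above, and a hypothetical static check (not part of mxnet or OpenCV) reports whether a script follows it. The check compares line numbers only, so it is a heuristic and does not see imports that happen indirectly through other modules.

```python
import ast

GOOD = "import cv2\nimport mxnet as mx\n"   # order suggested in this thread
BAD = "import mxnet as mx\nimport cv2\n"    # order that crashed here


def first_import_line(source, module):
    """Return the line number of the first import of `module`, or None."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            lines.append(node.lineno)
    return min(lines) if lines else None


def cv2_before_mxnet(source):
    """True if cv2 is imported on an earlier line than mxnet."""
    cv2_line = first_import_line(source, "cv2")
    mx_line = first_import_line(source, "mxnet")
    if cv2_line is None or mx_line is None:
        return True  # one of the two is not imported, nothing to order
    return cv2_line < mx_line
```

For example, cv2_before_mxnet(open("run_finetune.py").read()) could be used to check the script from the first post.
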

ysh329 commented 7 years ago

@fighting-liu Thanks a lot, brother. I took a different approach and fixed it by using Docker.

ysh329 commented 6 years ago

temporary solution