apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6 #19717

Open Justobe opened 3 years ago

Justobe commented 3 years ago

Description

MXNet throws an exception when I build my model with MXNet as the Keras backend. The same script runs successfully on other Keras backends (such as TensorFlow and CNTK). I further found that the problem appears to be caused by a batch normalization layer in the model when using MXNet. This issue was also reported in #15721, but it still exists in the latest keras-mxnet 2.2.4.3 with mxnet-cu101 1.7.0.

Error Message

Traceback (most recent call last):
  File "crash_checker.py", line 67, in <module>
    model.add(Dense(10, activation='softmax'))
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/engine/sequential.py", line 181, in add
    output_tensor = layer(self.outputs[0])
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/engine/base_layer.py", line 470, in __call__
    output = self.call(inputs, **kwargs)
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/layers/core.py", line 893, in call
    output = K.bias_add(output, self.bias, data_format='channels_last')
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 94, in func_wrapper
    train_symbol = func(*args, **kwargs)
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 3982, in bias_add
    x_dim = ndim(x)
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 535, in ndim
    shape = x.shape
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 4395, in shape
    return self._get_shape()
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 4404, in _get_shape
    _, out_shape, _ = self.symbol.infer_shape_partial()
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1177, in infer_shape_partial
    return self._infer_shape_impl(True, *args, **kwargs)
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1265, in _infer_shape_impl
    ctypes.byref(complete)))
  File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6: [16:26:44] include/mxnet/./tuple.h:245: Check failed: i >= 0 && i < ndim(): index = -2 must be in range [0, -1)
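
To make the last line of the traceback easier to read, here is a minimal sketch that pulls out the failing operator name, the requested index, and the upper bound printed by the check. The helper `parse_shape_check_failure` and its regular expression are hypothetical and not part of MXNet or Keras. The printed upper bound of -1 looks like the value of `ndim()` at the time of the check, which would mean no index can pass. Reading that as "the input shape was still unknown" is an assumption, not something confirmed in this thread.

```python
import re

# Final line of the traceback, copied verbatim from the report.
ERROR_LINE = (
    "mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6: "
    "[16:26:44] include/mxnet/./tuple.h:245: Check failed: "
    "i >= 0 && i < ndim(): index = -2 must be in range [0, -1)"
)

# Hypothetical pattern for this one message format; other MXNet errors
# will not match it.
PATTERN = re.compile(
    r"Error in operator (?P<op>\w+):.*"
    r"index = (?P<index>-?\d+) must be in range \[0, (?P<ndim>-?\d+)\)"
)


def parse_shape_check_failure(line):
    """Return operator name, index, and reported bound, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "operator": match.group("op"),
        "index": int(match.group("index")),
        "ndim": int(match.group("ndim")),
    }


info = parse_shape_check_failure(ERROR_LINE)
print(info)
```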

To Reproduce

The following script reproduces the bug. It takes the name of the Keras backend as its first command-line argument:

import os
import sys
bk = sys.argv[1]
os.environ['KERAS_BACKEND'] = bk
from keras import backend as K
import keras

from keras.models import Sequential
from keras.layers.core import Dense
from keras.layers import Conv2D,MaxPooling2D,BatchNormalization,Flatten,Dropout

model = Sequential()

model.add(Conv2D(96, (3,3), strides=(2,2), activation='relu', padding='same', input_shape=(32, 32, 3,)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2,2)))
# The original AlexNet uses local response normalization here; this script uses BatchNormalization
model.add(BatchNormalization())

model.add(Conv2D(96, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(BatchNormalization())

model.add(Conv2D(192, (3,3), activation='relu', padding='same'))
model.add(Conv2D(192, (3,3), activation='relu', padding='same'))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(BatchNormalization())

model.add(Flatten())
model.add(Dense(512, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='tanh'))

# With mxnet, the script runs successfully if the next line is commented out.
# With tensorflow and cntk, the script runs successfully as written.
model.add(BatchNormalization())

model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# print the model summary
model.summary()

Steps to reproduce

python myscript.py mxnet (replace mxnet with tensorflow or cntk to test another backend)
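
The comparison across backends can be automated. The following is a rough sketch, assuming the reproduction script is saved as `myscript.py` and that each backend is installed in the same environment. The helper names `build_command` and `run_backends` are made up for illustration.

```python
import subprocess
import sys


def build_command(script, backend, python=sys.executable):
    """Build the command line shown in the steps above for one backend."""
    return [python, script, backend]


def run_backends(script, backends):
    """Run the script once per backend; map backend name to True on exit code 0."""
    results = {}
    for backend in backends:
        proc = subprocess.run(
            build_command(script, backend),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # spelled this way for Python 3.6
        )
        results[backend] = proc.returncode == 0
    return results
```

Calling `run_backends("myscript.py", ["mxnet", "tensorflow", "cntk"])` would return a dictionary showing which backends exit cleanly. Based on the report above, only the `mxnet` entry is expected to be `False`.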

Environment

Package             Version
------------------- -------------------
cached-property     1.5.2
certifi             2020.12.5
chardet             4.0.0
cycler              0.10.0
decorator           4.4.2
graphviz            0.8.4
h5py                2.10.0
idna                2.10
Keras-Applications  1.0.8
keras-mxnet         2.2.4.3
Keras-Preprocessing 1.1.2
kiwisolver          1.3.1
matplotlib          3.2.2
mxnet-cu101         1.7.0
networkx            2.5
numpy               1.19.4
pandas              0.23.0
Pillow              5.1.0
pip                 20.3.3
pyparsing           2.4.7
python-dateutil     2.8.1
pytz                2020.5
PyWavelets          1.1.1
PyYAML              5.3.1
redis               3.3.2
requests            2.25.1
scikit-image        0.13.1
scikit-learn        0.19.1
scipy               1.1.0
setuptools          51.0.0.post20201207
six                 1.15.0
urllib3             1.26.2
wheel               0.36.2
szha commented 3 years ago

cc @sandeep-krishnamurthy

yangshuo0323 commented 3 years ago

I see you trained your model with MXNet 1.7.0. I want to train BERT on multiple GPUs and have a separate question for you. Have you run into this problem?

[1,4]<stderr>:===================
[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,5]<stderr>:==== backtrace ====
[1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,6]<stderr>:==== backtrace ====
[1,5]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
[1,5]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
[1,5]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
[1,5]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f428d022564]
[1,5]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f428d025790]
[1,5]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f428d01ded1]
[1,5]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f428cff89d4]
[1,5]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f410243a18f]
[1,5]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f4102431d84]
[1,5]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f42e9da49dd]
[1,5]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f42e9da4067]
[1,5]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f42eafd527e]
[1,5]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f42eafd5cb4]
[1,5]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x564d0453c00b]
[1,5]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
[1,5]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x564d0453420b]
[1,5]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
[1,5]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
[1,5]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
[1,5]<stderr>:   26  python(+0x22bf44) [0x564d045faf44]
[1,5]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
[1,5]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
[1,5]<stderr>:   29  python(+0x2375d5) [0x564d046065d5]
[1,5]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x564d046066fc]
[1,5]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f42ea9c4840]
[1,5]<stderr>:   32  python(+0x1dc3c0) [0x564d045ab3c0]
[1,5]<stderr>:===================
[1,6]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
[1,6]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
[1,6]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
[1,6]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f1c08cd5564]
[1,6]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f1c08cd8790]
[1,6]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f1c08cd0ed1]
[1,6]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f1c08cab9d4]
[1,6]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f1a7e0e118f]
[1,6]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f1a7e0d8d84]
[1,6]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1c65a579dd]
[1,6]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1c65a57067]
[1,6]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f1c66c8827e]
[1,6]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f1c66c88cb4]
[1,6]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x562df52e800b]
[1,6]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
[1,6]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x562df52e020b]
[1,6]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
[1,6]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
[1,6]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x562df52911fc]
[1,6]<stderr>:   26  python(+0x22bf44) [0x562df53a6f44]
[1,6]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
[1,6]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
[1,6]<stderr>:   29  python(+0x2375d5) [0x562df53b25d5]
[1,6]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x562df53b26fc]
[1,6]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f1c66677840]
[1,6]<stderr>:   32  python(+0x1dc3c0) [0x562df53573c0]
[1,6]<stderr>:===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node node106 exited on signal 11 (Segmentation fault).
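
The backtraces from ranks 5 and 6 are interleaved in the log above, which makes them hard to follow. The sketch below regroups the lines by rank, assuming every tagged line has the form `[job,rank]<stream>:text` as in this log. The helper `split_by_rank` is hypothetical, and lines without such a prefix (for example the `mpirun` summary) are skipped.

```python
import re
from collections import defaultdict

# Matches prefixes such as "[1,5]<stderr>:" seen in the log.
TAG = re.compile(
    r"^\[(?P<job>\d+),(?P<rank>\d+)\]<(?P<stream>stdout|stderr)>:(?P<text>.*)$"
)


def split_by_rank(lines):
    """Group tagged log lines by rank, preserving their order within a rank."""
    per_rank = defaultdict(list)
    for line in lines:
        match = TAG.match(line.rstrip("\n"))
        if match:
            per_rank[int(match.group("rank"))].append(match.group("text"))
    return dict(per_rank)


# A few lines copied from the log for illustration.
sample = [
    "[1,5]<stderr>:==== backtrace ====",
    "[1,6]<stderr>:==== backtrace ====",
    "[1,5]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]",
    "[1,6]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]",
    "mpirun noticed that process rank 7 with PID 0 on node node106 exited",
]
grouped = split_by_rank(sample)
for rank in sorted(grouped):
    print("rank", rank)
    for text in grouped[rank]:
        print("   ", text)
```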
yangshuo0323 commented 3 years ago

@Justobe

Justobe commented 3 years ago

@yangshuo0323 Sorry, I have not run into that problem. The exception in my script is thrown when I use MXNet as the Keras backend.