dmlc / keras

Deep Learning library for Python. Convnets, recurrent neural networks, and more. Runs on MXNet, Theano or TensorFlow.
http://keras.io/

Unable to use Batchnormalization with multi-GPUs in MXNet backend #63

Open sandeep-krishnamurthy opened 7 years ago

sandeep-krishnamurthy commented 7 years ago

Summary

Unable to use batch normalization with the MXNet backend when using multiple GPUs. After debugging the issue, I found that there is a mismatch in the shape of a batchnorm param in the KVStore: in mxnet/model.py, the KVStore is initialized with a (64,) shape, but an update is then attempted with a (256,64,1,1) shape.
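To make the failure mode concrete, here is a minimal sketch (not part of the benchmark code; the key number and shapes are chosen to match the debug output below) of how the same KVStore check fails when a key is initialized with one shape and then pushed with another:

```python
import mxnet as mx

# Hypothetical minimal reproduction of the shape-check failure:
# key 4 is registered with a (64,) array, but a (256, 64, 1, 1) gradient
# is later pushed for the same key.
kv = mx.kv.create('local')
kv.init(4, mx.nd.zeros((64,)))           # KVStore now expects shape (64,) for key 4
kv.push(4, mx.nd.ones((256, 64, 1, 1)))  # fails the from.shape() == to->shape() check
```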

Stacktrace and Debug messages

Below are the stack trace and my debug messages from the "initialize_kvstore" and "update_params_on_kvstore" functions. Observe the param shape at index 4; there is a mismatch.

In initialize kvstore

kvstore - <mxnet.kvstore.KVStore object at 0x7fbfdb6729d0>
len of param_arrays - 304
len of arg_params - 304
len of param_names - 304
update_on_kvstore - True
Index - 0 Param name - normal1 Arg params - <NDArray 3x7x7 @cpu(0)>
Index - 1 Param name - convolution2d_1_b Arg params - <NDArray 1 @cpu(0)>
Index - 2 Param name - batchnormalization_1_running_mean Arg params - <NDArray 1 @cpu(0)>
Index - 3 Param name - batchnormalization_1_running_std Arg params - <NDArray 1 @cpu(0)>
Index - 4 Param name - batchnormalization_1_gamma Arg params - <NDArray 1 @cpu(0)>
arg_params in idx 4 - <NDArray 64 @cpu(0)>
param name at idx 4 - batchnormalization_1_gamma

In update_params_on_kvstore
param_arrays - 304
grad_arrays - 304
kvstore - <mxnet.kvstore.KVStore object at 0x7fbfdb6729d0>
Index - 0 arg_list[0] <NDArray 64x3x7x7 @gpu(0)> Current index - 0
Index - 1 arg_list[0] <NDArray 64 @gpu(0)> Current index - 1
Index - 2 arg_list[0] <NDArray 64 @gpu(0)> Current index - 2
Index - 3 arg_list[0] <NDArray 64 @gpu(0)> Current index - 3
Index - 4 arg_list[0] <NDArray 256x64x1x1 @gpu(0)> Current index - 4
Len of arg_list - 16
Len of grad_list - 16
Arg list[0] - <NDArray 256x64x1x1 @gpu(0)>
Grad list[0] - <NDArray 256x64x1x1 @gpu(0)>
[16:14:11] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:304: [16:14:11] src/ndarray/ndarray.cc:319: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (256,64,1,1) to.shape=(64,)

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc120b0a46c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayEPS0_i+0x546) [0x7fc12154c056]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore10CommDevice6ReduceEiRKSt6vectorINS_7NDArrayESaIS3_EEi+0x384) [0x7fc121925de4]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal4PushERKSt6vectorIiSaIiEERKS2_INS_7NDArrayESaIS7_EEi+0x175) [0x7fc121928015]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(MXKVStorePush+0x7b0) [0x7fc1218cbbc0]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc0b1ef8e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc0b1ef88ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fc0ba1083df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fc0ba10cd82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]

Traceback (most recent call last):
  File "/home/ubuntu/keras_benchmarks/test_cifar_resnet.py", line 131, in run_time, memory_usage = profile(train_model)
  File "/home/ubuntu/keras_benchmarks/profiler.py", line 84, in profile func_to_profile()
  File "/home/ubuntu/keras_benchmarks/test_cifar_resnet.py", line 125, in train_model validation_data=(X_test, Y_test))
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1559, in fit_generator class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1322, in train_on_batch outputs = self.train_function(ins)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1959, in train_function self._mod.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/module/bucketing_module.py", line 408, in update self._curr_module.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/module/module.py", line 575, in update self._kvstore)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/model.py", line 132, in _update_params_on_kvstore kvstore.push(index, grad_list, priority=-index)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/kvstore.py", line 162, in push ctypes.c_int(priority)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/base.py", line 85, in check_call raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:14:11] src/ndarray/ndarray.cc:319: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (256,64,1,1) to.shape=(64,)

Note: I used a ResNet50 architecture on the CIFAR dataset with batch size 32.

@mli @piiswrong @madjam @bhavinthaker

mli commented 7 years ago

Good catch. We are not going to fix this until kvstore is updated to accept string key IDs instead of ints.
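For reference, a sketch of what name-based (string) keys would look like, assuming an MXNet version newer than the 0.10.1 in the trace above; the parameter name is taken from the debug output:

```python
import mxnet as mx

# Hypothetical sketch: with string keys, each parameter is identified by name
# rather than by its position in the parameter list, so a reordering of that
# list cannot pair a gradient with the wrong KVStore entry.
kv = mx.kv.create('device')
kv.init('batchnormalization_1_gamma', mx.nd.zeros((64,)))
kv.push('batchnormalization_1_gamma', mx.nd.zeros((64,)))
kv.pull('batchnormalization_1_gamma', out=mx.nd.zeros((64,)))
```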

sandeep-krishnamurthy commented 7 years ago

Thanks Mu.

Most convolutional and fully connected networks require batch normalization. Can we still go ahead with the Keras beta release with this as a known issue?

mli commented 7 years ago

Another possible fix is to not push the batchnorm parameters into the kvstore; they don't actually need to be synchronized.
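A rough sketch of that idea, under the assumption that the update loop in mxnet/model.py is given the parameter names (the param_names argument is not part of the current _update_params_on_kvstore signature):

```python
def _update_params_on_kvstore(param_arrays, grad_arrays, kvstore, param_names):
    """Hypothetical variant that skips batchnorm parameters instead of
    synchronizing them through the kvstore."""
    for index, (arg_list, grad_list) in enumerate(zip(param_arrays, grad_arrays)):
        if grad_list[0] is None:
            continue
        if 'batchnormalization' in param_names[index]:
            continue  # leave batchnorm params to be updated locally on each device
        # push the gradients, then pull the updated weights back
        kvstore.push(index, grad_list, priority=-index)
        kvstore.pull(index, arg_list, priority=-index)
```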

madjam commented 7 years ago

@mli Can you please clarify what is causing this problem?