dmlc / keras

Deep Learning library for Python. Convnets, recurrent neural networks, and more. Runs on MXNet, Theano or TensorFlow.
http://keras.io/

Unable to use Batchnormalization with multi-GPUs in MXNet backend #63

Open sandeep-krishnamurthy opened 7 years ago

sandeep-krishnamurthy commented 7 years ago

Summary

Unable to use batch normalization with the MXNet backend when using multiple GPUs. After debugging the issue, I found that there is a mismatch in the shape of a batchnorm param in the KVStore: in mxnet/model.py, the KVStore is initialized with a (64,) shape, but an update is then attempted with a (256,64,1,1) shape.
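To make the failure mode concrete, here is a minimal sketch (not part of the benchmark code; the key number and shapes are chosen to match the debug output below) of how the same KVStore check fails when a key is initialized with one shape and then pushed with another:

```python
import mxnet as mx

# Hypothetical minimal reproduction of the shape-check failure:
# key 4 is registered with a (64,) array, but a (256, 64, 1, 1) gradient
# is later pushed for the same key.
kv = mx.kv.create('local')
kv.init(4, mx.nd.zeros((64,)))           # KVStore now expects shape (64,) for key 4
kv.push(4, mx.nd.ones((256, 64, 1, 1)))  # fails the from.shape() == to->shape() check
```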

Stacktrace and Debug messages

Below are the stack trace and my debug messages from the "initialize_kvstore" and "update_params_on_kvstore" functions. Observe the param shape at index 4; there is a mismatch.

In initialize kvstore

kvstore - <mxnet.kvstore.KVStore object at 0x7fbfdb6729d0>
len of param_arrays - 304
len of arg_params - 304
len of param_names - 304
update_on_kvstore - True
Index - 0 Param name - normal1 Arg params - <NDArray 3x7x7 @cpu(0)>
Index - 1 Param name - convolution2d_1_b Arg params - <NDArray 1 @cpu(0)>
Index - 2 Param name - batchnormalization_1_running_mean Arg params - <NDArray 1 @cpu(0)>
Index - 3 Param name - batchnormalization_1_running_std Arg params - <NDArray 1 @cpu(0)>
Index - 4 Param name - batchnormalization_1_gamma Arg params - <NDArray 1 @cpu(0)>
arg_params in idx 4 - <NDArray 64 @cpu(0)>
param name at idx 4 - batchnormalization_1_gamma

In update_params_on_kvstore
param_arrays - 304
grad_arrays - 304
kvstore - <mxnet.kvstore.KVStore object at 0x7fbfdb6729d0>
Index - 0 arg_list[0] <NDArray 64x3x7x7 @gpu(0)> Current index - 0
Index - 1 arg_list[0] <NDArray 64 @gpu(0)> Current index - 1
Index - 2 arg_list[0] <NDArray 64 @gpu(0)> Current index - 2
Index - 3 arg_list[0] <NDArray 64 @gpu(0)> Current index - 3
Index - 4 arg_list[0] <NDArray 256x64x1x1 @gpu(0)> Current index - 4
Len of arg_list - 16
Len of grad_list - 16
Arg list[0] - <NDArray 256x64x1x1 @gpu(0)>
Grad list[0] - <NDArray 256x64x1x1 @gpu(0)>
[16:14:11] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:304: [16:14:11] src/ndarray/ndarray.cc:319: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (256,64,1,1) to.shape=(64,)

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc120b0a46c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayEPS0_i+0x546) [0x7fc12154c056]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore10CommDevice6ReduceEiRKSt6vectorINS_7NDArrayESaIS3_EEi+0x384) [0x7fc121925de4]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal4PushERKSt6vectorIiSaIiEERKS2_INS_7NDArrayESaIS7_EEi+0x175) [0x7fc121928015]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(MXKVStorePush+0x7b0) [0x7fc1218cbbc0]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc0b1ef8e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc0b1ef88ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fc0ba1083df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fc0ba10cd82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]

Traceback (most recent call last):
  File "/home/ubuntu/keras_benchmarks/test_cifar_resnet.py", line 131, in run_time, memory_usage = profile(train_model)
  File "/home/ubuntu/keras_benchmarks/profiler.py", line 84, in profile func_to_profile()
  File "/home/ubuntu/keras_benchmarks/test_cifar_resnet.py", line 125, in train_model validation_data=(X_test, Y_test))
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1559, in fit_generator class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1322, in train_on_batch outputs = self.train_function(ins)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1959, in train_function self._mod.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/module/bucketing_module.py", line 408, in update self._curr_module.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/module/module.py", line 575, in update self._kvstore)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/model.py", line 132, in _update_params_on_kvstore kvstore.push(index, grad_list, priority=-index)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/kvstore.py", line 162, in push ctypes.c_int(priority)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/base.py", line 85, in check_call raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:14:11] src/ndarray/ndarray.cc:319: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (256,64,1,1) to.shape=(64,)

Note: I used a ResNet50 architecture on the CIFAR dataset with batch size 32.

@mli @piiswrong @madjam @bhavinthaker

mli commented 7 years ago

Good catch. We are not going to fix this until kvstore is updated to accept string key IDs instead of ints.
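For reference, a sketch of what name-based (string) keys would look like, assuming an MXNet version newer than the 0.10.1 in the trace above; the parameter name is taken from the debug output:

```python
import mxnet as mx

# Hypothetical sketch: with string keys, each parameter is identified by name
# rather than by its position in the parameter list, so a reordering of that
# list cannot pair a gradient with the wrong KVStore entry.
kv = mx.kv.create('device')
kv.init('batchnormalization_1_gamma', mx.nd.zeros((64,)))
kv.push('batchnormalization_1_gamma', mx.nd.zeros((64,)))
kv.pull('batchnormalization_1_gamma', out=mx.nd.zeros((64,)))
```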

sandeep-krishnamurthy commented 7 years ago

Thanks Mu.

Most convolutional and fully connected networks require batch normalization. Can we still go ahead with the Keras beta release with this as a known issue?

mli commented 7 years ago

Another possible fix is to not push the batchnorm parameters into the kvstore; they don't actually need to be synchronized.
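A rough sketch of that idea, under the assumption that the update loop in mxnet/model.py is given the parameter names (the param_names argument is not part of the current _update_params_on_kvstore signature):

```python
def _update_params_on_kvstore(param_arrays, grad_arrays, kvstore, param_names):
    """Hypothetical variant that skips batchnorm parameters instead of
    synchronizing them through the kvstore."""
    for index, (arg_list, grad_list) in enumerate(zip(param_arrays, grad_arrays)):
        if grad_list[0] is None:
            continue
        if 'batchnormalization' in param_names[index]:
            continue  # leave batchnorm params to be updated locally on each device
        # push the gradients, then pull the updated weights back
        kvstore.push(index, grad_list, priority=-index)
        kvstore.pull(index, arg_list, priority=-index)
```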

madjam commented 7 years ago

@mli Can you please clarify what is causing this problem?