XifengGuo / CapsNet-Keras

A Keras implementation of CapsNet from the NIPS 2017 paper "Dynamic Routing Between Capsules". Test error is now 0.34%.
MIT License

Multi GPU Training #14

Closed xiaoyongzhu closed 6 years ago

xiaoyongzhu commented 6 years ago

It looks like this repo does not support the multi-GPU model (`multi_gpu_model`) introduced in Keras 2.0.9. When I do this:


    from keras.utils import multi_gpu_model

    # replicate the model across GPUs only when more than one is requested
    if num_gpu > 1:
        model = multi_gpu_model(model, gpus=num_gpu)
    # compile the model
    model.compile(optimizer=optimizers.Adam(lr=args.lr),
                  loss=[margin_loss, 'mse'],
                  loss_weights=[1., args.lam_recon],
                  metrics={'out_caps': 'accuracy'})

This gives me the following error. It looks like the input layer does not handle the data well (I'm not sure about this, though).


2017-11-10 23:15:25.160851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 451d:00:00.0, compute capability: 3.7)
2017-11-10 23:15:25.160892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 7dcb:00:00.0, compute capability: 3.7)
Train on 60000 samples, validate on 10000 samples
2017-11-10 23:15:27.118862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 451d:00:00.0, compute capability: 3.7)
2017-11-10 23:15:27.118901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 7dcb:00:00.0, compute capability: 3.7)
Epoch 1/30
2017-11-10 23:15:31.162715: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.162970: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.167090: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.170465: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.170701: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.175048: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
Traceback (most recent call last):
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
         [[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "capsulenet.py", line 215, in <module>
    train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
  File "capsulenet.py", line 113, in train
    validation_data=[[x_test, y_test], [y_test, x_test]], callbacks=[log, tb, checkpoint, lr_decay])
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2332, in __call__
    **self.session_kwargs)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
         [[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

Caused by op 'replica_0/model_1/digitcaps/mul', defined at:
  File "capsulenet.py", line 215, in <module>
    train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
  File "capsulenet.py", line 103, in train
    model = multi_gpu_model(model, gpus=num_gpu)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/datadrive/xiaoyzhu/RandomExercise/CapsNet-Keras/capsulelayers.py", line 157, in call
    outputs = squash(K.sum(c * inputs_hat, 1, keepdims=True))
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
    "Mul", x=x, y=y, name=name)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
         [[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f16e115a828>>
Traceback (most recent call last):
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable
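
One possible reading of the shape mismatch above, offered as a sketch rather than a confirmed diagnosis: with a batch of 100 split over two GPUs, each replica would receive 50 samples, while the first operand of `c * inputs_hat` still appears to carry a batch dimension of 100. The batch size of 100 is inferred from the log, and `split_batch` and `broadcast_shape` are hypothetical helpers written only for this illustration. They are not part of Keras, TensorFlow, or this repo.

```python
def split_batch(batch_size, parts):
    """Per-replica batch sizes when one batch is divided across `parts` devices."""
    step = batch_size // parts
    # The last replica takes whatever remains after the even split.
    return [step] * (parts - 1) + [batch_size - step * (parts - 1)]


def broadcast_shape(a, b):
    """Return the broadcast shape of two equal-rank shapes, or raise ValueError."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles shapes of equal rank")
    out = []
    for x, y in zip(a, b):
        # Two dimensions are compatible when they are equal or one of them is 1.
        if x != y and 1 not in (x, y):
            raise ValueError("Incompatible shapes: %s vs. %s" % (list(a), list(b)))
        out.append(y if x == 1 else x)
    return tuple(out)


# Assumed setup: a batch of 100 divided over 2 GPUs.
per_replica = split_batch(100, 2)

# Shapes taken from the error message; the roles assigned to them are assumptions.
inputs_hat_shape = (per_replica[0], 1152, 10, 1, 16)  # sliced per replica
fixed_batch_shape = (100, 1152, 10, 1, 1)             # still sized for the full batch

try:
    broadcast_shape(fixed_batch_shape, inputs_hat_shape)
except ValueError as err:
    print(err)

# A tensor with batch dimension 1 broadcasts against any per-replica batch size.
print(broadcast_shape((1, 1152, 10, 1, 1), inputs_hat_shape))
```

If this reading is right, any tensor in the layer whose first dimension is fixed to the full batch size when the model is built would need to take its batch dimension from the incoming tensor instead, or use a dimension of 1 and rely on broadcasting.
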

XifengGuo commented 6 years ago

@xiaoyongzhu Thanks for your feedback. I'll test on multiple GPUs later. A PR is welcome if you can solve this.

XifengGuo commented 6 years ago

@xiaoyongzhu I have added multi-GPU support. Thanks for the feedback.