Closed arainbilal closed 4 years ago
Hello, unfortunately I have no idea why this may be going on. It seems to be a TensorFlow-related problem, though. When I first developed Bonnet I had to implement the multi-GPU training myself, but now TensorFlow has a lot of neat tools to do this automatically. I imagine that somewhere between 1.9 and 1.15 backward compatibility of some API was broken, but unfortunately I don't know the details, since I have been working with PyTorch for the last 2 years. Sorry for not being able to be of more help :/
Thanks; it certainly looks like a TF issue. I ran into a similar one when upgrading from 1.13 to 1.14, related to the device query. I am interested in investigating further and will report back if I can find the reason, or I may end up using some other version. There is a slight difference in my setup: I am using TF 1.15 compiled with CUDA 10.1, whereas this version is usually tested against CUDA 10.0. I don't see an apparent problem there, but I have a few more tests to do. I will close this issue after I have some more results or conclusions, which may be helpful for someone else.
Closing remarks: I have tested my TF/CUDA setup with DeepLab and found no problems there. This means I might have to customize a few things in Bonnet to keep backward compatibility. Since these local changes could not be generalized to other setups or Bonnet users, I am closing this issue.
Overview: This issue occurs when calling conv2d_transpose in upsample_layer under layers.py.
System config: TensorFlow 1.15.2, Ubuntu 18.04, CUDA 10.1, Bazel 0.26.1
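For context, here is a self-contained call of the same kind as the failing one: a stride-2 transposed convolution that doubles spatial resolution. This is a hedged sketch assuming the TF 1.x API (reachable as `tf.compat.v1` under TF 2.x); the shapes and the `upsample_2x` name are illustrative, not Bonnet's actual layer code.

```python
# Minimal sketch of the kind of call upsample_layer makes: a transposed
# convolution that doubles spatial resolution. Assumes the TF 1.x graph API
# via tf.compat.v1; names and shapes are illustrative, not from Bonnet.
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def upsample_2x(inputs, out_channels):
    """Stride-2 transposed conv: (N, H, W, C) -> (N, 2H, 2W, out_channels)."""
    return tf.layers.conv2d_transpose(
        inputs, filters=out_channels, kernel_size=4,
        strides=2, padding="same")

x = tf.placeholder(tf.float32, [None, 16, 16, 8])
y = upsample_2x(x, out_channels=4)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, {x: np.zeros((1, 16, 16, 8), np.float32)})
print(out.shape)  # (1, 32, 32, 4)
```

If a snippet like this runs on its own but fails inside the full graph, the problem is more likely in device placement than in conv2d_transpose itself.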
Source of this error: I think it has something to do with the variable-assign op being placed on the GPU. I created the following test to verify whether I can use the GPU/CPU to assign variables:
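The original test snippet is not preserved in this copy of the thread. A minimal sketch of such a device-assignment check might look like the following (assuming the TF 1.x graph API via `tf.compat.v1`; the helper name `check_assign_on_device` is illustrative, not from Bonnet):

```python
# Sketch of a device-placement check for variable assignment.
# Assumes the TF 1.x graph API (available as tf.compat.v1 under TF 2.x).
# The helper name check_assign_on_device is illustrative, not from Bonnet.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def check_assign_on_device(device="/cpu:0"):
    """Assign a variable on the given device and return the assigned value."""
    graph = tf.Graph()
    with graph.as_default():
        with tf.device(device):
            v = tf.get_variable("v", shape=[],
                                initializer=tf.zeros_initializer())
            assign_op = v.assign(3.0)
        init = tf.global_variables_initializer()
    # Disable soft placement so a bad device string fails loudly,
    # and log placements to see where each op actually lands.
    config = tf.ConfigProto(allow_soft_placement=False,
                            log_device_placement=True)
    with tf.Session(graph=graph, config=config) as sess:
        sess.run(init)
        return sess.run(assign_op)

print(check_assign_on_device("/cpu:0"))  # 3.0; try "/gpu:0" on a GPU machine
```

With `allow_soft_placement=False`, an op that cannot run on the requested device raises an error instead of silently falling back to the CPU, which is what makes this useful for isolating placement problems.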
The test code works fine. Any help/guidance to solve this issue will be appreciated. I can confirm that this issue does not occur when using TensorFlow 1.9; it only happens after upgrading to 1.15.2.