awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

smdebug causes an OperatorNotAllowedInGraphError inside a function decorated with tf.function #398

Closed horietakehiro closed 3 years ago

horietakehiro commented 4 years ago

I encountered an OperatorNotAllowedInGraphError during a training job with my own TensorFlow model, which has a method decorated with @tf.function. The traceback shows that the error is raised from smdebug/tensorflow/keras.py. Details are below.

### Entry point script (issue_reproducer.py)

```Py
import tensorflow as tf


class issueReproducer(tf.Module):

    def __init__(self, n_unit):
        """
        n_unit : int
        """
        self.variable = tf.Variable(tf.zeros((1, n_unit), dtype=tf.float32))
        self.l1 = tf.keras.layers.Dense(n_unit)
        self.optimizer = tf.optimizers.Adam()

    @tf.function
    def fit(self, tensor):
        """
        tensor : some tensor of shape : (1, n_unit)
        """
        with tf.GradientTape() as tape:
            output = self.l1(self.variable)
            loss = tf.reduce_sum(output - self.variable)
        # With the smdebug hook active, the traceback points at this call
        grad = tape.gradient(loss, self.variable)
        self.optimizer.apply_gradients([(grad, self.variable)])

        return self.variable


if __name__ == "__main__":

    model = issueReproducer(5)
    tensor = tf.constant([[1, 2, 3, 4, 5]], dtype=tf.float32)

    variable = model.fit(tensor)

    print("Returned variable : {}".format(variable))
```

### Traceback
```Py
Traceback (most recent call last):
  File "issue_reproducer.py", line 34, in <module>
    variable = model.fit(tensor)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 823, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 697, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2855, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3075, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 600, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3735, in bound_method_wrapper
    return wrapped_fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 973, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: in user code:

    issue_reproducer.py:23 fit  *
        grad = tape.gradient(loss, self.variable)
    /usr/local/lib/python3.7/site-packages/smdebug/tensorflow/keras.py:956 run  **
        (not grads or not vars)
    /usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:877 __bool__
        self._disallow_bool_casting()
    /usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:487 _disallow_bool_casting
        "using a `tf.Tensor` as a Python `bool`")
    /usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:474 _disallow_when_autograph_enabled
        " indicate you are trying to use an unsupported feature.".format(task))

    OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
```

I could avoid the error by disabling sagemaker-debugger's hook initialization, as shown below.

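from sagemaker.tensorflow import TensorFlow

# `role` and `sess` must already be defined (not shown here)
est = TensorFlow(
    entry_point='issue_reproducer.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='2.3.0',
    py_version='py37',
    debugger_hook_config=False,  # turns off the Debugger hook
    sagemaker_session=sess,
)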

Is this a bug, or is something wrong with my entry_point script? Thank you for your help.
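For context, the expression flagged in the traceback, `(not grads or not vars)`, asks Python for the truth value of its operands. Inside a `tf.function`, a tensor is symbolic and has no concrete value at trace time, so that conversion is rejected. The sketch below imitates this in plain Python, without TensorFlow or smdebug: `SymbolicTensor`, `has_missing_truthy`, and `has_missing_identity` are hypothetical names made up for illustration and are not part of either library. It also shows that the same truthiness check passes when the operands are lists, which is one plausible reading of why a single variable handed to `tape.gradient` trips the check. That reading is an inference from the traceback; nothing in this thread confirms it against smdebug itself.

```python
class SymbolicTensor:
    """Hypothetical stand-in for a graph-mode tensor (not a TensorFlow class)."""

    def __init__(self, name):
        self.name = name

    def __bool__(self):
        # A symbolic tensor has no value at trace time, so it cannot
        # be reduced to True or False.
        raise TypeError(
            f"using symbolic tensor {self.name!r} as a Python bool is not allowed"
        )


def has_missing_truthy(grads, variables):
    """Same shape as the flagged check: `not x` calls x.__bool__() or len(x)."""
    return not grads or not variables


def has_missing_identity(grads, variables):
    """Alternative check that only compares against None, never calling __bool__."""
    return grads is None or variables is None


grad = SymbolicTensor("grad")
var = SymbolicTensor("variable")

# Single tensors: the truthiness check raises.
try:
    has_missing_truthy(grad, var)
    truthy_check_failed = False
except TypeError as err:
    truthy_check_failed = True
    print(f"truthiness check failed: {err}")

# Lists of tensors: truthiness only looks at the list length, so no error.
list_check = has_missing_truthy([grad], [var])
print("truthiness check on lists:", list_check)

# Identity check: works for single tensors as well.
print("identity check:", has_missing_identity(grad, var))
```

In this imitation, the first check raises while the other two return `False`. Whether passing `[self.variable]` instead of `self.variable` avoids the error with the real hook would need to be tested against the installed smdebug version.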

horietakehiro commented 3 years ago

Sorry, I missed the note in the README.

Note: Debugger with zero script change is partially available for TensorFlow v2.1.0. The inputs, outputs, gradients, and layers built-in collections are currently not available for these TensorFlow versions.

Now I understand that the current sagemaker-debugger does not support my custom TensorFlow model.