ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Seemingly random shape error during gradient calculation #325

Closed sebpuetz closed 4 years ago

sebpuetz commented 5 years ago

Edit: An important point I forgot to mention: I did not encounter this issue with the CUDA backend.

Describe the current behavior
After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad method, which interleaves the multiples and the shape of the original tile op by stacking, transposing and reshaping. From what I could see by printing the input tensors and the intermediate steps in _TileGrad, something appears to go wrong during the interleaving. The interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.

The output of the transpose at one of these exceptions was:

 [[1036548730 1061580315]
 [-1110934980 -1085778476]
 [-1085903306 1061705196]]

resulting in the following interleaved shape: [1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
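
For reference, here is a minimal NumPy-only sketch of the interleaving that _TileGrad performs (the concrete values are the ones from the expected case above; this is an illustration, not the TensorFlow implementation itself):

import numpy as np

multiples = np.array([50, 1, 1])      # multiples passed to the Tile op (op.inputs[1])
input_shape = np.array([1, 21, 25])   # shape of the tensor that was tiled

stacked = np.stack([multiples, input_shape])  # [[50  1  1]
                                              #  [ 1 21 25]]
split_shape = stacked.T.reshape(-1)           # [50  1  1 21  1 25]
axes = np.arange(0, split_shape.size, 2)      # [0 2 4] -> the tiled dimensions
print(split_shape, axes)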

I wasn't able to find the related stack output or input shapes, so I can't tell if the shape error is caused by something further upstream. My reply to this issue below includes an example with parallel_iterations=1, showing all the steps.

A full stacktrace can be found at the bottom of this issue.

The error is somewhat hard to reproduce and seems to happen at random. I don't believe it is directly related to tf.while_loop, as the exception never occurred in an RNN layer.

Describe the expected behavior
No InvalidArgumentError during gradient calculation.

Code to reproduce the issue
I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it's not reliably reproducible I can't narrow it down easily.

import tensorflow as tf
import numpy as np

def loop_cond_dist(i, _l, hs, __ow, _dist):
    return tf.less(i, tf.shape(hs)[1])

def loop_body_dist(i, l, hs, out_weights, dist_lookup):
    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.expand_dims(dists, axis=0)
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1]) #Error seems to happen in gradients for this op
    cur = tf.einsum('ijk,kl -> ijl', dists, out_weights, name="out_mul")
    pre_pad = tf.zeros([tf.shape(l)[0], tf.shape(l)[1] - tf.reduce_sum(tf.range(tf.shape(hs)[1] - i + 1)), 2])
    post_pad = tf.zeros([tf.shape(l)[0], tf.reduce_sum(tf.range(tf.shape(hs)[1] - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    i += 1
    return i, tf.add(l, cur), hs, out_weights, dist_lookup

def build():
    dist_lookup = tf.get_variable('distance_embeds', dtype=tf.float32, shape=[51, 25])
    hs = tf.placeholder(dtype=tf.float32, shape=[None, None, 50])
    out_weights = tf.get_variable('out_weights', dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, tf.cast(((tf.shape(hs)[1] * tf.shape(hs)[1]) - tf.shape(hs)[1]) / 2, dtype=tf.float32), 2])
    loop_vars = [1, logits, hs, out_weights, dist_lookup]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits')[1]

    targets = tf.placeholder(tf.int32)

    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets, hs

if __name__ == "__main__":
    with tf.Session() as sess:
        train, y, hs = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            timesteps = np.random.randint(low=1, high=150)
            targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
            rand_hs = np.random.rand(50, timesteps, 50)
            _ = sess.run([train], {y: targets, hs: rand_hs})


Other info / logs

--------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: Size 2 must be non-negative, not -1110934980
     [[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
     [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
~/.cargo/toponn/python/bug.py in <module>
     45             targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
     46             rand_hs = np.random.rand(50, timesteps, 50)
---> 47             _ = sess.run([train], {y: targets, hs: rand_hs})
     48 

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: Size 2 must be non-negative, not -1110934980
     [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
     [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
  File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "</home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/decorator.py:decorator-gen-112>", line 2, in initialize
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
    self.init_code()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 288, in init_code
    self._run_cmd_line_code()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 408, in _run_cmd_line_code
    self._exec_file(fname, shell_futures=True)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 340, in _exec_file
    raise_exceptions=True)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2683, in safe_execfile
    self.compile if shell_futures else None)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/utils/py3compat.py", line 188, in execfile
    exec(compiler(f.read(), fname, 'exec'), glob, loc)

  File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
    train, y, hs = build()
  File "/home/seb/.cargo/toponn/python/bug.py", line 34, in build
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
    grad_loss=grad_loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
    unconnected_gradients)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
    input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'clause_logits/Tile', defined at:
  File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
[elided 10 identical lines from previous traceback]
  File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
    train, y, hs = build()
  File "/home/seb/.cargo/toponn/python/bug.py", line 29, in build
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=250)[1]
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
    return_same_structure)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/seb/.cargo/toponn/python/bug.py", line 13, in loop_body_dist
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
    "Tile", input=input, multiples=multiples, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Size 2 must be non-negative, not -1110934980
     [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
     [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]


sebpuetz commented 5 years ago

I reran the example code with parallel_iterations=1 to keep the print statements in order. This reveals that the input shapes to _TileGrad (url to code) are correct and that the incorrect shapes are introduced in the transpose_op.

I modified _TileGrad to output the various shapes inside the method:

import tensorflow as tf
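# Note: this is a patched copy of _TileGrad from tensorflow/python/ops/array_grad.py;
# array_ops, math_ops, ops and context are assumed to come from that module's existing imports.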
@ops.RegisterGradient("Tile")
def _TileGrad(op, grad):
  """Sum reduces grad along the tiled dimensions."""
  input_shape = array_ops.shape(op.inputs[0])
  # We interleave multiples and input_shape to get split_shape,
  # reshape grad to split_shape, and reduce along all even
  # dimensions (the tiled dimensions) to get the result
  # with shape input_shape.  For example
  #   input_shape = [20, 30, 40]
  #   multiples = [2, 3, 4]
  #   split_shape = [2, 20, 3, 30, 4, 40]
  #   axes = [0, 2, 4]
  with tf.control_dependencies([tf.print("input_shape:\n", input_shape)]):
    stack = array_ops.stack([op.inputs[1], input_shape])
    with tf.control_dependencies([tf.print("inputs[1]:\n", op.inputs[1])]):
      transpose = array_ops.transpose(stack)
      with tf.control_dependencies([tf.print("stack:\n", stack)]):
        transpose = tf.identity(transpose)
        with tf.control_dependencies([tf.print("transpose:\n", transpose)]):
          split_shape = array_ops.reshape(transpose, [-1])

  with tf.control_dependencies([tf.print("split_shape:\n", split_shape)]):
    axes = math_ops.range(0, array_ops.size(split_shape), 2)
  # Sum reduces grad along the first dimension for IndexedSlices
  if isinstance(grad, ops.IndexedSlices):
    grad = math_ops.unsorted_segment_sum(
        grad.values,
        math_ops.mod(grad.indices, input_shape[0]),
        input_shape[0])
    split_shape = array_ops.concat([[1], split_shape[1:]], axis=0)
  input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  # Fix shape inference
  if not context.executing_eagerly():
    input_grad.set_shape(op.inputs[0].get_shape())
  return [input_grad, None]

input_shape refers to the shape of the tensor to be tiled; inputs[1] refers to the multiples argument of the tile op.

Iteration before the exception:

input_shape:
 [1 40 25]
inputs[1]:
 [50 1 1]
stack:
 [[50 1 1]
 [1 40 25]]
transpose:
 [[50 1]
 [1 40]
 [1 25]]
split_shape:
 [50 1 1 40 1 25]

Iteration causing the exception:

input_shape:
 [1 41 25]
inputs[1]:
 [50 1 1]
stack:
 [[50 1 1]
 [1 41 25]]
transpose:
 [[-1076028028 -1108964952]
 [1071455618 1069826757]
 [1038518630 -1077656891]]
split_shape:
 [-1076028028 -1108964952 1071455618 1069826757 1038518630 -1077656891]

Stacktrace without ipython:

Traceback (most recent call last):
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1076028028
     [[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
     [[{{node gradients/clause_logits/Tile_grad/stack/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_359_gradients/clause_logits/Tile_grad/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 41, in <module>
    _ = sess.run([train], {y: targets, hs: rand_hs})
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1076028028
     [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at bug.py:30)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
     [[{{node gradients/clause_logits/Tile_grad/stack/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_359_gradients/clause_logits/Tile_grad/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
  File "bug.py", line 35, in <module>
    train, y, hs = build()
  File "bug.py", line 30, in build
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
    grad_loss=grad_loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
    unconnected_gradients)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
    input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'clause_logits/Tile', defined at:
  File "bug.py", line 35, in <module>
    train, y, hs = build()
  File "bug.py", line 25, in build
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=1)[1]
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
    return_same_structure)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "bug.py", line 11, in loop_body_dist
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1]) #Error seems to happen in gradients for this op
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
    "Tile", input=input, multiples=multiples, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Size 0 must be non-negative, not -1076028028
     [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at bug.py:30)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
     [[{{node gradients/clause_logits/Tile_grad/stack/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_359_gradients/clause_logits/Tile_grad/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
sunway513 commented 5 years ago

Hi @sebpuetz , could you provide the log with the following command: sudo /opt/rocm/bin/rocminfo

sebpuetz commented 5 years ago

rocminfo:

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0                                  
  Queue Min Size:          0                                  
  Queue Max Size:          0                                  
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768KB                            
  Chip ID:                 0                                  
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):3700                               
  BDFID:                   0                                  
  Compute Unit:            16                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49448920KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49448920KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26287                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1802                               
  BDFID:                   10240                              
  Compute Unit:            60                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  671089664                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***            
pricebenjamin commented 5 years ago

timesteps = np.random.randint(low=1, high=150)
targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])

If timesteps == 1, doesn't that mean targets would have shape [50, 0]?

Edit: The script still fails even if you increase the lower bound. The error is very hard to reproduce at a consistent step, even when setting np.random.seed and tf.set_random_seed. Setting parallel_iterations=1 in the call to tf.while_loop really slows down the process of checking if the code will fail. I'll let it run for awhile longer.

sebpuetz commented 5 years ago

If timesteps == 1, doesn't that mean targets would have shape [50, 0]?

That would only be the case if the maximum sequence length in a batch were 1.

This example is already narrowed down quite a bit; in the original model I am doing binary classification of the unique combinations of the output states of an RNN. The maximum sequence length should always be greater than 0 and isn't randomized in the original code. If that is what's causing the exception, it should be possible to set timesteps = 1 and check whether it fails immediately.

Edit: The script still fails even if you increase the lower bound. The error is very hard to reproduce at a consistent step, even when setting np.random.seed and tf.set_random_seed. Setting parallel_iterations=1 in the call to tf.while_loop really slows down the process of checking if the code will fail. I'll let it run for awhile longer.

With parallel_iterations set to a higher value it sometimes took up to 25 minutes for the error to occur. Thanks for looking into this, too!

pricebenjamin commented 5 years ago

@sebpuetz, were you able to produce a failure with parallel_iterations=1? I ran the code overnight (>800k steps with parallel_iterations=1 on Vega FE) and could not produce a failure. Perhaps the problem is a direct consequence of parallel execution of the while loop?

sebpuetz commented 5 years ago

I also assumed that the parallelization plays a role, but the script failed with parallel_iterations=1, too. That is where I got the output that made me assume something is going wrong in the transpose:

input_shape:
 [1 41 25]
inputs[1]:
 [50 1 1]
stack:
 [[50 1 1]
 [1 41 25]]
transpose:
 [[-1076028028 -1108964952]
 [1071455618 1069826757]
 [1038518630 -1077656891]]
split_shape:
 [-1076028028 -1108964952 1071455618 1069826757 1038518630 -1077656891]

I stripped the script down some more; with parallel_iterations=200 it crashed once after a little over 10 minutes, another time after just 2 minutes, and at times it doesn't crash even after more than 20 minutes. The randomness of this bug is rather frustrating: it's hard to tell whether I've found the cause or whether the exception simply hasn't been triggered by chance.

import tensorflow as tf
import numpy as np

def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)

def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.clip_by_value(tf.range(1, limit=200 - i + 1), 0, 24))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
    pre_pad = tf.zeros([50, 19900 - tf.reduce_sum(tf.range(200 - i + 1)), 2])
    post_pad = tf.zeros([50, tf.reduce_sum(tf.range(200 - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    return i + 1, tf.add(l, cur), x

def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, int((200*200-200)/2), 2])
    loop_vars = [1, logits, x]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=200)[1]
    targets = tf.placeholder(tf.int32)

    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets

if __name__ == "__main__":
    with tf.Session() as sess:
        train, y = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            targets = np.random.randint(low=0, high=2, size=[50, int((200*200-200)/2)])
            _ = sess.run([train], {y: targets})

Edit: Changing the script to only expand, tile and add a tensor does not seem to reproduce this bug. The changed script has been running for more than three hours with 200 parallel iterations and hasn't thrown an exception yet.

sebpuetz commented 5 years ago

Don't mean to nag, but can someone from AMD chime in on how to figure out what's going on here, or more importantly, how to work around this bug?

sunway513 commented 5 years ago

Hi @sebpuetz , I'm trying to reproduce the issue and will update when I have more clues.

sunway513 commented 5 years ago

Hi @sebpuetz , using your original code I was not able to reproduce any failures within 5 hours on two dev nodes. However, with your updated script using parallel_iterations=200, one node fails randomly, mostly within 30 minutes, while the other server node was able to run correctly overnight. We're reviewing the related implementations; please stay tuned.

sunway513 commented 5 years ago

@sebpuetz , could you try placing the tf.reduce_sum ops on the CPU, e.g.:

def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.clip_by_value(tf.range(1, limit=200 - i + 1), 0, 24))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
    with tf.device("/cpu:0"):
        pre_pad = tf.zeros([50, 19900 - tf.reduce_sum(tf.range(200 - i + 1)), 2])
        post_pad = tf.zeros([50, tf.reduce_sum(tf.range(200 - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    return i + 1, tf.add(l, cur), x

I have not been able to reproduce the failure so far with the above workaround.

sebpuetz commented 5 years ago

@sunway513, I'm running this now and will reply with results later.

sebpuetz commented 5 years ago

Your suggestion didn't break after running for roughly 40 minutes. I then tried a different version that doesn't contain the reduce_sum, which crashed after 30 minutes with a shape error. I'm now running your workaround again to see if it will break eventually.

import tensorflow as tf
import numpy as np

def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)

def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.zeros(200 - i, dtype=tf.int32))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
    return i + 1, tf.concat([l, cur], axis=1), x

def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 0, 2])
    loop_vars = [tf.constant(1), logits, x]
    shape_invariants = [tf.TensorShape(None), tf.TensorShape([50, None, 2]), tf.TensorShape([200, 2])]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, shape_invariants=shape_invariants, parallel_iterations=200)[1]
    targets = tf.placeholder(tf.int32)

    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets

if __name__ == "__main__":
    with tf.Session() as sess:
        train, y = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            targets = np.random.randint(low=0, high=2, size=[50, 19900])
            _ = sess.run([train], {y: targets})
whchung commented 5 years ago

@sunway513 / @sebpuetz I think I'll have to be able to reproduce it first before making further assumptions. Thus far I have not been able to reproduce it on my boxes...

Some initial analysis: I compared the number of GPU kernels used, and their histograms, across the various versions of the Python scripts from @sebpuetz in this ticket. Generally speaking, they all use the same set of GPU kernels, just with different intensity.

Let me check which rocPRIM commit is used in the r1.12-rocm release versus the latest one on the develop-upstream branch, and also check the implementation of array_ops.transpose().

whchung commented 5 years ago

I've not yet been able to reproduce the issue, so I can only offer a theory thus far: based on the logs from @sebpuetz, it seems the result of tf.transpose() within _TileGrad gets corrupted somehow. Based on the provided tests, the actual GPU kernel that implements tf.transpose() is: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/r1.12-rocm/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc#L184

However, the logic of this kernel is simple enough that it shouldn't cause any trouble. The only explanation I can think of so far is that some other GPU kernel has an out-of-bounds memory access and pollutes the GPU VRAM used by this kernel.

sebpuetz commented 5 years ago

The exception doesn't seem to be restricted to _TileGrad, but it does always seem to happen inside the tf.while_loop during some reshaping, although I assume the program spends most of its time in that loop, which should increase the odds of failing there.

This is the loop body of the program I have been using yesterday and today; it finally crashed after cumulatively more than 5 hours of runtime while reshaping during a forward pass.

Edit: Another observation: after not failing for such a long runtime, it now fails within minutes of starting (8 times within the first epoch, which is somewhere between 3 and 5 minutes). The exceptions occurred in different ops, too: twice in out_mul, once in non_lin_mul and five times during gradient calculation in different tile ops.

def loop_body_dist(i, l, hs, nonlin_weights, nonlin_bias, out_weights, out_bias, dist_lookup):
    prec = tf.tile(tf.expand_dims(hs[:, i - 1, :], axis=1), [1, tf.shape(hs)[1] - i, 1])

    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.tile(tf.expand_dims(dists, axis=0), [tf.shape(hs)[0], 1, 1])
    concat = tf.concat([prec, hs[:, i:, :], dists], axis=-1)

    nonlin_out = tf.add(tf.einsum('ijk,kl -> ijl', concat, nonlin_weights, name="non_lin_mul"), nonlin_bias, name="non_lin_bias_add")
    nonlin_out = tf.nn.relu(nonlin_out)
    cur = tf.add(tf.einsum('ijk,kl -> ijl', nonlin_out, out_weights, name="out_mul"), out_bias, name="out_bias")
    i += 1
    return i, tf.concat([l, cur], axis=1), hs, nonlin_weights, nonlin_bias, out_weights, out_bias, dist_lookup

InvalidArgumentError

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 450000 values, but the requested shape has 9000
     [[node model/clause/clause_logits/out_mul/Reshape (defined at /home/seb/.cargo/toponn/python/toponn/nn/rnn_model.py:206)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _class=["loc:@train...ul_1/f_acc"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/clause/clause_logits/Relu, model/clause/clause_logits/out_mul/Reshape/shape)]]
     [[{{node model/clause/SparseSoftmaxCrossEntropyWithLogits/assert_equal/All/_151}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2528_model/clause/SparseSoftmaxCrossEntropyWithLogits/assert_equal/All", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
sebpuetz commented 5 years ago

Wondering if there is anything new on this bug?

sunway513 commented 5 years ago

Hi @sebpuetz , we just rolled out TF 1.13, could you give that a shot? We are having difficulty reproducing the issue reliably and are still trying to root-cause it.

sunway513 commented 5 years ago

Hi @sebpuetz , we recently identified that the AdamOptimizer on GFX803 can potentially cause convergence issues, which might be the root cause of this random failure. Could you try placing the optimizer ops on the CPU?

with tf.device("/cpu:0"):
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
sebpuetz commented 5 years ago

Hi, thanks for the update. Do you also suspect GFX906 to be affected, since that is my GPU? I'll check both tf 1.13 and placing the op on the CPU tomorrow and report back with my findings.

sunway513 commented 5 years ago

Thanks @sebpuetz , I don't have evidence that the issue affects GFX906 boards. However, since your issue is very random, it would be worth trying out :-)

sebpuetz commented 5 years ago

I replaced Adam with GradientDescent and ran into the same problem; I guess this rules out Adam as the culprit in this issue? I have yet to install tf 1.13 and run the script there.

sunway513 commented 5 years ago

Thanks @sebpuetz , the information helps.

sebpuetz commented 5 years ago

1.13 also did not magically solve the issue:

2019-03-13 09:52:44.125063: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at reshape_op.h:51 : Invalid argument: Size 0 must be non-negative, not -1043333120
Traceback (most recent call last):
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1043333120
     [[{{node gradients/while/Tile_grad/Reshape_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 39, in <module>
    _ = sess.run([train], {y: targets})
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1043333120
     [[node gradients/while/Tile_grad/Reshape_1 (defined at bug.py:29) ]]

Caused by op 'gradients/while/Tile_grad/Reshape_1', defined at:
  File "bug.py", line 35, in <module>
    train, y = build()
  File "bug.py", line 29, in build
    train = tf.train.GradientDescentOptimizer(0.005).minimize(loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 403, in minimize
    grad_loss=grad_loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 664, in gradients
    unconnected_gradients)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 420, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 590, in _TileGrad
    input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 7179, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'while/Tile', defined at:
  File "bug.py", line 35, in <module>
    train, y = build()
  File "bug.py", line 23, in build
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=200)[1]
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "bug.py", line 11, in loop_body_dist
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 10105, in tile
    "Tile", input=input, multiples=multiples, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Size 0 must be non-negative, not -1043333120
     [[node gradients/while/Tile_grad/Reshape_1 (defined at bug.py:29) ]]
sebpuetz commented 5 years ago

Running the script on ROCm 2.2 / tf 1.13 still throws this exception.

It seems like failure is more likely when I'm running some resource-intensive task on the CPU. When running the script while e.g. compiling something or stressing the CPU on multiple cores, the exception gets thrown rather quickly. The script was running without issue for quite some time and threw an exception when I compiled a Rust project (which put all threads/cores at 100% according to htop). Afterwards, running the script without much going on in the background didn't cause any issues. Starting a compilation job or some other CPU-stressing job was then pretty much immediately followed by an exception (I tried this 4-5 times; the error followed within ~30s of stressing the CPU).

This is purely based on my observations and there's not much to support it beyond my experience with the crashes. Maybe someone else can reproduce the same findings, or it might just be chance.

sunway513 commented 5 years ago

Hi @sebpuetz , thanks for the updated description. Can you provide the specs of your local system?

sebpuetz commented 5 years ago

CPU: Ryzen 7 2700X
MB: ASRock X470 Gaming K4
RAM: Corsair CMK16GX4M2B3000C15 (2x8GB) and CMK32GX4M2B3000C15 (2x16GB)
GPU: Radeon VII

Should be all that's relevant?

sunway513 commented 5 years ago

Let me try to reproduce your observations locally.

sunway513 commented 5 years ago

@sebpuetz , I've tried to reproduce your issue on my local dev node (TR1950x + Radeon VII) by running your sample while compiling the HCC compiler with 32 threads concurrently, three times, and was not able to reproduce the failure. I can confirm the CPU utilization was 100% for all 32 threads on my system while running the tests. TensorFlow by default creates its own thread pool and maps it onto all visible CPU resources. Could you try limiting the TF CPU usage with the following patch? I hope that helps improve your system stability:

diff --git a/test.py b/test.py
index 5f36234..9455627 100644
--- a/test.py
+++ b/test.py
@@ -27,7 +27,7 @@ def build():

 if __name__ == "__main__":
-    with tf.Session() as sess:
+    with tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)) as sess:
         train, y = build()
         sess.run([tf.global_variables_initializer()])
         while True:
sebpuetz commented 5 years ago

Finally got around to testing this. After roughly an hour of stressing the CPU and running the script, still no crash.

sunway513 commented 5 years ago

Thanks @sebpuetz , I'm glad the proposed modification helps in your local environment and use case. I'll close this issue for now; please let us know if you have further questions.

sebpuetz commented 5 years ago

@sunway513 While this seems to be a valid workaround for the bug, there must be an underlying problem, as both @pricebenjamin and you were able to (at least sometimes) reproduce it. This implies that it's not due to my local environment but rather an issue in rocm-tf. Is the official solution then to not use op parallelism, or is the bug still being looked into? Thanks for the help so far.

sunway513 commented 5 years ago

Hi @sebpuetz , I've not been able to reproduce the issue with the TF 1.13 + ROCm 2.2 stack so far. We don't see an efficient way to triage when the issue cannot be reproduced reliably; thanks for your understanding.

Besides, we are constantly improving software stability by expanding our test coverage and consolidating the unit tests; we hope that will also help avoid such issues in the long run.

Please feel free to reopen this issue if you have other concerns or there's a more reliable way to reproduce the problem.

sebpuetz commented 5 years ago

Fwiw, inter_op_parallelism_threads seems to be the culprit. Setting that value to some large number makes it pretty easy to reproduce (without stressing the CPU) for me locally. Failure happened within the first 5000 steps several times.

The bug seems insensitive to intra_op_parallelism_threads.

Besides, we are constantly improving software stability by expanding our test coverage and consolidating the unit tests; we hope that will also help avoid such issues in the long run.

Guess I (or anyone else encountering this bug) have to hope that someone accidentally stumbles over the cause.

sunway513 commented 5 years ago

@sebpuetz , let me experiment with inter_op_parallelism_threads settings against this issue. As I've mentioned, we need reliable steps to reproduce the issue for further triaging :-)

sebpuetz commented 5 years ago

Thanks for reopening this. If I find time in the next few days, I'll also try to get some more insight. @twuebi owns the same GPU but otherwise entirely different hardware, and he ran into the same issues with ROCm 2.1 and tf 1.12. I'll ask him whether he can reproduce the issues on ROCm 2.2 and tf 1.13.

sunway513 commented 5 years ago

Thank you @sebpuetz . As an update, I've tried setting inter_op_parallelism_threads to 1024; the sample has been running for over 4 hours and is still going. I'll keep it running overnight. What was the "large number" you set to get the sample to crash reliably within 5000 steps?

sebpuetz commented 5 years ago

I set that number to 128, which is quite a lot more than the number of threads available on my machine. Running this last night threw exceptions quite reliably for me (5-6 runs with inter_op_parallelism_threads = 128 and intra_op_parallelism_threads = 1). I also tested the opposite settings, inter_op_parallelism_threads = 1 and intra_op_parallelism_threads = 128, but couldn't observe any failures; I hope that wasn't just coincidence. @twuebi said he would chime in tomorrow with results from his machine. Thanks again.
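
For reference, a minimal sketch of the two session configurations I compared (the rest of the script is unchanged from the one posted earlier):

import tensorflow as tf

# Configuration that threw the shape error quite reliably for me:
# many ops scheduled concurrently, one thread per op.
failing_config = tf.ConfigProto(inter_op_parallelism_threads=128,
                                intra_op_parallelism_threads=1)

# Opposite configuration, for which I couldn't observe any failures:
# sequential op scheduling, many threads per op.
stable_config = tf.ConfigProto(inter_op_parallelism_threads=1,
                               intra_op_parallelism_threads=128)

with tf.Session(config=failing_config) as sess:
    # ... build and run the reproduction graph ...
    pass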

sunway513 commented 5 years ago

Great if we can have more data points :-) Thanks!

twuebi commented 5 years ago

Hi,

I can reproduce the issue on Mint Linux (kernel 5.0.0-rc6) + tf-1.13.1 + rocm 2.2.31 with this file.


I ran four instances of the program in parallel; three have crashed so far, after 13.2k, 18.9k and 28.6k steps, and the fourth is currently at step 34.1k and still running.

The following files contain the respective crash messages:

core_dumped.txt traceback_1.txt traceback_2.txt


rocminfo rocm-dev

sunway513 commented 5 years ago

To update from my side, the sample is still running fine on my local setup with inter_op_parallelism_threads=1024; it has now been running for ~24 hours.

Thank you @twuebi , your data point is very helpful. It looks like your script is slightly different from the one posted in the original description; let me try yours and see how the result goes. One other difference is the system configuration: you are using an upstream kernel (which might not contain the up-to-date firmware for the Radeon VII), while I'm using the 4.15 kernel + ROCm 2.2.31 rock-dkms.

sebpuetz commented 5 years ago

I gave it another go on Ubuntu 18.04 with the 4.15.18 kernel and ROCm 2.2.31 rock-dkms, using the script @twuebi provided, and got the shape error after 673 steps in one run and after 5221 steps in another.

sunway513 commented 5 years ago

@sebpuetz @twuebi could you help try setting the following environment variable and see if the issue is still reproducible on your end?

export HIP_LAUNCH_BLOCKING=1

Related doc in HIP: https://github.com/ROCm-Developer-Tools/HIP/blob/cde661142552c9f81d6ddb6fdff5ec87ad4dc9e3/docs/markdown/hip_debugging.md#chicken-bits
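
If it's more convenient than exporting the variable in the shell, setting it from Python should work as well, as long as it happens before TensorFlow (and with it the HIP runtime) is loaded; a small sketch:

import os

# Has to be set before the HIP runtime is loaded, i.e. before importing TensorFlow.
# If in doubt, export HIP_LAUNCH_BLOCKING=1 in the shell before launching Python instead.
os.environ["HIP_LAUNCH_BLOCKING"] = "1"

import tensorflow as tf  # imported after setting the environment variable on purpose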

sebpuetz commented 5 years ago

Currently running two instances in parallel, both are still going at more than 14000 steps. Before testing with the environment variable, two instances crashed within the first 2000 steps.

I used Ubuntu 18.04 with kernel 4.15 and ROCm 2.2 for this.

sunway513 commented 5 years ago

Thanks @sebpuetz , please feel free to post any further updates. Do you see any visible performance degradation?

sebpuetz commented 5 years ago

Judging by steps per unit time, the flag makes the script run at about half speed.

sebpuetz commented 5 years ago

Once again a quick update about ROCm 2.3 + tf 1.13.2: the bug did not disappear.

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data:/data'
drun rocm/tensorflow:rocm2.3-tf1.13-python3
cd /data/bug
python3 bug.py

Failed in two instances after 626 and 3939 steps. stacktrace1.txt stacktrace2.txt

sebpuetz commented 5 years ago

I tried to get some more information by running the instructions over at https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/302#issuecomment-458706456:

export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3
export HIP_TRACE_API=0x2

Three instances didn't crash within the first 20000 steps. I started another instance without those flags, which again crashed within 700 steps.

I also tested whether I could simplify the script further and still observe crashes. The first script below crashed twice after ~8000 steps, while the second didn't fail in two runs of about 30000 steps each; the only difference between them is that the first looks up a range whose length depends on the loop index (limit=200 - i + 1), so the tiled tensor's shape changes between iterations, whereas the second uses a fixed range (limit=200).

# Crashed after about 8000 steps.
import tensorflow as tf

def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)

def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.range(1, limit=200 - i + 1))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
    post_pad = tf.zeros([50, 19900-tf.shape(cur)[1], 2])
    cur = tf.concat([post_pad, cur], axis=1)
    return i + 1, tf.add(l, cur), x

def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 19900, 2])
    loop_vars = [tf.constant(1), logits, x]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=2000)[1]

    train = tf.train.GradientDescentOptimizer(0.005).minimize(logits)
    return train

if __name__ == "__main__":
    config = tf.ConfigProto(inter_op_parallelism_threads=200, intra_op_parallelism_threads=1)
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        step = 0
        train = build()
        sess.run([tf.global_variables_initializer()])
        try:
            while True:
                step += 1
                _ = sess.run([train])
                if step % 100 == 0:
                    print(step)
        except Exception as exc:
            print(exc)
            print(step)

Did not fail:

# No crashes observed after more than 30000 steps
import tensorflow as tf

def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)

def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.range(1, limit=200))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
    post_pad = tf.zeros([50, 19900-tf.shape(cur)[1], 2])
    cur = tf.concat([post_pad, cur], axis=1)
    return i + 1, tf.add(l, cur), x

def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 19900, 2])
    loop_vars = [tf.constant(1), logits, x]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=2000)[1]

    train = tf.train.GradientDescentOptimizer(0.005).minimize(logits)
    return train

if __name__ == "__main__":
    config = tf.ConfigProto(inter_op_parallelism_threads=200, intra_op_parallelism_threads=1)
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        step = 0
        train = build()
        sess.run([tf.global_variables_initializer()])
        try:
            while True:
                step += 1
                _ = sess.run([train])
                if step % 100 == 0:
                    print(step)
        except Exception as exc:
            print(exc)
            print(step)
twuebi commented 5 years ago

Update:

  1. I also ran into the issue with the docker image provided in #316 while running a graph containing this subgraph.

  2. I got a possibly related error within the #316 docker when running a similar graph with tf.where + tf.gather_nd and no while loop:

thread 'main' panicked at 'Cannot run graph: {inner:0x55a7903c1620, InvalidArgument: WhereOp: Race condition between counting the number of true elements and writing them. When counting, saw 27308456 elements; but when writing their indices, saw 223737 elements. [[{{node model/clause/Where}}]] [[train/gradients/model/clause/clause_loss_grad/floordiv_1/_363]]}', src/libcore/result.rs:1009:5

I attempted to replicate the bug with a graph along these lines, but couldn't observe another instance of the error:

    # (using `dot` where the original snippet wrote `r`, i.e. the matmul result)
    hidden = tf.get_variable(shape=[50, 55, 100], dtype=tf.float32, name="var")
    dot = tf.matmul(hidden, hidden, transpose_b=True)
    # strictly upper-triangular boolean mask over the last two dims of `dot`
    tril = tf.linalg.LinearOperatorLowerTriangular(tf.ones(tf.shape(dot)[-2:])).to_dense()
    tril = tf.cast(tril, tf.bool)
    tril = tf.logical_not(tril)
    s = tf.where(tf.tile(tf.expand_dims(tril, 0), [tf.shape(dot)[0], 1, 1]))
    gather = tf.gather_nd(dot, s)
    train = tf.train.GradientDescentOptimizer(0.005).minimize(gather)

This is a picture of the subgraph containing the failing tf.where; the input to the failing op is again a tile operation.

subgraph

  3. I am also experiencing quite regular memory access faults in the #316 docker, both on upstream kernel 5 with ROCm 2.2 and on kernel 4.15 with ROCm 2.3 + rock-dkms. I will create another issue for this, but the memory issues seem possibly related to this shape mess?

Memory access fault by GPU node-1 (Agent handle: 0x558d2db05d80) on address 0x7f0bbda28000. Reason: Page not present or supervisor privilege.

rocm-dev-update.txt rocminfo-update.txt

twuebi commented 5 years ago

Update 2:

Following the temperature-related findings in #414, I attempted to train another instance of a while-loop-less network from Rust with fixed fan speeds (75%, ~50-60°C). It crashed after 15 training epochs with a shape-related error:

thread 'main' panicked at 'Cannot run graph: {inner:0x556ae3111200, InvalidArgument: Input to reshape is a tensor with 7552 values, but the requested shape has 0 [[{{node train/gradients/model/base_model/layer_3/self_attn/LayerNorm/moments/mean_grad/Reshape}}]] [[{{node train/grad_norm_op}}]]}', src/libcore/result.rs:997:5

I am now looking into reproducing the bug with this script.