Closed sebpuetz closed 4 years ago
I reran the example code with parallel_iterations=1 to keep the print statements in order. This reveals that the input shapes to _TileGrad (url to code) are correct and that the incorrect shapes are introduced in the transpose_op.
I modified _TileGrad to output the various shapes inside the method:
import tensorflow as tf

@ops.RegisterGradient("Tile")
def _TileGrad(op, grad):
  """Sum reduces grad along the tiled dimensions."""
  input_shape = array_ops.shape(op.inputs[0])
  # We interleave multiples and input_shape to get split_shape,
  # reshape grad to split_shape, and reduce along all even
  # dimensions (the tiled dimensions) to get the result
  # with shape input_shape. For example
  #   input_shape = [20, 30, 40]
  #   multiples = [2, 3, 4]
  #   split_shape = [2, 20, 3, 30, 4, 40]
  #   axes = [0, 2, 4]
  with tf.control_dependencies([tf.print("input_shape:\n", input_shape)]):
    stack = array_ops.stack([op.inputs[1], input_shape])
  with tf.control_dependencies([tf.print("inputs[1]:\n", op.inputs[1])]):
    transpose = array_ops.transpose(stack)
  with tf.control_dependencies([tf.print("stack:\n", stack)]):
    transpose = tf.identity(transpose)
  with tf.control_dependencies([tf.print("transpose:\n", transpose)]):
    split_shape = array_ops.reshape(transpose, [-1])
  with tf.control_dependencies([tf.print("split_shape:\n", split_shape)]):
    axes = math_ops.range(0, array_ops.size(split_shape), 2)
  # Sum reduces grad along the first dimension for IndexedSlices
  if isinstance(grad, ops.IndexedSlices):
    grad = math_ops.unsorted_segment_sum(
        grad.values,
        math_ops.mod(grad.indices, input_shape[0]),
        input_shape[0])
    split_shape = array_ops.concat([[1], split_shape[1:]], axis=0)
  input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  # Fix shape inference
  if not context.executing_eagerly():
    input_grad.set_shape(op.inputs[0].get_shape())
  return [input_grad, None]
input_shape refers to the shape of the tensor to be tiled, inputs[1] refers to the multiples argument of the tile_op.
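For readers who want to see what the interleaving is supposed to produce, here is a minimal NumPy sketch of the dense path of _TileGrad. It is not the TensorFlow implementation; the helper name tile_grad_reference is made up for illustration, and the shapes are taken from the "iteration before the exception" log.

```python
import numpy as np

def tile_grad_reference(grad, input_shape, multiples):
    """NumPy mirror of the dense path of _TileGrad (illustrative only)."""
    # Stack multiples on top of input_shape, then transpose and flatten
    # so that the two vectors end up interleaved.
    stack = np.stack([multiples, input_shape])        # shape [2, rank]
    split_shape = stack.T.reshape(-1)                  # [m0, d0, m1, d1, ...]
    # The even positions hold the multiples, i.e. the tiled dimensions.
    axes = tuple(range(0, split_shape.size, 2))
    input_grad = grad.reshape(tuple(split_shape)).sum(axis=axes)
    return split_shape, input_grad

input_shape = np.array([1, 40, 25])
multiples = np.array([50, 1, 1])
# Gradient of the tiled tensor: all ones, shape [50, 40, 25].
grad = np.ones(tuple(input_shape * multiples), dtype=np.float32)

split_shape, input_grad = tile_grad_reference(grad, input_shape, multiples)
print(split_shape.tolist())  # [50, 1, 1, 40, 1, 25]
print(input_grad.shape)      # (1, 40, 25)
```

With an all-ones gradient every entry of input_grad is 50.0, because each input element was copied 50 times by the tile.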
Iteration before the exception:
input_shape:
[1 40 25]
inputs[1]:
[50 1 1]
stack:
[[50 1 1]
[1 40 25]]
transpose:
[[50 1]
[1 40]
[1 25]]
split_shape:
[50 1 1 40 1 25]
Iteration causing the exception:
input_shape:
[1 41 25]
inputs[1]:
[50 1 1]
stack:
[[50 1 1]
[1 41 25]]
transpose:
[[-1076028028 -1108964952]
[1071455618 1069826757]
[1038518630 -1077656891]]
split_shape:
[-1076028028 -1108964952 1071455618 1069826757 1038518630 -1077656891]
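One way to inspect the corrupted values is to reinterpret their 32 bits as IEEE-754 float32. The sketch below does this with the standard library only; the helper name int32_as_float32 is made up for illustration. Decoded this way, all six values are floats with magnitude below 2, which would be consistent with float data ending up in the int32 shape buffer, but it is only an observation about the printed numbers and not a confirmed diagnosis.

```python
import struct

def int32_as_float32(value):
    """Reinterpret the bits of a signed 32-bit integer as a float32."""
    return struct.unpack("<f", struct.pack("<i", value))[0]

# Values copied from the corrupted split_shape in the log above.
corrupted = [-1076028028, -1108964952, 1071455618,
             1069826757, 1038518630, -1077656891]

decoded = [int32_as_float32(v) for v in corrupted]
for raw, value in zip(corrupted, decoded):
    print(raw, "->", value)
```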
Stacktrace without ipython:
Traceback (most recent call last):
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1076028028
[[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/stack/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_359_gradients/clause_logits/Tile_grad/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bug.py", line 41, in <module>
_ = sess.run([train], {y: targets, hs: rand_hs})
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1076028028
[[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at bug.py:30) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/stack/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_359_gradients/clause_logits/Tile_grad/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
File "bug.py", line 35, in <module>
train, y, hs = build()
File "bug.py", line 30, in build
train = tf.train.AdamOptimizer(0.005).minimize(loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
grad_loss=grad_loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
unconnected_gradients)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
return grad_fn() # Exit early
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'clause_logits/Tile', defined at:
File "bug.py", line 35, in <module>
train, y, hs = build()
File "bug.py", line 25, in build
logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=1)[1]
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
return_same_structure)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "bug.py", line 11, in loop_body_dist
dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1]) #Error seems to happen in gradients for this op
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
"Tile", input=input, multiples=multiples, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Size 0 must be non-negative, not -1076028028
[[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at bug.py:30) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/stack/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_359_gradients/clause_logits/Tile_grad/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
Hi @sebpuetz , could you provide the log with the following command:
sudo /opt/rocm/bin/rocminfo
rocminfo:
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (number of timestamp)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 2700X Eight-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768KB
Chip ID: 0
Cacheline Size: 64
Max Clock Frequency (MHz):3700
BDFID: 0
Compute Unit: 16
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 49448920KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 49448920KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx906
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128
Queue Min Size: 4096
Queue Max Size: 131072
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16KB
Chip ID: 26287
Cacheline Size: 64
Max Clock Frequency (MHz):1802
BDFID: 10240
Compute Unit: 60
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64
Workgroup Max Size: 1024
Workgroup Max Size Per Dimension:
Dim[0]: 67109888
Dim[1]: 671089664
Dim[2]: 0
Grid Max Size: 4294967295
Waves Per CU: 40
Max Work-item Per CU: 2560
Grid Max Size per Dimension:
Dim[0]: 4294967295
Dim[1]: 4294967295
Dim[2]: 4294967295
Max number Of fbarriers Per Workgroup:32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Dimension:
Dim[0]: 67109888
Dim[1]: 1024
Dim[2]: 16777217
Workgroup Max Size: 1024
Grid Max Dimension:
x 4294967295
y 4294967295
z 4294967295
Grid Max Size: 4294967295
FBarrier Max Size: 32
*** Done ***
timesteps = np.random.randint(low=1, high=150)
targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
If timesteps == 1, doesn't that mean targets would have shape [50, 0]?
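A quick way to check this without building the graph is to evaluate the size expression on its own. This is a small sketch; num_pairs is a made-up helper name for the expression used in the script.

```python
import numpy as np

def num_pairs(timesteps):
    """Number of unique pairs of timesteps, as in the size expression above."""
    return (timesteps * timesteps - timesteps) // 2

for t in (1, 2, 149):
    targets = np.random.randint(low=0, high=2, size=[50, num_pairs(t)])
    print(t, targets.shape)
# 1 (50, 0)
# 2 (50, 1)
# 149 (50, 11026)
```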
Edit: The script still fails even if you increase the lower bound. The error is very hard to reproduce at a consistent step, even when setting np.random.seed and tf.set_random_seed. Setting parallel_iterations=1 in the call to tf.while_loop really slows down the process of checking if the code will fail. I'll let it run for a while longer.
If timesteps == 1, doesn't that mean targets would have shape [50, 0]?
That would be the case if the maximum sequence length in a batch were 1.
This example is narrowed down quite a bit. In the original model I am doing a binary classification of the unique combinations of the output states of an RNN. The maximum sequence length should always be greater than 0 and isn't randomized in the original code. If that is causing the exception, it should be possible to set timesteps = 1 and check if it fails immediately.
Edit: The script still fails even if you increase the lower bound. The error is very hard to reproduce at a consistent step, even when setting np.random.seed and tf.set_random_seed. Setting parallel_iterations=1 in the call to tf.while_loop really slows down the process of checking if the code will fail. I'll let it run for a while longer.
With parallel_iterations set to a higher value it sometimes took up to 25 minutes for the error to occur. Thanks for looking into this, too!
@sebpuetz, were you able to produce a failure with parallel_iterations=1? I ran the code overnight (>800k steps with parallel_iterations=1 on Vega FE) and could not produce a failure. Perhaps the problem is a direct consequence of parallel execution of the while loop?
I also assumed that the parallelization plays a role, but the script failed with parallel_iterations=1, too. That is where I got the output that made me assume something is going wrong in the transpose:
input_shape:
[1 41 25]
inputs[1]:
[50 1 1]
stack:
[[50 1 1]
[1 41 25]]
transpose:
[[-1076028028 -1108964952]
[1071455618 1069826757]
[1038518630 -1077656891]]
split_shape:
[-1076028028 -1108964952 1071455618 1069826757 1038518630 -1077656891]
I stripped the script down some more. With parallel_iterations=200 it crashed once after a little over 10 minutes, another time after just 2 minutes, and at times it doesn't crash after more than 20 minutes. The randomness of this bug is rather frustrating: it's hard to tell whether I found the cause or whether it's just not encountering the exception by chance.
import tensorflow as tf
import numpy as np


def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)


def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.clip_by_value(tf.range(1, limit=200 - i + 1), 0, 24))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1])  # Error seems to happen in gradients for this op
    pre_pad = tf.zeros([50, 19900 - tf.reduce_sum(tf.range(200 - i + 1)), 2])
    post_pad = tf.zeros([50, tf.reduce_sum(tf.range(200 - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    return i + 1, tf.add(l, cur), x


def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, int((200*200-200)/2), 2])
    loop_vars = [1, logits, x]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=200)[1]
    targets = tf.placeholder(tf.int32)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets


if __name__ == "__main__":
    with tf.Session() as sess:
        train, y = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            targets = np.random.randint(low=0, high=2, size=[50, int((200*200-200)/2)])
            _ = sess.run([train], {y: targets})
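For reference, the padding arithmetic in loop_body_dist can be checked in plain Python, independent of TensorFlow. The sketch below recomputes the three widths that get concatenated along axis 1 for every loop index and confirms they always add up to 19900, so the shapes the script asks for are consistent. The helper name segment_widths is made up for illustration.

```python
def segment_widths(i, total=19900, steps=200):
    """Widths of pre_pad, cur and post_pad along axis 1 for loop index i."""
    n = steps - i
    pre = total - sum(range(n + 1))   # 19900 - reduce_sum(range(200 - i + 1))
    cur = n                           # len(range(1, 200 - i + 1))
    post = sum(range(n))              # reduce_sum(range(200 - i))
    return pre, cur, post

# The while loop runs for i = 1 .. 199.
for i in range(1, 200):
    pre, cur, post = segment_widths(i)
    assert pre >= 0 and post >= 0
    assert pre + cur + post == 19900

print(segment_widths(1))    # (0, 199, 19701)
print(segment_widths(199))  # (19899, 1, 0)
```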
Edit: Changing the script to only expand, tile and add a tensor does not seem to reproduce this bug. The changed script has been running for more than three hours with 200 parallel iterations and hasn't thrown an exception yet.
Don't mean to nag, but can someone from AMD chime in on how to figure out what's going on here, or more importantly, how to work around this bug?
Hi @sebpuetz, I'm trying to reproduce the issue and will update when I have more clues.
Hi @sebpuetz, using your original code I was not able to repro any failures within 5 hours on two dev nodes.
However, with your updated script using parallel_iterations=200, one node fails randomly, mostly within 30 minutes, while the other server node ran correctly overnight.
We're reviewing the related implementations, please stay tuned.
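Since time-to-failure varies so much between runs and machines, it can help to record it in a uniform way. The following is a minimal sketch of such a harness, not part of any of the scripts in this thread; run_until_failure and flaky_step are made-up names, and flaky_step merely simulates a training step that fails on its fifth call.

```python
import time

def run_until_failure(step, max_steps=None, max_seconds=None):
    """Call step() repeatedly.

    Returns (completed_steps, elapsed_seconds, exception_or_None).
    """
    start = time.monotonic()
    completed = 0
    while True:
        elapsed = time.monotonic() - start
        if max_steps is not None and completed >= max_steps:
            return completed, elapsed, None
        if max_seconds is not None and elapsed >= max_seconds:
            return completed, elapsed, None
        try:
            step()
        except Exception as exc:  # e.g. an InvalidArgumentError from sess.run
            return completed, time.monotonic() - start, exc
        completed += 1

# Simulated step that raises on its fifth call.
calls = {"n": 0}

def flaky_step():
    calls["n"] += 1
    if calls["n"] == 5:
        raise ValueError("simulated shape error")

steps, seconds, error = run_until_failure(flaky_step, max_steps=100)
print(steps, type(error).__name__)  # 4 ValueError
```

In the scripts here, step would wrap the sess.run call inside the training loop.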
@sebpuetz, could you try to place tf.reduce_sum on the CPU, e.g.:
def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.clip_by_value(tf.range(1, limit=200 - i + 1), 0, 24))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1])  # Error seems to happen in gradients for this op
    with tf.device("/cpu:0"):
        pre_pad = tf.zeros([50, 19900 - tf.reduce_sum(tf.range(200 - i + 1)), 2])
        post_pad = tf.zeros([50, tf.reduce_sum(tf.range(200 - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    return i + 1, tf.add(l, cur), x
I'm not able to repro the failure so far with the above workaround.
@sunway513, running this now, I'll reply with results again later.
Your suggestion didn't break after running for roughly 40 minutes. I then tried a different version that doesn't contain the reduce_sum, which crashed after 30 minutes with a shape error. I'm now running your workaround again to see if it will break eventually.
import tensorflow as tf
import numpy as np


def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)


def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.zeros(200 - i, dtype=tf.int32))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1])  # Error seems to happen in gradients for this op
    return i + 1, tf.concat([l, cur], axis=1), x


def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 0, 2])
    loop_vars = [tf.constant(1), logits, x]
    shape_invariants = [tf.TensorShape(None), tf.TensorShape([50, None, 2]), tf.TensorShape([200, 2])]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, shape_invariants=shape_invariants, parallel_iterations=200)[1]
    targets = tf.placeholder(tf.int32)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets


if __name__ == "__main__":
    with tf.Session() as sess:
        train, y = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            targets = np.random.randint(low=0, high=2, size=[50, 19900])
            _ = sess.run([train], {y: targets})
@sunway513 / @sebpuetz I think I'll have to be able to reproduce it first before making further assumptions. So far I have not been able to reproduce it on my boxes.
Some initial analysis: I compared the number of GPU kernels used, and their histograms, across the various versions of the Python scripts from @sebpuetz in this ticket. Generally speaking they all use the same set of GPU kernels, just with a different intensity.
Let me check which commit of rocPRIM is used in the r1.12-rocm release versus the latest one on the develop-upstream branch, and also check the implementation of array_ops.transpose().
I have yet to reproduce the issue, so I can only offer my theory thus far:
Based on the logs from @sebpuetz, it seems the result of tf.transpose() within _TileGrad gets corrupted somehow.
Based on the provided tests, the actual GPU kernel that implements tf.transpose() is:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/r1.12-rocm/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc#L184
However, the logic of this kernel is simple enough that it shouldn't cause any trouble. The only explanation I can think of so far is that some other GPU kernel performs an out-of-bounds (OOB) memory access and pollutes the GPU VRAM used by this kernel.
The exception doesn't seem to be restricted to _TileGrad, but it does seem to always happen inside the tf.while_loop during some reshaping. That said, I'm assuming the program spends most of its time in that loop, which should increase the odds of failing there.
This is the loop body of the program I have been using yesterday and today. It finally crashed after cumulatively more than 5 hours of runtime, while reshaping during a forward pass.
Edit: Another observation is that after it had not failed for such a long run time, it now fails within minutes after starting (8 times within the first epoch, which would be somewhere between 3 and 5 minutes). The exceptions occurred in different ops, too: twice in out_mul, once in non_lin_mul and five times during gradient calculation in different tile_ops.
def loop_body_dist(i, l, hs, nonlin_weights, nonlin_bias, out_weights, out_bias, dist_lookup):
    prec = tf.tile(tf.expand_dims(hs[:, i - 1, :], axis=1), [1, tf.shape(hs)[1] - i, 1])
    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.tile(tf.expand_dims(dists, axis=0), [tf.shape(hs)[0], 1, 1])
    concat = tf.concat([prec, hs[:, i:, :], dists], axis=-1)
    nonlin_out = tf.add(tf.einsum('ijk,kl -> ijl', concat, nonlin_weights, name="non_lin_mul"), nonlin_bias, name="non_lin_bias_add")
    nonlin_out = tf.nn.relu(nonlin_out)
    cur = tf.add(tf.einsum('ijk,kl -> ijl', nonlin_out, out_weights, name="out_mul"), out_bias, name="out_bias")
    i += 1
    return i, tf.concat([l, cur], axis=1), hs, nonlin_weights, nonlin_bias, out_weights, out_bias, dist_lookup
InvalidArgumentError
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 450000 values, but the requested shape has 9000
[[node model/clause/clause_logits/out_mul/Reshape (defined at /home/seb/.cargo/toponn/python/toponn/nn/rnn_model.py:206) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _class=["loc:@train...ul_1/f_acc"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/clause/clause_logits/Relu, model/clause/clause_logits/out_mul/Reshape/shape)]]
[[{{node model/clause/SparseSoftmaxCrossEntropyWithLogits/assert_equal/All/_151}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2528_model/clause/SparseSoftmaxCrossEntropyWithLogits/assert_equal/All", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Wondering if there is anything new on this bug?
Hi @sebpuetz, we just rolled out TF1.13, could you give that a shot? We have difficulty reliably reproducing the issue and are still trying to root-cause it.
Hi @sebpuetz, we recently identified that the AdamOptimizer on GFX803 can potentially cause convergence issues, which might be the root cause of this random failure. Could you try to place the operations on the CPU?
with tf.device("/cpu:0"):
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
Hi, thanks for the update. Do you also suspect the GFX 906 to be affected? Since that would be my GPU. I'll check both tf 1.13 and placing the OP on the CPU tomorrow and check back with my findings.
Thanks @sebpuetz , I don't have evidence the issue can affect GFX906 boards. However, since your issue is very random, it can be helpful to try it out :-)
I replaced Adam with GradientDescent and ran into the same problem. I guess this would rule out Adam as the culprit in this issue?
I have yet to install tf 1.13 and run the script there.
Thanks @sebpuetz , the information helps.
1.13 also did not magically solve the issue:
2019-03-13 09:52:44.125063: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at reshape_op.h:51 : Invalid argument: Size 0 must be non-negative, not -1043333120
Traceback (most recent call last):
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1043333120
[[{{node gradients/while/Tile_grad/Reshape_1}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bug.py", line 39, in <module>
_ = sess.run([train], {y: targets})
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Size 0 must be non-negative, not -1043333120
[[node gradients/while/Tile_grad/Reshape_1 (defined at bug.py:29) ]]
Caused by op 'gradients/while/Tile_grad/Reshape_1', defined at:
File "bug.py", line 35, in <module>
train, y = build()
File "bug.py", line 29, in build
train = tf.train.GradientDescentOptimizer(0.005).minimize(loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 403, in minimize
grad_loss=grad_loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 512, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 664, in gradients
unconnected_gradients)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 420, in _MaybeCompile
return grad_fn() # Exit early
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 590, in _TileGrad
input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 7179, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'while/Tile', defined at:
File "bug.py", line 35, in <module>
train, y = build()
File "bug.py", line 23, in build
logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=200)[1]
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "bug.py", line 11, in loop_body_dist
cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 10105, in tile
"Tile", input=input, multiples=multiples, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Size 0 must be non-negative, not -1043333120
[[node gradients/while/Tile_grad/Reshape_1 (defined at bug.py:29) ]]
Running the script on ROCm 2.2 / tf 1.13 still throws this exception.
It seems like failure is more likely when I'm running some resource intensive task on the CPU. When running the script and e.g. compiling something or stressing the CPU on multiple cores, the exception gets thrown rather quickly. The script was running without issue for quite some time and threw an exception when I compiled a Rust project (this put all threads/cores to 100% according to htop). Afterwards, running the script without much going on in the background didn't cause any issues. Starting a compilation job / some other CPU-stressing job was then pretty much immediately followed by an exception (tried this 4-5 times, error followed within ~30s after stressing the CPU).
This is purely based on my observations and there's not really much to support it but my experience with the crashes. Maybe someone else can reproduce the same findings, or it might as well just be by chance.
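For anyone trying to repeat this observation, CPU load can be generated on demand instead of relying on a compile job. The sketch below is one assumed way to do that with the standard library; stress_cpu is a made-up helper that starts busy-looping Python subprocesses and stops them after a fixed time. It only creates load and says nothing about whether load is actually the trigger.

```python
import subprocess
import sys
import time

def stress_cpu(workers, seconds):
    """Spin `workers` busy-loop subprocesses for `seconds`, then stop them.

    Returns the number of workers that were started.
    """
    procs = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(workers)
    ]
    try:
        time.sleep(seconds)
    finally:
        # Always clean up, even if interrupted with Ctrl+C.
        for proc in procs:
            proc.terminate()
        for proc in procs:
            proc.wait()
    return len(procs)

# Example: run in a second terminal while the training script is running,
# e.g. stress_cpu(workers=16, seconds=60) on a 16-thread CPU.
```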
Hi @sebpuetz , thanks for the updated description. Can you provide the specs on your local system?
CPU: Ryzen 2700x
MB: ASRock X470 Gaming K4
RAM: Corsair CMK16GX4M2B3000C15 (2x8GB) and CMK32GX4M2B3000C15 (2x16GB)
GPU: Radeon VII
Should be all that's relevant?
Let me try to reproduce your observations locally.
@sebpuetz, I've tried to repro your issue on my local dev node (TR1950x + RadeonVII) by running your sample while compiling the HCC compiler with 32 threads concurrently, three times, and was not able to repro the failure. I can confirm the CPU utilization was at 100% for all 32 threads on the system while running the tests. By default, TensorFlow creates its thread pool and maps it to all the visible CPU resources. Could you try to limit the TF CPU usage with the following patch? I hope that can help improve your system stability:
diff --git a/test.py b/test.py
index 5f36234..9455627 100644
--- a/test.py
+++ b/test.py
@@ -27,7 +27,7 @@ def build():
 if __name__ == "__main__":
-    with tf.Session() as sess:
+    with tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)) as sess:
         train, y = build()
         sess.run([tf.global_variables_initializer()])
         while True:
Finally got around to testing this. After roughly an hour of stressing the CPU and running the script, still no crash.
Thanks @sebpuetz , glad the proposed modification helps in your local environment and use cases. I'll close this issue for now, please let us know if you have further questions.
@sunway513 While this seems to be a valid workaround for the bug, there should be an underlying problem as both @pricebenjamin and you were able to (at least sometimes) reproduce it. This implies that it's not due to my local environment but rather an issue in rocm-tf. Is the official solution then to not use op-parallelism or is the bug still being looked into? Thanks for the help so far.
Hi @sebpuetz , I've not been able to reproduce the issue with the TF1.13+ROCm2.2 stack so far. We don't see an efficient way to triage an issue that cannot be reproduced reliably. Thanks for your understanding.
Besides, we are constantly improving the software stability by expanding our test coverage and consolidating the unit tests, hope that can also help avoid such issues for the long run.
Please feel free to reopen this issue if you have other concerns or there's a more reliable way to reproduce the problem.
Fwiw, inter_op_parallelism_threads seems to be the culprit. Setting that value to some large number makes it pretty easy to reproduce (without stressing the CPU) for me locally. Failure happened within the first 5000 steps several times.
The bug seems insensitive to intra_op_parallelism_threads.
> Besides, we are constantly improving the software stability by expanding our test coverage and consolidating the unit tests, hope that can also help avoid such issues for the long run.
Guess I (or anyone else encountering this bug) have to hope that someone accidentally stumbles over the cause.
@sebpuetz , let me experiment with inter_op_parallelism_threads settings against this issue. As I've mentioned, we need reliable steps to reproduce the issue before we can triage it further :-)
Thanks for reopening this. If I find time in the next few days, I'll also try to get some more insights. @twuebi owns the same GPU and ran into the same issues with ROCm 2.1 and tf 1.12 on entirely different hardware. I'll ask him whether he can reproduce the issues on ROCm 2.2 and tf 1.13.
Thank you @sebpuetz .
As an update, I've set inter_op_parallelism_threads to 1024; the sample has been running for over 4 hours and is still going. I'll keep it running overnight.
What was the "large number" you set to get the sample to crash reliably within 5000 steps?
I set that number to 128, which is quite a lot more than the threads available on my machine. Running this last night threw exceptions quite reliably for me (5-6 runs with inter_op_parallelism_threads = 128 and intra_op_parallelism_threads = 1). I also tested the opposite settings with inter_op_parallelism_threads = 1 and intra_op_parallelism_threads = 128 but couldn't observe any failures. I hope that wasn't just coincidence. @twuebi said he would chime in tomorrow with results from his machine.
Thanks again.
Great if we can have more data points :-) Thanks!
Hi,
I can reproduce the issue on Mint Linux (kernel 5.0.0-rc6) + tf-1.13.1 + rocm 2.2.31 with this file.
I ran four instances of the program in parallel, three crashed by now after 13.2k, 18.9k and 28.6k steps, the fourth is as of now at step 34.1k and still running.
The following files contain the respective crash messages:
core_dumped.txt traceback_1.txt traceback_2.txt
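Running several instances side by side and noting the step at which each one stops can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `last_step` and `run_instances` are hypothetical, and it assumes the reproduction script prints its step counter on a line of its own, as the scripts in this thread do.

```python
import subprocess
import sys


def last_step(output):
    """Return the last line of output that is a bare integer, or None."""
    steps = [int(line) for line in output.splitlines() if line.strip().isdecimal()]
    return steps[-1] if steps else None


def run_instances(cmd, n, timeout=None):
    """Start n copies of cmd at once; return (exit_code, last_step) per copy.

    The timeout is applied to each instance in turn, so it is only a rough
    bound on the total run time. Instances that exceed it are killed.
    """
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        for _ in range(n)
    ]
    results = []
    for proc in procs:
        try:
            out, _ = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            out, _ = proc.communicate()
        results.append((proc.returncode, last_step(out)))
    return results
```

A call such as `run_instances([sys.executable, "bug.py"], 4, timeout=3600)` would then correspond to the four parallel runs described above, with `bug.py` standing in for whichever reproduction script is used.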
To update from my side, the sample is still running fine on my local setup with inter_op_parallelism_threads=1024; it has been running for ~24 hours now.
Thank you @twuebi , your data point is very helpful. Looks like your script is slightly different than the one posted in the original description. Let me try yours and see how the result goes. One other difference is the system configuration. You are using the upstream kernel (which might not contain the up-to-date firmware for RadeonVII), while I'm using the 4.15 kernel + ROCm2.2.31 rock-dkms.
I gave it another go on Ubuntu 18.04 with 4.15.18 kernel and ROCm 2.2.31 rock-dkms and got the shape error after 673 steps and in another run after 5221 steps using the script @twuebi provided.
@sebpuetz @twuebi could you try setting the following environment variable and see if the issue can still be reproduced on your end?
export HIP_LAUNCH_BLOCKING=1
Related doc in HIP:
https://github.com/ROCm-Developer-Tools/HIP/blob/cde661142552c9f81d6ddb6fdff5ec87ad4dc9e3/docs/markdown/hip_debugging.md#chicken-bits
Currently running two instances in parallel, both are still going at more than 14000 steps. Before testing with the environment variable, two instances crashed within the first 2000 steps.
I used Ubuntu 18.04 with kernel 4.15 and ROCm 2.2 for this.
Thanks @sebpuetz, please feel free to post any further updates. Do you see any visible performance degradation?
Judging by steps per time, the flag makes the script run at about half speed.
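To compare runs with and without the flag, the variable can be set for a single child process instead of the whole shell. This is a small sketch under that assumption; `blocking_env` and `run_with_blocking` are hypothetical helper names, and the only thing taken from the thread is the variable name HIP_LAUNCH_BLOCKING and its value 1.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with HIP_LAUNCH_BLOCKING=1 added."""
    env = dict(os.environ if base is None else base)
    env["HIP_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(cmd):
    """Run cmd with the flag set for that process only; return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example (not executed here): run_with_blocking([sys.executable, "bug.py"])
```

Because the parent environment is left untouched, a second instance started without the helper serves as the unflagged control run.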
Once again a quick update about ROCm 2.3 + tf 1.13.2: the bug did not disappear.
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data:/data'
drun rocm/tensorflow:rocm2.3-tf1.13-python3
cd /data/bug
python3 bug.py
Failed in two instances after 626 and 3939 steps. stacktrace1.txt stacktrace2.txt
I tried to get some more information by running the instructions over at https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/302#issuecomment-458706456:
export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3
export HIP_TRACE_API=0x2
Three instances didn't crash within the first 20000 steps. I started another instance without those flags which again crashed within 700 steps.
I also tested whether I can simplify the script further and still observe crashes, the following script crashed twice after ~8000 steps while the second script didn't fail in two runs where each of them ran for about 30000 steps.
# Crashed after about 8000 steps.
import tensorflow as tf
def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)
def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.range(1, limit=200 - i + 1))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1])  # Error seems to happen in gradients for this op
    post_pad = tf.zeros([50, 19900 - tf.shape(cur)[1], 2])
    cur = tf.concat([post_pad, cur], axis=1)
    return i + 1, tf.add(l, cur), x
def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 19900, 2])
    loop_vars = [tf.constant(1), logits, x]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=2000)[1]
    train = tf.train.GradientDescentOptimizer(0.005).minimize(logits)
    return train
if __name__ == "__main__":
    config = tf.ConfigProto(inter_op_parallelism_threads=200, intra_op_parallelism_threads=1)
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        step = 0
        train = build()
        sess.run([tf.global_variables_initializer()])
        try:
            while True:
                step += 1
                _ = sess.run([train])
                if step % 100 == 0:
                    print(step)
        except Exception as exc:
            print(exc)
            print(step)
Did not fail:
# No crashes observed after more than 30000 steps
import tensorflow as tf
def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)
def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.range(1, limit=200))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1])  # Error seems to happen in gradients for this op
    post_pad = tf.zeros([50, 19900 - tf.shape(cur)[1], 2])
    cur = tf.concat([post_pad, cur], axis=1)
    return i + 1, tf.add(l, cur), x
def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 19900, 2])
    loop_vars = [tf.constant(1), logits, x]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, parallel_iterations=2000)[1]
    train = tf.train.GradientDescentOptimizer(0.005).minimize(logits)
    return train
if __name__ == "__main__":
    config = tf.ConfigProto(inter_op_parallelism_threads=200, intra_op_parallelism_threads=1)
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        step = 0
        train = build()
        sess.run([tf.global_variables_initializer()])
        try:
            while True:
                step += 1
                _ = sess.run([train])
                if step % 100 == 0:
                    print(step)
        except Exception as exc:
            print(exc)
            print(step)
Update:
I also ran into the issue with the docker provided in #316 while running a graph containing this subgraph.
I got a possibly related error within the #316 docker when running a similar graph with tf.where + tf.gather_nd and no while loop:
thread 'main' panicked at 'Cannot run graph: {inner:0x55a7903c1620, InvalidArgument: WhereOp: Race condition between counting the number of true elements and writing them. When counting, saw 27308456 elements; but when writing their indices, saw 223737 elements. [[{{node model/clause/Where}}]] [[train/gradients/model/clause/clause_loss_grad/floordiv_1/_363]]}', src/libcore/result.rs:1009:5
I attempted to replicate the bug with a graph along these lines but I couldn't observe another instance.
hidden = tf.get_variable(shape=[50,55,100],dtype=tf.float32,name="var")
dot = tf.matmul(hidden, hidden, transpose_b=True)
tril = tf.linalg.LinearOperatorLowerTriangular(tf.ones(tf.shape(dot)[-2:])).to_dense()
tril = tf.cast(tril, tf.bool)
tril = tf.logical_not(tril)
s = tf.where(tf.tile(tf.expand_dims(tril,0), [tf.shape(r)[0], 1, 1]))
gather = tf.gather_nd(r, s)
train = tf.train.GradientDescentOptimizer(0.005).minimize(gather)
This is a picture of the subgraph containing the failing tf.where, the input to the failed op is again a tile operation.
Memory access fault by GPU node-1 (Agent handle: 0x558d2db05d80) on address 0x7f0bbda28000. Reason: Page not present or supervisor privilege.
Update 2:
Following the temperature-related findings of #414, I attempted to train another instance of a while-loopless network from Rust with fixed fan speeds (75%, ~50-60°C). It crashed after 15 training epochs with a shape-related error:
thread 'main' panicked at 'Cannot run graph: {inner:0x556ae3111200, InvalidArgument: Input to reshape is a tensor with 7552 values, but the requested shape has 0 [[{{node train/gradients/model/base_model/layer_3/self_attn/LayerNorm/moments/mean_grad/Reshape}}]] [[{{node train/grad_norm_op}}]]}', src/libcore/result.rs:997:5
I am now looking into reproducing the bug with this script.
Edit: an important point I forgot to mention: I did not encounter this issue with the CUDA backend.
System information
You can collect some of this information using our environment capture script You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the current behavior
After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad method, which interleaves the multiples and the shapes of the original tile op by stacking, transposing and reshaping. From the behaviour that I could see by printing the input tensors and intermediate steps in _TileGrad, it seems that something goes wrong during the interleaving. The interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.
The output of the transpose at one of these exceptions was:
resulting in the following interleaved shape:
[1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
I wasn't able to find the related stack output or input shapes, so I can't tell if the shape error is caused by something further upstream. My reply to this issue includes an example with parallel_iterations=1, including all the steps.
A full stacktrace can be found at the bottom of this issue.
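For reference, the interleaving that the stack, transpose and reshape steps are meant to produce can be written out in plain Python. This is only an illustrative sketch, not TensorFlow's implementation; the function names `interleave_shape` and `check_split_shape` are hypothetical, and the second helper merely shows why the nonsense shapes quoted above cannot be valid reshape targets.

```python
def interleave_shape(multiples, input_shape):
    """Interleave multiples and input_shape as [m0, s0, m1, s1, ...]."""
    if len(multiples) != len(input_shape):
        raise ValueError("multiples and input_shape must have the same rank")
    split_shape = []
    for m, s in zip(multiples, input_shape):
        split_shape.extend([m, s])
    return split_shape


def check_split_shape(split_shape, grad_size):
    """Return True if every dim is non-negative and the dims multiply to grad_size."""
    if any(dim < 0 for dim in split_shape):
        return False
    total = 1
    for dim in split_shape:
        total *= dim
    return total == grad_size


# Expected case from the description: multiples [50, 1, 1], input shape [1, 21, 25].
print(interleave_shape([50, 1, 1], [1, 21, 25]))  # [50, 1, 1, 21, 1, 25]
# One of the corrupted shapes fails the check for a gradient of 50 * 21 * 25 elements.
print(check_split_shape([949434578, -1198049073, 1, 16, 1, 25], 50 * 21 * 25))  # False
```

The reduction axes are then the even positions of the interleaved shape (0, 2, 4, ...), which is why a single corrupted entry is enough to make the subsequent reshape fail.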
The error is somewhat hard to reproduce and seems to happen at random. I don't believe it is directly related to tf.while_loop as the exception never occurred in an RNN layer.
Describe the expected behavior
No InvalidArgumentError during gradient calculation.
Code to reproduce the issue
I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it's not reliably reproducible I can't narrow it down easily.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.