Open fabrahman opened 4 years ago
To "produce a tensor with shape [bs, sl]" from `logits` and `sample_id`, you may use `sequence_sparse_softmax_cross_entropy` and set `average_across_batch=False, average_across_timesteps=False, sum_over_batch=False, sum_over_timesteps=False`.
--
Another way of doing RL is to use `SeqPGAgent`; see examples/seq2seq_rl. Or refer to examples/seqgan to write your own.
Thank you @ZhitingHu. In this regard, I just want to double-check whether the following code is the right approach: first get the logprob tensor ([bs, sl]), then mask out the prefix and the padded indices (indices beyond sample_length) to get a batch_loss of shape (bs,):
sample_output, sample_len = decoder(
    decoding_strategy='infer_sample',
    embedding=_embedding_fn,
    context=context_ids,
    context_sequence_length=context_len,
    max_decoding_length=max_decoding_length,
    end_token=end_token)
ids = sample_output.sample_id
logits = sample_output.logits
max_full_len = tf.reduce_max(sample_len)
sampleLogprobs = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=ids[:, 1:],
    logits=logits,
    sequence_length=sample_len - 1,  ## question: I am assuming this should mask the right-paddings of sample, right?
    average_across_timesteps=False,
    sum_over_timesteps=False,
    average_across_batch=False,
    sum_over_batch=False)
mask = tf.sequence_mask(
    sample_len - 1,
    dtype=tf.float32)
mask_prefix = 1 - tf.sequence_mask(
    context_len - 1,
    maxlen=max_full_len - 1,  # max_decoding_length-1,
    dtype=tf.float32)
mask = mask * mask_prefix
batch_loss = tx.utils.reduce_with_weights(
    tensor=sampleLogprobs,
    weights=mask,
    average_across_batch=False,
    average_across_remaining=True,
    sum_over_remaining=False)
So my questions are:
1- Is this the right way to mask both the prefix and the indices beyond sample_length?
2- I should pass sample_length to the 'sequence_length' argument of `sequence_sparse_softmax_cross_entropy`, right?
I would appreciate it if you could let me know whether there is any mistake in this code.
Thank you so much in advance.
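For readers following along, the masking arithmetic in the snippet above can be checked without TensorFlow. The following is a minimal plain-Python sketch, not Texar code: the helper names (`sequence_mask`, `prefix_and_padding_mask`, `masked_mean`) are hypothetical, and it assumes that the intended average is taken over the unmasked positions only.

```python
def sequence_mask(lengths, maxlen):
    """Row i has 1.0 in the first lengths[i] positions and 0.0 elsewhere."""
    return [[1.0 if t < n else 0.0 for t in range(maxlen)] for n in lengths]


def prefix_and_padding_mask(sample_len, context_len):
    """Mask over the shifted targets ids[:, 1:]; 1.0 only on generated tokens."""
    maxlen = max(sample_len) - 1
    # Positions that are real tokens (not right-padding) after the shift.
    valid = sequence_mask([n - 1 for n in sample_len], maxlen)
    # Positions that still belong to the context (prefix) after the shift.
    prefix = sequence_mask([c - 1 for c in context_len], maxlen)
    return [[v * (1.0 - p) for v, p in zip(vrow, prow)]
            for vrow, prow in zip(valid, prefix)]


def masked_mean(values, mask):
    """Per-example average of values over positions where mask is 1.0."""
    out = []
    for vrow, mrow in zip(values, mask):
        total = sum(v * m for v, m in zip(vrow, mrow))
        count = sum(mrow)
        out.append(total / count if count else 0.0)
    return out


# Toy batch: full lengths (context + sample) and context lengths.
sample_len = [6, 4]
context_len = [3, 2]
mask = prefix_and_padding_mask(sample_len, context_len)

# Made-up per-token losses with shape [bs, sl - 1].
token_loss = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]]
batch_loss = masked_mean(token_loss, mask)
print(mask)
print(batch_loss)
```

In the first row the context covers three tokens, so the first two shifted positions are masked and the remaining three count toward the average; in the second row the last two positions are padding.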
The code looks good. A reference code here (which is basically the same as what you wrote): https://github.com/asyml/texar/issues/147#issuecomment-489442414
2- It's not really necessary, because you apply the mask with `reduce_with_weights` anyway.
@ZhitingHu Actually, I am getting an OOM error when I add this RL loss, computed the way I showed earlier, to the MLE loss.
The MLE loss works fine, and the parts of the code that generate text (both sampled and greedy, for self-critical RL) work fine: the texts are generated and I can pass them to my classifier and get the reward. However, when I fetch the loss optimization, it throws the following error. This does not happen when I have multiple MLE losses like here. It is so weird, since for computing the RL loss I am using the same `sequence_sparse_softmax_cross_entropy` call. Can you help me with that?
I attached part of my code and the error log here:
NOTE that I have a 1080Ti GPU and tried both batch size 2 and 1.
# For RL fine-tuning
def _get_sample_text(context_ids, context_len):
    sample_output, sample_len = decoder(
        decoding_strategy='infer_sample',
        embedding=_embedding_fn,
        context=context_ids,
        context_sequence_length=context_len,
        max_decoding_length=max_decoding_length,
        end_token=end_token)
    return sample_output, sample_len

def _get_sample_rolled(output, length, context_len):
    ids = output.sample_id
    ids = tx.utils.varlength_roll(ids, -context_len)  # final sample ids rolled
    ids_len = length - context_len
    ids = ids[:, :tf.reduce_max(ids_len)]
    return ids, ids_len

def _get_greedy_text(context_ids, context_len):
    greedy_res, greedy_len = decoder(
        decoding_strategy='infer_greedy',
        embedding=_embedding_fn,
        context=context_ids,
        context_sequence_length=context_len,
        max_decoding_length=max_decoding_length,
        end_token=end_token)
    greedy_ids = tx.utils.varlength_roll(greedy_res.sample_id, -context_len)
    greedy_ids_len = greedy_len - context_len
    greedy_ids = greedy_ids[:, :tf.reduce_max(greedy_ids_len)]
    return greedy_ids, greedy_ids_len

def compute_batch_loss(output, sample_len, context_len):
    max_full_len = tf.reduce_max(sample_len)
    ids = output.sample_id[:, :max_full_len]
    logits = output.logits[:, :max_full_len]  # (bs, sl, vocab)
    sampleLogprobs = tx.losses.sequence_sparse_softmax_cross_entropy(
        labels=ids[:, 1:],
        logits=logits[:, :-1, :],
        sequence_length=sample_len - 1,
        average_across_timesteps=False,
        sum_over_timesteps=False,
        average_across_batch=False,
        sum_over_batch=False)
    mask = tf.sequence_mask(
        sample_len - 1,
        dtype=tf.float32)
    mask_prefix = 1 - tf.sequence_mask(
        context_len - 1,
        maxlen=max_full_len - 1,  # max_decoding_length-1,
        dtype=tf.float32)
    mask = mask * mask_prefix
    batch_loss = tx.utils.reduce_with_weights(
        tensor=sampleLogprobs,
        weights=mask,
        average_across_batch=False,
        average_across_remaining=True,
        sum_over_remaining=False)
    return batch_loss

## Loss MLE
x1_len = tf.placeholder(tf.int32, shape=[None], name='x1_len')
x1x4_ids = tf.placeholder(tf.int32, shape=[None, None], name='x1x4_ids')
x1x4_len = tf.placeholder(tf.int32, shape=[None], name='x1x4_len')
loss_mle = _get_recon_loss(x1x4_ids, x1x4_len, x1_len)  # similar to the repo I mentioned

## Loss RL
x1_ids = tf.placeholder(tf.int32, shape=[None, None], name='x1_ids')
reward = tf.placeholder_with_default(tf.ones([batch_size]), shape=(config_train.train_batch_size,), name="reward")
symbols_output, symbols_len = _get_sample_text(x1_ids, x1_len)  # this works fine and I can run
symbols_rl, len_rl = _get_sample_rolled(symbols_output, symbols_len, x1_len)  # this works fine
symbols_gr, len_gr = _get_greedy_text(x1_ids, x1_len)  # this works fine
batch_loss_rl = compute_batch_loss(symbols_output, symbols_len, x1_len)  # I think adding this to my loss causes the problem, but I am not sure exactly
rl_loss = tf.reduce_mean(batch_loss_rl * reward)
loss = (1 - config_train.w_rl) * loss_mle + config_train.w_rl * rl_loss
error log:
sys.exit(main(argv))
File "roc_rl_main_refacored.py", line 1001, in main
_train_epoch(sess, epoch==0)
File "roc_rl_main_refacored.py", line 724, in _train_epoch
rets = sess.run(fetches, feed_dict, options=run_opts)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 95
0, in run
run_metadata_ptr)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 11
73, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 13
50, in _do_run
run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 13
70, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/tas
k:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/value/Ten
sordot_1/MatMul_1}}]]
Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
196.32MiB from transpose
196.32MiB from OptimizeLoss/gradients/transformer_decoder_1/MatMul_grad/MatMul_1
31.65MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_19/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/Mat
Mul_1
30.59MiB from swap_in_transformer_decoder_1/layer_23/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
30.27MiB from swap_in_transformer_decoder_1/layer_17/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
28.67MiB from swap_in_transformer_decoder_1/layer_19/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
26.20MiB from swap_in_transformer_decoder_1/layer_16/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
24.44MiB from swap_in_transformer_decoder_1/layer_16/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
24.00MiB from swap_in_transformer_decoder_1/layer_14/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
24.00MiB from swap_in_transformer_decoder_1/layer_14/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
24.00MiB from swap_in_transformer_decoder_1/layer_15/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
24.00MiB from swap_in_transformer_decoder_1/layer_15/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
23.38MiB from swap_in_transformer_decoder_1/layer_13/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
22.34MiB from swap_in_transformer_decoder_1/layer_13/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
20.75MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_12/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/Mat
Mul_1
20.00MiB from swap_in_transformer_decoder_1/layer_20/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
17.51MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_4/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatM
ul_1
I can't see the cause here. What's in the `fetches` here?
File "roc_rl_main_refacored.py", line 724, in _train_epoch
rets = sess.run(fetches, feed_dict, options=run_opts)
If optimization (e.g., `train_op`) is included: would OOM still happen if you exclude `train_op` from `fetches`? This is to see if it's the loss computation that caused the OOM. Similarly, would you try omitting `loss_mle` altogether and see if it's still OOM?
Actually, the code works fine if I have only loss_mle; it works even when I have multiple loss_mle terms (which means `sequence_sparse_softmax_cross_entropy` is called several times, once for each loss_mle, similar to this code). However, once I add rl_loss to the train_op, it gives the OOM error. So the question is what I am doing wrong with rl_loss, since it is basically a call to the `sequence_sparse_softmax_cross_entropy` function (more details about how I compute rl_loss are in my previous post).
Here is the fetches:
loss = (1 - config_train.w_rl) * loss_mle + config_train.w_rl * rl_loss
train_op = tf.contrib.layers.optimize_loss(
    loss=loss,
    global_step=global_step,
    learning_rate=None,
    optimizer=opt,
    variables=trainable_variables)

while training:
    reward_fetches = {
        'sample_rl': symbols_rl,
        'sample_len': len_rl,
        'greedy_sym': symbols_gr,
        'greedy_len': len_gr
    }
    reward_rets = sess.run(reward_fetches, feed_dict={
        x1_ids: rets_data['batch']['x1_ids'], x1_len: rets_data['batch']['x1_len']
    })
    # prepare sample for classification
    sample_rl = format_generated_samples_for_clf(proc, reward_rets['sample_rl'], reward_rets['sample_len'])
    sample_base = format_generated_samples_for_clf(proc, reward_rets['greedy_sym'], reward_rets['greedy_len'])
    # add reward calculation here
    reward_rl = get_reward(rets_data['batch']['x4_emo'], sample_rl)
    reward_base = get_reward(rets_data['batch']['x4_emo'], sample_base)
    # self-critical reward
    reward_sc = [rr - rb for rr, rb in zip(reward_rl, reward_base)]
    print(reward_rl, reward_base, reward_sc)  # just to see if reward is being computed correctly.
    # (2) Optimize loss
    feed_dict = {
        x1_ids: rets_data['batch']['x1_ids'],
        x1_len: rets_data['batch']['x1_len'],
        x1x4_ids: rets_data['batch']['x1x4_ids'],
        x1x4_len: rets_data['batch']['x1x4_len'],
        tau: config_train.tau,
        tx.global_mode(): tf.estimator.ModeKeys.TRAIN,
        reward: reward_sc
    }
    fetches = {
        'train_op': train_op,
        'step': global_step,
    }
    fetches.update(loss_dict)
    rets = sess.run(fetches, feed_dict, options=run_opts)
    step = rets['step']
Running `train_op` (in `fetches`) will consume GPU memory for gradient tensors. A quick test is to remove `train_op` from `fetches` and see if the OOM is gone. If so, the OOM is probably because `rl_loss` results in more gradient tensors when running `train_op`. I personally usually use `tf.stop_gradient` to locate the back-propagation path(s) that lead to these extra gradient tensors.
@ZhitingHu Here is the error when I remove train_op from fetches.
But I am not quite sure why we want to do that, since when loss = loss_mle and I pass this to `train_op` and then run `fetches` with this train_op, everything is okay. Plus, what is the program optimizing without any `train_op`?
Error log when removing `train_op` from `fetches`:
Traceback (most recent call last):
File "roc_rl_main_refacored.py", line 1012, in <module>
tf.app.run()
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, i
n run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "roc_rl_main_refacored.py", line 1001, in main
_train_epoch(sess, epoch==0)
File "roc_rl_main_refacored.py", line 724, in _train_epoch
rets = sess.run(fetches, feed_dict, options=run_opts)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950
, in run
run_metadata_ptr)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 117
3, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 135
0, in _do_run
run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 137
0, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [] [Condition x == y did not hold element-wise:] [x (sequence_sparse_softmax_cro
ss_entropy_1/SparseSoftmaxCrossEntropyWithLogits/Shape_1:0) = ] [2 200] [y (sequence_sparse_softmax_cross_entropy_1/Sparse
SoftmaxCrossEntropyWithLogits/strided_slice:0) = ] [2 199]
[[node sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/assert_equal/Assert/Assert (de
fined at /home/hannah/Counterfactual-StoryRW/third_party/texar/texar/losses/mle_losses.py:196) ]]
[[mul_9/_5791]]
(1) Invalid argument: assertion failed: [] [Condition x == y did not hold element-wise:] [x (sequence_sparse_softmax_cro
ss_entropy_1/SparseSoftmaxCrossEntropyWithLogits/Shape_1:0) = ] [2 200] [y (sequence_sparse_softmax_cross_entropy_1/Sparse
SoftmaxCrossEntropyWithLogits/strided_slice:0) = ] [2 199]
[[node sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/assert_equal/Assert/Assert (de
fined at /home/hannah/Counterfactual-StoryRW/third_party/texar/texar/losses/mle_losses.py:196) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/assert_equal/Assert/Assert':
File "roc_rl_main_refacored.py", line 1012, in <module>
tf.app.run()
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, i
n run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "roc_rl_main_refacored.py", line 416, in main
batch_loss_rl = compute_batch_loss(symbols_output, symbols_len, x1_len)
File "roc_rl_main_refacored.py", line 324, in compute_batch_loss
sum_over_batch=False)
File "/home/hannah/Counterfactual-StoryRW/third_party/texar/texar/losses/mle_losses.py", line 196, in sequence_sparse_so
ftmax_cross_entropy
labels=labels, logits=logits)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 3355, i
n sparse_softmax_cross_entropy_with_logits
array_ops.shape(logits)[:-1]))
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/check_ops.py", line 557,
in assert_equal
return control_flow_ops.Assert(condition, data, summarize=summarize)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/util/tf_should_use.py", line
193, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", li
ne 163, in Assert
return gen_logging_ops._assert(condition, data, summarize, name="Assert")
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", lin
e 74, in _assert
name=name)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py"
, line 788, in _apply_op_helper
op_def=op_def)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 5
07, in new_func
return func(*args, **kwargs)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616
, in create_op
op_def=op_def)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005
, in __init__
self._traceback = tf_stack.extract_stack()
Also, I got the exact same error as above when I tried using `rl_loss_fine = tf.stop_gradient(rl_loss_fine)`.
I am sorry for the inconvenience, but I have no idea what's happening or what I am doing wrong with rl_loss.
Removing `train_op` or using `tf.stop_gradient` is for debugging purposes: to locate which portion of the code causes the OOM. Once it's located and fixed, you do need to add `train_op` back for training.
Based on the error message after removing `train_op`, it looks like there is another bug related to `sequence_sparse_softmax_cross_entropy` in `compute_batch_loss`. It's necessary to fix this bug first.
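The assertion in the log compares the shape of the labels with the shape of the logits minus the vocabulary dimension, and the two differ by one time step ([2 200] vs [2 199]). A defensive check before the loss call can surface this earlier. The sketch below is plain Python on nested lists, with hypothetical helper names (`time_shape`, `align_for_next_token_loss`); it mirrors the `ids[:, 1:]` / `logits[:, :-1, :]` shift used in `compute_batch_loss` and is not the fix that was eventually applied in this thread.

```python
def time_shape(batch):
    """Return (batch_size, time_steps) of a nested list."""
    return (len(batch), len(batch[0]) if batch else 0)


def align_for_next_token_loss(ids, logits):
    """Shift so that logits at step t are scored against the token at t + 1.

    ids:    [batch, T] token ids
    logits: [batch, T, vocab] scores
    Returns labels [batch, T - 1] and logits [batch, T - 1, vocab].
    """
    if time_shape(ids) != time_shape(logits):
        # Fail early with a readable message instead of a graph-time assertion.
        raise ValueError(
            "ids %s and logits %s disagree before shifting"
            % (time_shape(ids), time_shape(logits)))
    labels = [row[1:] for row in ids]
    shifted_logits = [row[:-1] for row in logits]
    return labels, shifted_logits


# Toy inputs: batch of 2, 4 time steps, vocabulary of 3.
ids = [[11, 12, 13, 14], [21, 22, 23, 24]]
logits = [[[0.0] * 3 for _ in range(4)] for _ in range(2)]
labels, shifted_logits = align_for_next_token_loss(ids, logits)
print(time_shape(labels), time_shape(shifted_logits))
```

If `sample_id` and `logits` come back from the decoder with different time lengths, the check raises before any slicing, which makes the off-by-one visible at the point where it originates.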
@ZhitingHu I was able to fix that bug, and now removing `train_op` or using `tf.stop_gradient` works without error.
When I add `train_op` back, I get the following error. How do I find out which part is causing the OOM?
What I am doing is this: I trained a classifier beforehand and I am using it to compute rewards for my RL. The classifier is built on PyTorch. During my RL training, I call that pretrained classifier. I have two GPUs and I let the model use both. At first I thought that sharing a GPU between TensorFlow and PyTorch might cause the error, but then I forced my pretrained classifier to run on CPU and I still get the following error:
2019-12-06 18:35:42.580893: W tensorflow/core/common_runtime/bfc_allocator.cc:319] *****************************************
***********************************************************
2019-12-06 18:35:42.580930: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at gpu_swapping_kernels.cc:72
: Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0/dev
ice:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356,
in _do_call
return fn(*args)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341,
in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429,
in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0
/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/key/Tensordo
t_1/MatMul_1}}]]
Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
196.32MiB from transpose
196.32MiB from OptimizeLoss/gradients/transformer_decoder_1/MatMul_grad/MatMul_1
21.88MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_19/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul
_1
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_23/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul
_1
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_23/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul
_1
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul
_1
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul
_1
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul
_1
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul
16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_0/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
7.75MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/self_attention/multihead_attention/multihead_attention/
key/Tensordot/MatMul_grad/MatMul_1
7.49MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/self_attention/multihead_attention/multihead_attention/
query/Tensordot/MatMul_grad/MatMul_1
7.49MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/self_attention/multihead_attention/multihead_attention/
key/Tensordot/MatMul_grad/MatMul_1
6.60MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_14/self_attention/multihead_attention/multihead_attention/
key/Tensordot/MatMul_grad/MatMul_1
6.48MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_18/self_attention/multihead_attention/multihead_attention/
value/Tensordot/MatMul_grad/MatMul_1
6.48MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/
output/Tensordot/MatMul_grad/MatMul_1
6.40MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_19/self_attention/multihead_attention/multihead_attention/
query/Tensordot/MatMul_grad/MatMul_1
6.38MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/self_attention/multihead_attention/multihead_attention/
output/Tensordot/MatMul_grad/MatMul_1
6.36MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_7/self_attention/multihead_attention/multihead_attention/o
utput/Tensordot/MatMul_grad/MatMul_1
6.26MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_20/self_attention/multihead_attention/multihead_attention/
value/Tensordot/MatMul_g
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "roc_rl_main_refacored.py", line 1005, in <module>
tf.app.run()
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in
run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "roc_rl_main_refacored.py", line 994, in main
_train_epoch(sess, epoch==0)
File "roc_rl_main_refacored.py", line 715, in _train_epoch
rets = sess.run(fetches, feed_dict, options=run_opts)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950,
in run
run_metadata_ptr)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173,
in _run
feed_dict_tensor, options, run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350,
in _do_run
run_metadata)
File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370,
in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0
/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/key/Tensordo
t_1/MatMul_1}}]]
Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
196.32MiB from transpose
196.32MiB from OptimizeLoss/gradients/transformer_decoder_1/MatMul_grad/MatMul_1
Hmm... The OOM is caused by the optimization (backward pass). Gradients of `rl_loss_fine` and `loss_mle` should each consume the same amount of memory. To verify this: since you've tried `loss = loss_mle`, passed it to `train_op`, and it worked, does setting `loss = rl_loss_fine` work (i.e., no OOM)?
You may use `tf.device` to partition the model across different GPUs. E.g., place the forward pass on one GPU and `train_op` (backward pass) on the other.
Another effective way to reduce memory consumption is to use a smaller `max_seq_length`.
@ZhitingHu I really appreciate your help.
Yeah, that is a good test, and actually I tried with just `loss = rl_loss_fine` and it threw the same error. Note that I used `batch_size=1` for this test. On the other hand, the model with `loss = loss_mle` worked with `batch_size=2` as well. Is it really an OOM error?
How do I make sure that, during `infer_sample` and `infer_greedy` for RL, the model is reusing the parameters defined in the `train_greedy` decoder? Should I use a `tf.variable_scope(, reuse=True)` somehow? Do you think that might be the reason for the error?
I am trying seq_len reduction and device partitioning as well, but I want to make sure it is really OOM.
This is where I sample two outputs for RL, and this is where I compute rl_loss_fine. And this is where I compute the reward.
@ZhitingHu I changed `max_seq_len` from 200 to 128 and still get the same error for `rl_loss_fine`.
Technically, since both `loss_mle` and `rl_loss_fine` use a CE loss with respect to the same parameters, they should consume the same amount of memory in the backward pass, but these tests show that this is not the case.
Also, when I have multiple calls to `mle_loss` (I mean a weighted sum of `mle_loss` terms), it still works.
I just wanted to check whether a negative loss (which may happen when the reward of the greedy output (r_base) is greater than the reward of the sampled output (r_sample)) or a very small loss (most of the time the difference between these two rewards is very small, and multiplying it by log_prob results in small values) may cause problems in the backward pass?
Texar automatically reuses variables. No need to add things like `tf.variable_scope(, reuse=True)`.
FYI, here is an example code of using Texar for self-critic learning, where
* L.356 is calculating `(reward_sample - reward_greedy)`
* L.392 is calculating `log p_theta(sample)`
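To make the two quantities concrete, here is a minimal plain-Python sketch of how they combine into a self-critical loss. The function name `self_critical_loss` and the toy numbers are hypothetical; it assumes `sample_logprob` already holds one summed or averaged log-probability per sampled sequence, and it uses the usual sign convention in which the loss is the negative of advantage times log-probability (equivalently, advantage times cross-entropy, as in `rl_loss` earlier in this thread).

```python
def self_critical_loss(sample_logprob, reward_sample, reward_greedy):
    """Mean over the batch of -(r_sample - r_greedy) * log p(sample)."""
    terms = []
    for logp, r_s, r_g in zip(sample_logprob, reward_sample, reward_greedy):
        advantage = r_s - r_g  # the greedy reward acts as the baseline
        terms.append(-advantage * logp)
    return sum(terms) / len(terms)


# Toy batch of two sampled sequences (made-up values).
sample_logprob = [-2.0, -4.0]
reward_sample = [0.75, 0.25]
reward_greedy = [0.25, 0.75]
loss = self_critical_loss(sample_logprob, reward_sample, reward_greedy)
print(loss)
```

In this toy batch the second sample scores below its greedy baseline, so its term is negative and the batch loss ends up negative as well. With this kind of surrogate objective a negative value is expected and is not in itself a sign of a bug.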
Thank you so much @ZhitingHu. This was really helpful; I was able to figure out what I was doing wrong, and now the OOM error is gone.
Glad to hear that! :) Could you briefly explain the cause of OOM, for future reference? Thanks
Sure, what I was doing wrong was:
I was taking the `sample_id` and `logits` of the decoder in the `infer_sample` decoding strategy and passing them to `sequence_sparse_softmax_cross_entropy` to compute logp.
However, I should have fixed the sample_id (EOS stripped and padded to the same size), used it as input to the decoder in the `train_greedy` decoding strategy, and then used that output (sample_id, logits) to compute logp, similar to how I compute mle_loss.
I know it's somewhat unrelated to implementation details, but I wanted to ask: when we evaluate on a dev set periodically during training, is it more common to compute the reward on the greedy output or on the most probable beam-search output (of a specific width)? Or are both approaches common?
Thanks
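The "fix the sample_id" step described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names (`strip_after_eos`, `pad_batch`) and the token ids are hypothetical, and "EOS stripped" is interpreted here as keeping each sequence up to and including its first EOS token. The resulting padded ids and lengths are what would then be fed back to the decoder in `train_greedy` mode.

```python
def strip_after_eos(ids, eos_id):
    """Keep tokens up to and including the first eos_id (assumed convention)."""
    if eos_id in ids:
        return ids[:ids.index(eos_id) + 1]
    return list(ids)


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one; also return true lengths."""
    maxlen = max(len(s) for s in sequences)
    lengths = [len(s) for s in sequences]
    padded = [s + [pad_id] * (maxlen - len(s)) for s in sequences]
    return padded, lengths


# Toy sampled ids; 2 is the (assumed) EOS id and 0 the (assumed) pad id.
EOS_ID, PAD_ID = 2, 0
raw_samples = [[5, 6, 7, 2, 9], [8, 2, 4, 4, 4]]
cleaned = [strip_after_eos(s, EOS_ID) for s in raw_samples]
fixed_ids, fixed_len = pad_batch(cleaned, PAD_ID)
print(fixed_ids, fixed_len)
```

Because the cleaned ids are ordinary data by the time they are fed back in, the log-probability is computed through a single teacher-forced pass, the same way as for `mle_loss`.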
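Hi,
I was trying to write a function for computing the REINFORCE loss (as below) when I realized you have this here. In this regard, how can I use the TransformerDecoder with the 'infer_sample' decoding strategy as the sample_fn? In your reinforce_loss it is mentioned that sample_fn should return [ids, probabilities, sequence_length]. However, the TransformerDecoder returns logits instead of probabilities. Can you guide me on how to use 'TransformerDecoder' with your reinforce_loss function? It should be a lot cleaner than my approach.
The way I was doing it was to call TransformerDecoder with the 'infer_sample' decoding strategy and then take the log_softmax, like the following. However, I am stuck on some steps, like gathering the log-probabilities according to the sample_ids. I could do it with numpy but not with tensorflow. Also, I am not sure if this is the right approach to compute the loss, so if you could either guide me on how to use reinforce_loss with TransformerDecoder sampling or help me with my own script, that would be highly appreciated.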
with 'infer_sample' decoding strategy and then took the log_softmax. like following: However, I am stuck in some steps, like gathering the log_probabilities according to the sample_ids. I could do it with numpy but not tensorflow.Also I am not sure if this is the right approach to compute loss, so either you can guide me with how to use reinforce_loss and TransformerDecoder sampling or help me with my own script, that would be highly appreciated.