Hi ~, I have run your code on my computer with the default commands mentioned in README.md. While tracking the calculation of the GP loss, I found something a bit confusing:
def wasserstein_penalty(discriminator, A_true, A_fake, params,
                        discriminator_params):
    A_interp = sample_along_line(A_true, A_fake, params)
    if params.use_embeddings:
        A_interp = softmax_to_embedding(A_interp, params)
    discrim_A_interp = discriminator(A_interp, discriminator_params, params)
    discrim_A_grads = tf.gradients(discrim_A_interp, [A_interp])
    if params.original_l2:
        l2_loss = tf.sqrt(
            tf.reduce_sum(
                tf.convert_to_tensor(discrim_A_grads)**2, axis=[1, 2]))
        if params.true_lipschitz:
            loss = params.wasserstein_loss * tf.reduce_mean(
                tf.nn.relu(l2_loss - 1)**2)
        else:
            loss = params.wasserstein_loss * tf.reduce_mean((l2_loss - 1)**2)
    else:
        loss = params.wasserstein_loss * (tf.nn.l2_loss(discrim_A_grads) - 1)**2
    return loss
When A_interp has shape [64, 100, 256], which can be annotated as [batch_size, seq_len, input_dim], and discrim_A_interp has shape [64, 2, 1], then tf.convert_to_tensor(discrim_A_grads) has shape [1, 64, 100, 256] (tf.gradients returns a list, so conversion adds a leading axis of size 1). But reduce_sum is applied along axis=[1, 2] instead of axis=[2, 3] — shouldn't the per-sample gradient norm sum over the seq_len and input_dim axes?
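To make the shape issue concrete, here is a small NumPy sketch (my own illustration, not code from the repo) of what summing over each pair of axes produces for a gradient tensor shaped like the one above:

```python
import numpy as np

# Shapes taken from the question: tf.gradients returns a list, so
# tf.convert_to_tensor adds a leading axis of size 1.
batch_size, seq_len, input_dim = 64, 100, 256
grads = np.random.randn(1, batch_size, seq_len, input_dim)

# Summing over axis=(1, 2) collapses the batch and seq_len axes together:
# the result has shape (1, input_dim) -- one value per input dimension,
# mixing gradients across all samples in the batch.
wrong = np.sqrt((grads ** 2).sum(axis=(1, 2)))
print(wrong.shape)  # (1, 256)

# Summing over axis=(2, 3) collapses seq_len and input_dim, leaving one
# L2 norm per sample, which is what the gradient penalty needs.
right = np.sqrt((grads ** 2).sum(axis=(2, 3)))
print(right.shape)  # (1, 64)
```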
Thanks for pointing that out! I've pushed a fix for the bug. I don't expect it to throw off results, but do let us know if any hyperparameters need retuning to compensate for the change in scale.
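For reference, the corrected penalty reduces to a per-sample norm followed by the usual (norm − 1)² mean. A minimal NumPy sketch, assuming the gradient tensor shape from the question (this is my illustration of the fix, not the repository's actual patch):

```python
import numpy as np

def gradient_penalty(grads, weight=10.0):
    # grads: gradient tensor of shape [1, batch_size, seq_len, input_dim],
    # as produced by tf.convert_to_tensor(tf.gradients(...)) in the question.
    # Per-sample L2 norm: sum over seq_len and input_dim (axes 2 and 3).
    l2 = np.sqrt((grads ** 2).sum(axis=(2, 3)))   # shape (1, batch_size)
    # Standard WGAN-GP penalty: weighted mean squared distance from norm 1.
    return weight * np.mean((l2 - 1.0) ** 2)

# Sanity check with a constant tensor: each per-sample norm is sqrt(3 * 2).
grads = np.ones((1, 4, 3, 2))
penalty = gradient_penalty(grads, weight=1.0)  # (sqrt(6) - 1) ** 2
```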