Hi ~, I have run your code on my computer with the default commands mentioned in README.md. While tracking the calculation of the GP loss, I found something a bit confusing:
def wasserstein_penalty(discriminator, A_true, A_fake, params,
                        discriminator_params):
    A_interp = sample_along_line(A_true, A_fake, params)
    if params.use_embeddings:
        A_interp = softmax_to_embedding(A_interp, params)
    discrim_A_interp = discriminator(A_interp, discriminator_params, params)
    discrim_A_grads = tf.gradients(discrim_A_interp, [A_interp])
    if params.original_l2:
        l2_loss = tf.sqrt(
            tf.reduce_sum(
                tf.convert_to_tensor(discrim_A_grads)**2, axis=[1, 2]))
        if params.true_lipschitz:
            loss = params.wasserstein_loss * tf.reduce_mean(
                tf.nn.relu(l2_loss - 1)**2)
        else:
            loss = params.wasserstein_loss * tf.reduce_mean((l2_loss - 1)**2)
    else:
        loss = params.wasserstein_loss * (tf.nn.l2_loss(discrim_A_grads) - 1)**2
    return loss
When A_interp has shape [64, 100, 256], which can be annotated as [batch_size, seq_len, input_dim], and discrim_A_interp has shape [64, 2, 1], then tf.convert_to_tensor(discrim_A_grads) has shape [1, 64, 100, 256] (tf.gradients returns a list, so conversion adds a leading axis of size 1). But reduce_sum is applied along axis=[1, 2] instead of axis=[2, 3] — shouldn't the per-sample gradient norm sum over the seq_len and input_dim axes?
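To make the shape issue concrete, here is a small NumPy sketch (my own illustration, not code from the repo) of what summing over each pair of axes produces for a gradient tensor shaped like the one above:

```python
import numpy as np

# Shapes taken from the question: tf.gradients returns a list, so
# tf.convert_to_tensor adds a leading axis of size 1.
batch_size, seq_len, input_dim = 64, 100, 256
grads = np.random.randn(1, batch_size, seq_len, input_dim)

# Summing over axis=(1, 2) collapses the batch and seq_len axes together:
# the result has shape (1, input_dim) -- one value per input dimension,
# mixing gradients across all samples in the batch.
wrong = np.sqrt((grads ** 2).sum(axis=(1, 2)))
print(wrong.shape)  # (1, 256)

# Summing over axis=(2, 3) collapses seq_len and input_dim, leaving one
# L2 norm per sample, which is what the gradient penalty needs.
right = np.sqrt((grads ** 2).sum(axis=(2, 3)))
print(right.shape)  # (1, 64)
```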
Thanks for pointing that out! I've pushed a fix for the bug. I don't expect it to throw off results, but do let us know if any hyperparameters need retuning to compensate for the change in scale.
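For reference, the corrected penalty reduces to a per-sample norm followed by the usual (norm − 1)² mean. A minimal NumPy sketch, assuming the gradient tensor shape from the question (this is my illustration of the fix, not the repository's actual patch):

```python
import numpy as np

def gradient_penalty(grads, weight=10.0):
    # grads: gradient tensor of shape [1, batch_size, seq_len, input_dim],
    # as produced by tf.convert_to_tensor(tf.gradients(...)) in the question.
    # Per-sample L2 norm: sum over seq_len and input_dim (axes 2 and 3).
    l2 = np.sqrt((grads ** 2).sum(axis=(2, 3)))   # shape (1, batch_size)
    # Standard WGAN-GP penalty: weighted mean squared distance from norm 1.
    return weight * np.mean((l2 - 1.0) ** 2)

# Sanity check with a constant tensor: each per-sample norm is sqrt(3 * 2).
grads = np.ones((1, 4, 3, 2))
penalty = gradient_penalty(grads, weight=1.0)  # (sqrt(6) - 1) ** 2
```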