Closed matt-gardner closed 7 years ago
Do you have the traceback from the failed switch-in-loss-function run?
FWIW, testing whether the gradient is correct can be difficult and annoying. If you can't compute the gradient analytically, the best way to check it is with finite differences: define some small h > 0 (e.g. 1e-3), then compute df/dx ≈ (f(x+h) - f(x-h)) / (2h).
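That central-difference check is easy to sketch in numpy. This is just an illustration of the idea (the helper name and the example function are made up here, not from deep_qa):

```python
import numpy as np

def numeric_grad(f, x, h=1e-3):
    """Approximate df/dx at x by central differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        bump = np.zeros_like(x)
        bump.flat[i] = h
        grad.flat[i] = (f(x + bump) - f(x - bump)) / (2 * h)
    return grad

# Example: f(x) = sum(x**2) has analytic gradient 2*x.
x = np.array([1.0, -2.0, 0.5])
approx = numeric_grad(lambda v: np.sum(v ** 2), x)
exact = 2 * x
print(np.max(np.abs(approx - exact)))  # should be tiny (O(h**2))
```

If the finite-difference estimate and the gradient your framework computes disagree by much more than O(h**2), something is wrong with the analytic gradient.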
@matt-peters: you can see the test log here: https://travis-ci.org/allenai/deep_qa/jobs/231293375. Some poking around showed that this line was returning all Nones.
Also, @matt-peters, @DeNeutoy has done some more poking around with the loss function (which involves `switch`) outside of any Keras model, and it looks like TF computes the gradients through `switch` just fine. It's looking like it might just be something weird that Keras does with the loss function. Wish I understood it, though...
Strange indeed. Good to know.
Putting in @DeNeutoy's snippet:
```python
import tensorflow as tf
import numpy as np

from deep_qa.training.losses import ranking_loss

inputs = tf.placeholder(tf.float32, [10, 20])
targets = tf.placeholder(tf.float32, [10, 20])
variable = tf.get_variable("weight", [10, 20])
preds = inputs * variable
loss = ranking_loss(preds, targets)

optimiser = tf.train.AdamOptimizer(0.001)
grads = tf.gradients(loss, tf.trainable_variables())
train_op = optimiser.apply_gradients(zip(grads, tf.trainable_variables()))

session = tf.Session()
session.run(tf.global_variables_initializer())
for i in range(200):
    _, actual_loss, actual_grads = session.run(
        [train_op, loss, grads],
        feed_dict={inputs: np.random.random([10, 20]),
                   targets: np.random.random_integers(0, 1, [10, 20])})
    print(actual_grads)
    print(actual_loss)
```
I'm satisfied enough to close this now.
Maybe it's because `y_true` only showed up in the condition of the `switch`, not the resultant tensors? I know Keras has some constraints on how you use `y_true` in the loss function - other people that have used ranking losses will multiply by `y_true` just so that it shows up in the graph, even though they don't need it. Maybe that's what we were running into...
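For concreteness, here's a hypothetical numpy sketch of that trick (the function name, margin, and loss formulation are made up for illustration, not deep_qa's actual `ranking_loss`): a hinge-style ranking loss where `y_true` is multiplied into the score terms, so it necessarily appears in the computation graph rather than only inside a `switch` condition:

```python
import numpy as np

def ranking_loss_with_y_true(y_true, y_pred, margin=1.0):
    """Hinge-style ranking loss. Multiplying scores by y_true (and 1 - y_true)
    keeps y_true in the graph, rather than using it only as a condition."""
    # Score of the correct option (y_true is one-hot per row).
    correct = np.sum(y_pred * y_true, axis=-1)
    # Best score among the incorrect options.
    best_wrong = np.max(y_pred * (1 - y_true), axis=-1)
    return np.mean(np.maximum(0.0, margin + best_wrong - correct))

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.2, 0.9, 0.1]])
print(ranking_loss_with_y_true(y_true, y_pred))  # ~0.3 for this example
```

In a real Keras loss the same pattern would be written with backend ops, but the point is the same: the loss is an arithmetic function of `y_true`, so the graph (and Keras's checks on it) can see the dependency.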
@DeNeutoy, just a catch-all place for exploration of issue #354.
So far, I heavily modified the decomposable attention layer test, to see if it was computing gradients. You can see in the test logs that it is indeed computing gradients for the embeddings. I don't have a good way to know if the gradient is correct, however.
I'm not sure what the deal was with using switch in the loss function.