Closed matt-gardner closed 7 years ago
Do you have the traceback from the failed switch-in-loss-function run?
FWIW, testing whether the gradient is correct can be difficult and annoying. If you can't compute the gradient analytically, the best way to check it is with finite differences: define some small h > 0 (e.g. 1e-3), then compute df/dx ≈ (f(x+h) - f(x-h)) / (2h).
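That central-difference check is easy to sketch in numpy. This is just an illustration of the idea (the helper name and the example function are made up here, not from deep_qa):

```python
import numpy as np

def numeric_grad(f, x, h=1e-3):
    """Approximate df/dx at x by central differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        bump = np.zeros_like(x)
        bump.flat[i] = h
        grad.flat[i] = (f(x + bump) - f(x - bump)) / (2 * h)
    return grad

# Example: f(x) = sum(x**2) has analytic gradient 2*x.
x = np.array([1.0, -2.0, 0.5])
approx = numeric_grad(lambda v: np.sum(v ** 2), x)
exact = 2 * x
print(np.max(np.abs(approx - exact)))  # should be tiny (O(h**2))
```

If the finite-difference estimate and the gradient your framework computes disagree by much more than O(h**2), something is wrong with the analytic gradient.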
@matt-peters: you can see the test log here: https://travis-ci.org/allenai/deep_qa/jobs/231293375. Some poking around showed that this line was returning all Nones.
Also, @matt-peters, @DeNeutoy has done some more poking around with the loss function (which involves `switch`) outside of any Keras model, and it looks like TF computes the gradients through `switch` just fine. It's looking like it might just be something weird that Keras does with the loss function. Wish I understood it, though...
Strange indeed. Good to know.
Putting in @DeNeutoy's snippet:
```python
import tensorflow as tf
import numpy as np

from deep_qa.training.losses import ranking_loss

inputs = tf.placeholder(tf.float32, [10, 20])
targets = tf.placeholder(tf.float32, [10, 20])
variable = tf.get_variable("weight", [10, 20])
preds = inputs * variable
loss = ranking_loss(preds, targets)

optimiser = tf.train.AdamOptimizer(0.001)
grads = tf.gradients(loss, tf.trainable_variables())
train_op = optimiser.apply_gradients(zip(grads, tf.trainable_variables()))

session = tf.Session()
session.run(tf.global_variables_initializer())
for i in range(200):
    _, actual_loss, actual_grads = session.run(
        [train_op, loss, grads],
        feed_dict={inputs: np.random.random([10, 20]),
                   targets: np.random.random_integers(0, 1, [10, 20])})
    print(actual_grads)
    print(actual_loss)
```
I'm satisfied enough to close this now.
Maybe it's because `y_true` only showed up in the condition of the `switch`, not the resultant tensors? I know Keras has some constraints on how you use `y_true` in the loss function - other people that have used ranking losses will multiply by `y_true` just so that it shows up in the graph, even though they don't need it. Maybe that's what we were running into...
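For concreteness, here's a hypothetical numpy sketch of that trick (the function name, margin, and loss formulation are made up for illustration, not deep_qa's actual `ranking_loss`): a hinge-style ranking loss where `y_true` is multiplied into the score terms, so it necessarily appears in the computation graph rather than only inside a `switch` condition:

```python
import numpy as np

def ranking_loss_with_y_true(y_true, y_pred, margin=1.0):
    """Hinge-style ranking loss. Multiplying scores by y_true (and 1 - y_true)
    keeps y_true in the graph, rather than using it only as a condition."""
    # Score of the correct option (y_true is one-hot per row).
    correct = np.sum(y_pred * y_true, axis=-1)
    # Best score among the incorrect options.
    best_wrong = np.max(y_pred * (1 - y_true), axis=-1)
    return np.mean(np.maximum(0.0, margin + best_wrong - correct))

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.2, 0.9, 0.1]])
print(ranking_loss_with_y_true(y_true, y_pred))  # ~0.3 for this example
```

In a real Keras loss the same pattern would be written with backend ops, but the point is the same: the loss is an arithmetic function of `y_true`, so the graph (and Keras's checks on it) can see the dependency.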
@DeNeutoy, just a catch-all place for exploration of issue #354.
So far, I heavily modified the decomposable attention layer test, to see if it was computing gradients. You can see in the test logs that it is indeed computing gradients for the embeddings. I don't have a good way to know if the gradient is correct, however.
I'm not sure what the deal was with using switch in the loss function.