allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

simple switch to tensorflow optimizers #317

Closed DeNeutoy closed 7 years ago

DeNeutoy commented 7 years ago

#315

This is a very simple change which allows us to use TensorFlow optimisers directly. However, it comes with a few caveats:

  1. It won't work with Keras Constraints. This is because a TensorFlow optimiser returns a single operation that applies the gradient updates for all variables at once and returns nothing, whereas the Keras optimisers apply the updates per variable and return each updated tensor, which is what the constraint is applied to (see the sketch after this list). It's probably not worth our time working around this, given that it's a short-term measure until we switch to full TF Keras.

  2. If you want to pass arguments to the TF optimiser class, you need to use a dictionary of keyword arguments whose names match the TF optimiser's constructor arguments, not the Keras ones. I think this is reasonable given the way we pass arguments to DeepQAModels anyway.
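A minimal sketch of the difference described in caveat 1, assuming TF 1.x graph mode and the standalone Keras of that era (illustrative only, not deep_qa code):

```python
import tensorflow as tf

# A TF optimiser hands back a single op that updates every variable in place;
# running it yields None, so there is no per-variable updated tensor for a
# Keras Constraint to act on.
weights = tf.Variable([1.0, 2.0], name="weights")
loss = tf.reduce_sum(tf.square(weights))
train_op = tf.train.AdamOptimizer(learning_rate=0.1).minimize(loss)

# A Keras optimiser instead builds one assignment per variable and returns the
# updated tensor, which is where the constraint hooks in, roughly:
#   new_w = w - lr * grad
#   if w in constraints:
#       new_w = constraints[w](new_w)  # constraint applied to the updated tensor
#   updates.append(K.update(w, new_w))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)  # applies the Adam update; the fetch returns None
```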

This should fix the IndexedSlices problem, in that gradient updates for embeddings will be sparse, but it doesn't address separating out the training loop.
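A rough illustration of the sparse-gradient point (a TF 1.x sketch, not the deep_qa code path):

```python
import tensorflow as tf

# The gradient of an embedding lookup comes back as a tf.IndexedSlices that
# only covers the rows that were looked up. The tf.train optimisers keep it in
# that form and take their sparse-update path, whereas the Keras update rule
# densifies the gradient and rewrites the whole embedding matrix.
embeddings = tf.Variable(tf.random_normal([700000, 512]))
token_ids = tf.constant([3, 17, 42])
loss = tf.reduce_sum(tf.nn.embedding_lookup(embeddings, token_ids))

grad = tf.gradients(loss, embeddings)[0]
print(type(grad))  # IndexedSlices, not a dense Tensor

train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
```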

matt-gardner commented 7 years ago

I'm confused as to why this is showing up as not having code coverage in our tests - it makes it look like these optimizers aren't actually being used somehow, even though the default optimizer is "adam". Any idea what's going on here?

And have you actually tried running this? Does it help runtimes?

DeNeutoy commented 7 years ago

OK, so using the TF optimisers gives speedups which look roughly linear in the vocabulary size and roughly constant (but still significant) in the word embedding size. Measurements are in seconds:

Just an embedding layer, varying the vocabulary size:

| Vocabulary size | TF (s) | Keras (s) | Speedup |
|---|---|---|---|
| 7,000 | 0.028 | 0.042 | 2x |
| 70,000 | 0.02929 | 0.1470 | 7x |
| 700,000 | 0.0297 | 1.4926 | 70x |

Varying only the embedding size, holding the vocabulary fixed at 70,000:

| Embedding size | TF (s) | Keras (s) | Speedup |
|---|---|---|---|
| 128 | 0.00904 | 0.03917 | 3x |
| 512 | 0.02929 | 0.1470 | 7x |
| 1024 | 0.0514 | 0.2659 | 5x |

Now adding an LSTM on top of the embeddings:

| LSTM inputs | Embedding size | Vocabulary size | TF (s) | Keras (s) | Speedup |
|---|---|---|---|---|---|
| 512 | 512 | 70,000 | 0.64 | 0.71 | - |
| 512 | 512 | 700,000 | 0.63 | 1.70 | 2x |
| 128 | 512 | 700,000 | 0.13 | 1.20 | 9x |
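For context, a hypothetical timing harness in the spirit of these measurements (not the actual script used; the sizes and step count are illustrative, and absolute numbers will vary by machine):

```python
import time
import tensorflow as tf

# Time a single optimiser step for an embedding-only model.
vocab_size, embedding_dim = 70000, 512
embeddings = tf.Variable(tf.random_normal([vocab_size, embedding_dim]))
token_ids = tf.constant([[3, 17, 42, 56]])
loss = tf.reduce_sum(tf.nn.embedding_lookup(embeddings, token_ids))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)  # warm-up step
    start = time.time()
    for _ in range(100):
        sess.run(train_op)
    print((time.time() - start) / 100, "seconds per step")
```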

Overall this certainly looks worthwhile. I'm trying to profile it on one of our actual models, but that's a bit tricky because Keras, not us, controls the training loop execution; I'll try that this afternoon. Finally, the TF optimisers really don't seem to be playing well with how Keras represents sparse tensors: I am still getting errors when using gradient clipping (of any type). Additionally, there are places in our code, such as the Keras learning rate decay callback, that rely on using the Keras optimisers.

DeNeutoy commented 7 years ago

Hmm, OK, maybe this isn't a straightforward switch. For the GA reader on SciQ, using 300-dim word embeddings, we have: Keras: 5.2 secs/batch, TensorFlow: 5.6 secs/batch.

Although this will only have a vocab size equal to SciQ's vocab, which I'm guessing isn't very large. I was trying it on Who Did What, but my machine ran out of memory. I'll try again tomorrow.

DeNeutoy commented 7 years ago

OK, so using BiDAF, we have:

100-dim embeddings: TF: 1.05 sec/batch, Keras: 1.24 sec/batch (averaged over 20 batches)

800-dim embeddings: TF: 2.5 sec/batch, Keras: 2.9 sec/batch (averaged over 20 batches)

matt-gardner commented 7 years ago

A free 20% speedup is really good. I'm not going to complain about that. I'm still confused about what's going on with the tests, though - it doesn't seem like it should be 41% coverage on the diff.

nelson-liu commented 7 years ago

> I'm still confused about what's going on with the tests, though - it doesn't seem like it should be 41% coverage on the diff.

Agreed. I briefly poked at it, and the coverage report definitely seems correct; inserting an import pdb; pdb.set_trace() in any of the uncovered regions doesn't make the tests break, so the code definitely isn't being run. As to why, I'm not too sure yet...

nelson-liu commented 7 years ago

Nice, I found it. The default value of optimizer in trainer.py is just the string "adam". In the optimizer_from_params function, there's a condition that just returns the input params if it's a string, without doing anything with the optimizer dicts. As a result, in this case it bypasses the dicts completely and thus doesn't instantiate/test any of the wrapper classes.

A quick fix is to just change the default value of the optimizer param in trainer.py linked above to {"type": "adam"}, as opposed to just "adam", I think. I ran the tests and got some failures, but the coverage on optimizers.py should be reported correctly once those are fixed...

matt-gardner commented 7 years ago

Nice, thanks Nelson!

I'd probably just change optimizer_from_params to look like this:

def optimizer_from_params(params):
    # params may be either a bare string like "adam" or a dict like {"type": "adam", ...}.
    if isinstance(params, str):
        optimizer = params
        params = {}
    else:
        optimizer = params.pop_choice("type", optimizers.keys())
    return optimizers[optimizer](**params)

That way you can still just pass in "adam" as the optimizer param to Trainer.
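For example, both of these hypothetical Trainer param snippets would then be accepted (shown as plain Python dicts for illustration):

```python
# In the dict form, the keyword arguments follow the TF optimiser's
# constructor (e.g. learning_rate), per the caveat earlier in this thread.
string_form = {"optimizer": "adam"}
dict_form = {"optimizer": {"type": "adam", "learning_rate": 0.001}}
```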

DeNeutoy commented 7 years ago

@matt-gardner I've cleaned up a few things here. The big issue is incorporating learning rate callbacks. The problem is that all the Keras optimisers expose the learning rate as a variable, which is what the callback updates. However, I can't bind the TensorFlow ones to the same class attribute because 1) their learning rates are not tensors and 2) different optimisers use different attribute names for the learning rate, which makes doing this kind of tricky (see the sketch below)...
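To illustrate the mismatch, a sketch assuming the standalone Keras API and TF 1.x internals of that era (the private attribute names are recalled from the TF source, not taken from deep_qa):

```python
import tensorflow as tf
from keras import backend as K
from keras.optimizers import Adam

# Keras: the learning rate is a backend variable on the optimiser, so a
# callback can update it in one uniform way for every optimiser.
keras_adam = Adam(lr=0.001)
K.set_value(keras_adam.lr, 0.0005)

# TF 1.x: the learning rate is just a constructor argument, stashed under
# optimiser-specific private attributes (e.g. _lr on AdamOptimizer vs
# _learning_rate on GradientDescentOptimizer), so there is no single
# attribute a generic callback could rebind.
tf_adam = tf.train.AdamOptimizer(learning_rate=0.001)
tf_sgd = tf.train.GradientDescentOptimizer(learning_rate=0.001)
```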

Also, there is a test in text_trainer that is failing and I can't for the life of me figure out why. It throws an error during training because we somehow manage to pass it a tensor with a negative first dimension 😭. However, none of the other tests fail with this, and they are (in theory) much more complex.

matt-gardner commented 7 years ago

I'm not sure what's going on with the test, but I can add some context: it's failing when model.fit is called a second time on the same model. Can you reproduce that in a REPL with a simple model? I don't know why this might be happening, and probably won't have time to figure it out today, but that's my guess for what's going on here.

DeNeutoy commented 7 years ago

@matt-gardner - this now works. The only annoying thing is that I had to remove the learning rate scheduling in the GA Reader: the Keras LR callback is not going to work with the TensorFlow optimisers, as the learning rate is bound to different attributes for different optimisers. However, there are automatic learning rate decay functions in TF which we could use easily (sketch below). I wanted to get this merged before it became too big, though. I think it's a good first step towards exposing the actual session.run calls, and I now have a better understanding of how that actually works in Keras, so it should be easier to do other cool stuff like arbitrary TensorBoard logging within models.
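For reference, those TF decay functions would look something like this (a sketch with illustrative decay settings, not something wired into this PR):

```python
import tensorflow as tf

# Build the learning rate as a tensor that decays with the global step and
# feed it straight into the optimiser, instead of mutating a Keras variable
# from a callback. The decay settings below are illustrative only.
global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(learning_rate=0.001,
                                           global_step=global_step,
                                           decay_steps=1000,
                                           decay_rate=0.96,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# Passing global_step to minimize() makes each update increment the step:
# train_op = optimizer.minimize(loss, global_step=global_step)
```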