allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Multi gpu 2 #356

Closed DeNeutoy closed 7 years ago

DeNeutoy commented 7 years ago

This is just a clean version of the other PR without the experimental bits and leaving out some irrelevant keras.compile refactoring.

DeNeutoy commented 7 years ago

@matt-gardner I have added a ModelCheckpoint class for the multi-GPU models which will only save the actual DeepQaModel layer. In terms of model loading, what do you think the default behavior should be? Presumably when you are loading a model you are doing some sort of evaluation, so you probably don't want it to be parallelised?
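
For context, a checkpoint callback that only serialises the wrapped single-GPU model can be sketched roughly like this (SingleModelCheckpoint and its arguments are illustrative names, not necessarily what this PR uses):

from keras.callbacks import Callback

class SingleModelCheckpoint(Callback):
    """Save only the wrapped single-GPU model, not the multi-GPU wrapper."""

    def __init__(self, model_to_save, filepath):
        super(SingleModelCheckpoint, self).__init__()
        self.model_to_save = model_to_save
        self.filepath = filepath

    def on_epoch_end(self, epoch, logs=None):
        # self.model is the (parallelised) model Keras is training; we save
        # the underlying single-device model instead.
        self.model_to_save.save(self.filepath.format(epoch=epoch))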

matt-gardner commented 7 years ago

Yeah, load a model onto a single GPU (or CPU). If we need to get more complicated than that later, we'll worry about it then.

matt-gardner commented 7 years ago

Not sure what the deal is with the sphinx warning; you might want to try building the docs locally to see if it helps: cd doc; make html-strict.

DeNeutoy commented 7 years ago

@matt-peters, I think you no longer need to use your Keras fork to do multi-GPU stuff (with some caveats). Please let me know if you don't agree with the below, as it affects this PR:

This line in this commit added name scoping when building Keras layers (although not to address this problem, just to make better layer scopes for TensorBoard).

Now, this shouldn't change anything for variable creation, because tf.name_scope only scopes non-variable ops. Right? Wrong - it affects all ops except calls to tf.get_variable, including tf.Variable(), which is what Keras uses in all of its backend TensorFlow variable creation methods. For example:

import tensorflow as tf

# tf.Variable (what the Keras backend uses) picks up the surrounding name scope.
with tf.name_scope("test"):
    var1 = tf.Variable([], name="var1")

assert var1.name == "test/var1:0"  # True

# tf.get_variable ignores tf.name_scope entirely.
with tf.name_scope("test"):
    var2 = tf.get_variable("var2", [])

assert var2.name == "var2:0"  # True

This means that if we do something like:

model = KerasModel(inputs, outputs)

tower_outputs = []
for x in range(3):
    with tf.device("/gpu:{}".format(x)):
        # Calling the model as a layer re-uses its weights on each device.
        tower_outputs.append(model(inputs))

new_model = KerasModel(inputs, concat(tower_outputs))
new_model.compile(...)

Now that we are reusing the model as a layer, it will have its own scope when it is built, and hence its layers will be reused rather than built again. This is the same argument as for creating a shared LSTM in Keras, where you would do:

from keras.layers import LSTM

lstm = LSTM(64)

# Both calls share the same layer instance, and therefore the same weights.
encoded_1 = lstm(input1)
encoded_2 = lstm(input2)

Do you agree? The caveat here is that all of the layers must either be built via model.compile or be built inside a name scope.

matt-peters commented 7 years ago

In your second snippet, I agree that this will re-use the variables since the layers won't be rebuilt, in the same way that the third snippet will re-use them. One sanity check that I do is printing the results of tf.global_variables() after the final model is created. This will quickly show if there are duplicated variables that should be re-used.
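
Concretely, that sanity check is just something like the following (plain TensorFlow, nothing specific to this PR):

import tensorflow as tf

# After the final multi-GPU model is built, list every variable in the graph.
# Duplicated layer weights (e.g. two copies of "dense_1/kernel") mean the
# towers are not actually sharing parameters.
for variable in tf.global_variables():
    print(variable.name, variable.shape)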

The thing that I'm not sure about with this approach is how tensorflow will handle moving ops and variables from device to device and how that impacts efficiency. The only way I know to resolve this is to benchmark - if you run on say 2 GPUs with reasonably large batch size (so that computing gradients is slow relative to updating them) what is the speedup vs 1 GPU? And 4 GPUs vs 1 GPUs?

DeNeutoy commented 7 years ago

OK great - I have a test which does exactly the global_variables thing; I was just checking that you also thought it was sufficient.

With regard to efficiency, I think I have fixed that by pinning all of the variables to the CPU using this scope function, which allocates new variables on the CPU even if they are created within a GPU scope. When I tested this, 2 and 3 GPUs were no slower than using 1.
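
For anyone reading along, a device function of that sort can be sketched in plain TensorFlow roughly as below (the name pin_variables_to_cpu is illustrative, not the actual function used in this PR):

import tensorflow as tf

def pin_variables_to_cpu(gpu_device):
    """Return a device function that keeps variables on the CPU.

    Ordinary ops are placed on gpu_device, but any variable-creating op is
    forced onto the CPU so that all towers share a single copy of the weights.
    """
    def _device_for_op(op):
        if op.type in ("Variable", "VariableV2", "VarHandleOp"):
            return "/cpu:0"
        return gpu_device
    return _device_for_op

# Usage: everything inside the scope runs on gpu:0, but new variables
# are created on the CPU.
with tf.device(pin_variables_to_cpu("/gpu:0")):
    weights = tf.Variable(tf.zeros([10, 10]), name="shared_weights")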

matt-peters commented 7 years ago

I'm confused by your last statement about efficiency - with linear scaling, 2 GPUs should be twice as fast as 1 GPU (for the same equivalent batch size), not about the same speed?

DeNeutoy commented 7 years ago

Oh I see - I think I must have the wrong idea of how people talk about scaling, because Matt G was also confused by my measurements. Basically, 2 GPUs with a batch size of 32 on each (64 effective batch size) is as fast as 1 GPU with a batch size of 32, and the same with 3 GPUs (96 effective batch size). Is that what you meant?

matt-gardner commented 7 years ago

Yeah, I'd call that linear scaling. You're right that it's not necessarily a linear speedup in learning time, because that interacts in interesting ways with the number of steps you take, and whatnot. But it's a linear speedup in how fast you can go through the data.

matt-peters commented 7 years ago

Ah yes, that's what I'd call linear scaling. Just to confirm, your timings are per batch, right (not per epoch)? We should expect the epoch time to decrease by half with 2 GPUs and 2x effective batch size.

Sounds like this is as efficient as we could hope for then, nice work!

DeNeutoy commented 7 years ago

Ah, I need to run these again as the comparisons were with an approximate batch size. Thanks for pointing that out!

DeNeutoy commented 7 years ago

OK, this isn't working properly: a full epoch over the training data is only 50 seconds faster with 2 GPUs, so I was thinking about the measurements wrong. I think the problem is the gradient computation - it needs to be done per GPU, with just the aggregation happening on the CPU.
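
For reference, the usual tower pattern for this looks roughly like the sketch below (build_tower_loss and num_gpus are placeholders, not code from this PR): gradients are computed on each GPU, and only the averaging and the update happen on the CPU.

import tensorflow as tf

optimizer = tf.train.AdamOptimizer()

# Compute gradients separately on each GPU.
tower_grads = []
for i in range(num_gpus):
    with tf.device("/gpu:{}".format(i)):
        loss = build_tower_loss(i)  # each tower reads its own slice of the batch
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients on the CPU and apply a single update.
with tf.device("/cpu:0"):
    averaged_grads_and_vars = []
    for grads_and_vars in zip(*tower_grads):
        grads = [grad for grad, _ in grads_and_vars if grad is not None]
        variable = grads_and_vars[0][1]
        averaged_grads_and_vars.append(
            (tf.reduce_mean(tf.stack(grads), axis=0), variable))
    train_op = optimizer.apply_gradients(averaged_grads_and_vars)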

DeNeutoy commented 7 years ago

Merging this, with the objective of adding more of the tower functionality for the gradients to see if that speeds things up. @matt-peters do you have benchmarks for how fast this should be? What speedup did you see from running the multi-GPU stuff for your language models?

matt-peters commented 7 years ago

After running some benchmarks on my code, I'm thinking that this approach is only useful for taking larger batches than can fit on a single GPU. For an SNLI model with a LM (high ratio of computation to number of parameters) I'm seeing an approximately linear speedup with the number of GPUs. However, I can also get a substantial speedup on a single GPU by just increasing the batch size until it runs OOM.

Here are some benchmarks:

1 GPU, batch_size = 32, 24.6 examples / s
1 GPU, batch_size = 64, 45.7 examples / s
1 GPU, batch_size = 128, OOM

2 GPUs, batch_size = 32 x 2, 46.6 examples / s
2 GPUs, batch_size = 64 x 2, 86.9 examples / s

4 GPUs, batch_size = 64 x 4, 174.6 examples / s

matt-peters commented 7 years ago

Relevant paper:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45187.pdf