allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Multi gpu support #326

Closed DeNeutoy closed 7 years ago

DeNeutoy commented 7 years ago

This PR adds data parallelism, allowing batch updates to be split across any number of GPUs, with synchronous gradient updates performed on the CPU.
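For reference, a minimal sketch of the tower-style pattern being described, assuming plain TensorFlow 1.x; `build_loss`, the shapes, and `NUM_GPUS` are illustrative placeholders, not deep_qa's actual code:

```python
import tensorflow as tf

NUM_GPUS = 2  # assumed; the PR reads this from the model parameters

def build_loss(inputs, targets):
    # Stand-in model (not deep_qa code): one dense layer, squared error.
    predictions = tf.layers.dense(inputs, 1)
    return tf.reduce_mean(tf.square(predictions - targets))

with tf.device('/cpu:0'):
    inputs = tf.placeholder(tf.float32, [None, 10])
    targets = tf.placeholder(tf.float32, [None, 1])
    optimizer = tf.train.AdamOptimizer()

    # Split the batch into one shard per GPU (batch size must divide evenly).
    input_shards = tf.split(inputs, NUM_GPUS, axis=0)
    target_shards = tf.split(targets, NUM_GPUS, axis=0)

    tower_grads = []
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
            # Each tower reuses the same variables and computes gradients
            # for its shard of the batch.
            loss = build_loss(input_shards[i], target_shards[i])
            tower_grads.append(optimizer.compute_gradients(loss))

    # Average the per-tower gradients and apply them once on the CPU,
    # giving a single synchronous update per step.
    averaged_grads = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars])
        averaged_grads.append((tf.reduce_mean(grads, axis=0),
                               grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged_grads)
```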

Additionally, I've added a tensorflow device scope to the _embed_input function, which should place all embedding variables on the CPU.
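Roughly, that device scope amounts to the following (the variable names and shapes are illustrative, not the actual `_embed_input` code):

```python
import tensorflow as tf

# Pin the (large) embedding matrix to the CPU; sizes are illustrative.
with tf.device('/cpu:0'):
    embeddings = tf.get_variable('embeddings', shape=[50000, 300])

word_ids = tf.placeholder(tf.int32, [None, 40])
with tf.device('/gpu:0'):
    # The lookup runs on the GPU; only the gathered rows cross the
    # device boundary, not the whole embedding matrix.
    embedded = tf.nn.embedding_lookup(embeddings, word_ids)
```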

There are a few things to decide here:

1) Default behaviour when the number of available GPUs does not match the number specified in the model parameters. At the moment (and by default in Keras), we enable `allow_soft_placement=True` in the session configuration, which means that tensorflow will try to put ops on the specified GPUs, but won't complain if they aren't available and will simply allocate them to some other device instead.

2) Should we keep the device-placement logging, or make it a parameter? It prints every operation in the graph along with the device it is assigned to, which is a lot of output, but also extremely helpful for debugging (see the config sketch below).
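For concreteness, both points come down to two flags on the session config; a sketch, assuming stock TensorFlow (whether `log_device_placement` becomes a user-facing parameter is the open question):

```python
import tensorflow as tf

# allow_soft_placement lets TensorFlow fall back to another device when
# a requested GPU does not exist; log_device_placement prints the device
# assignment of every op in the graph, which is verbose but invaluable
# for checking where things actually ran.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
session = tf.Session(config=config)
```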

matt-peters commented 7 years ago

Does this method work and give performance speedups? E.g., if you run on multiple GPUs and manually inspect nvidia-smi, are all GPUs working equally hard? And how much speedup do you get versus just using a larger batch on a single GPU?

It looks like you are creating only one copy of each variable/weight on a single device, with multiple input and output tensors spread across devices. I'm curious what tensorflow would do in this situation.
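The snippet that followed isn't preserved in this copy of the thread; as a hypothetical stand-in, the situation described looks something like this:

```python
import tensorflow as tf

# One copy of the weights, pinned to the CPU.
with tf.device('/cpu:0'):
    weights = tf.get_variable('weights', shape=[10, 10])

# Consumers on two different GPUs. TensorFlow inserts implicit
# send/recv ops at the device boundaries, so each GPU receives a fresh
# copy of the weight values every time the ops run.
outputs = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        x = tf.placeholder(tf.float32, [None, 10], name='input_%d' % i)
        outputs.append(tf.matmul(x, weights))
```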

DeNeutoy commented 7 years ago

Yeah - I think this copies the weights back and forth to the different GPUs, with the speedup coming from the forward and backward passes running in parallel. However, I need to look into the difference between this and using one of tensorflow's parameter servers, as that seems to be how they want people to do this distributed stuff...
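For comparison, a minimal sketch of the parameter-server placement TensorFlow documents for distributed training, using `tf.train.replica_device_setter`; the cluster addresses are illustrative:

```python
import tensorflow as tf

# Illustrative two-process cluster: one parameter server, one worker.
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223'],
})

# replica_device_setter places variables on the ps job and everything
# else on the worker, so the weights live in one place and workers pull
# them over the network instead of each device holding a copy.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    weights = tf.get_variable('weights', shape=[10, 10])
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.matmul(x, weights)
```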