DeNeutoy closed this pull request 7 years ago.
Does this method work and give performance speedups? E.g. if you run on multiple GPUs and manually inspect nvidia-smi, are all GPUs working equally hard? Or how much speedup do you get vs just using a larger batch on a single GPU?
It looks like you are creating only one copy of each variable/weight on a single device, and then multiple input and output tensors spread across devices. I'm curious what TensorFlow would do in this situation:
Yeah - I think this is copying weights back and forth to the different GPUs, with the speedup being that this can be done in parallel - however, I need to look into the difference between this and using one of TensorFlow's parameter servers, as that seems to be how they want people to do this distributed stuff...
This PR adds data parallelism, which allows batch updates to be split across any number of GPUs, with synchronous gradient updates performed on the CPU.
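Roughly, the setup looks like the following sketch: the batch is split across GPU towers, each tower reuses the single copy of the weights, and the per-tower gradients are averaged on the CPU before a single synchronous update. The names here (`build_parallel_train_op`, `model_fn`) are illustrative rather than the actual code in this PR, and sparse `IndexedSlices` gradients are ignored for brevity.

```python
# Illustrative tower-style data parallelism (TF 1.x); names are hypothetical,
# not the actual functions in this PR.
import tensorflow as tf

def build_parallel_train_op(model_fn, inputs, labels, num_gpus, optimizer):
    """Split a batch across `num_gpus` towers and average gradients on the CPU."""
    input_splits = tf.split(inputs, num_gpus, axis=0)
    label_splits = tf.split(labels, num_gpus, axis=0)
    tower_grads = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            # Reuse the same variable scope in every tower, so there is only
            # one copy of the weights; each tower computes its own loss.
            with tf.variable_scope('model', reuse=(i > 0)):
                loss = model_fn(input_splits[i], label_splits[i])
            tower_grads.append(optimizer.compute_gradients(loss))
    with tf.device('/cpu:0'):
        # Synchronous update: average each variable's gradient over the towers
        # on the CPU, then apply the averaged gradients once.
        averaged = []
        for grads_and_vars in zip(*tower_grads):
            grads = [g for g, _ in grads_and_vars if g is not None]
            if not grads:
                continue
            variable = grads_and_vars[0][1]
            averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), variable))
        train_op = optimizer.apply_gradients(averaged)
    return train_op
```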
Additionally, I've added a TensorFlow device scope to the `_embed_input` function, which should locate all embedding variables on the CPU.

There are a few things to decide here:

1) Default behaviour when the number of available GPUs is not what it is in the model parameters. At the moment (and by default in Keras), we enable `soft_device_placement=True` in the session configuration, which means that TensorFlow will try to put things on the specified GPUs, but won't complain if they aren't available and will just allocate them some other way if the devices aren't there.

2) Should we keep the device placement logging / make it a parameter? It will print all of the operations in the graph and which device they are on, which is a lot of output, but also extremely helpful for debugging etc.
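For reference, the underlying TensorFlow 1.x options are the `allow_soft_placement` and `log_device_placement` fields of `tf.ConfigProto`; below is a minimal sketch of the session config and the CPU-pinned embedding scope. The `embed_input` function here is illustrative, not the repo's actual `_embed_input`.

```python
# Sketch of the session configuration and device scope being discussed.
# The parameter names exposed by this repo may differ; the tf.ConfigProto
# fields below are the standard TensorFlow 1.x ones.
import tensorflow as tf

def make_session(log_device_placement=False):
    config = tf.ConfigProto(
        # Fall back to another device if a requested GPU is not available,
        # instead of raising an error.
        allow_soft_placement=True,
        # Optionally print, for every op in the graph, which device it was
        # placed on -- verbose, but useful for debugging placement issues.
        log_device_placement=log_device_placement)
    return tf.Session(config=config)

def embed_input(inputs, vocab_size, embedding_dim):
    # Keeping embedding variables on the CPU avoids replicating a large
    # embedding matrix onto every GPU tower.
    with tf.device('/cpu:0'):
        embeddings = tf.get_variable(
            'embedding_matrix', shape=[vocab_size, embedding_dim])
        return tf.nn.embedding_lookup(embeddings, inputs)
```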