cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0

multi GPU on single machine support #31

Open marcoleewow opened 7 years ago

marcoleewow commented 7 years ago

Hi, are there any plans for multi-GPU support on a single machine? I would like to use DOWNPOUR (asynchronous) SGD for my training, which is not supported in Keras.

JoeriHermans commented 7 years ago

If the machines have GPUs, it should automatically allocate the GPUs during training. Let me know if it doesn't work. Personally I wouldn't use DOWNPOUR, because of the "implicit momentum" that appears when the number of asynchronous workers gets too high, but rather AGN or ADAG; see https://github.com/JoeriHermans/master-thesis/blob/master/thesis/master_thesis_joeri_hermans.pdf for more information.

Joeri
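
For reference, a rough sketch of how the ADAG trainer mentioned above is invoked in dist-keras, adapted from the project's README examples; the model, the column names, and the hyperparameters below are placeholders, and parameter names may differ between versions:

```python
from keras.models import Sequential
from keras.layers import Dense
from distkeras.trainers import ADAG

# Placeholder Keras model; `training_set` is assumed to be a Spark DataFrame
# with an assembled "features" vector column and a "label" column.
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))

trainer = ADAG(keras_model=model, worker_optimizer='adagrad',
               loss='categorical_crossentropy', num_workers=4,
               batch_size=32, communication_window=12, num_epoch=1,
               features_col="features", label_col="label")
trained_model = trainer.train(training_set)  # returns a trained Keras model
```

The `communication_window` controls how many mini-batches a worker processes before exchanging its accumulated update with the parameter server.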

JoeriHermans commented 7 years ago

If the community is interested, I can make a Spark-less script which does the thing you want. Should be rather trivial to implement.

marcoleewow commented 7 years ago

I don't have problems running separate models on each of my GPUs, and I can also use synchronous SGD with TensorFlow (see here), but so far there is no asynchronous SGD code available for multi-GPU systems. It would be amazing if you could implement this :) I will be more than happy to help you out! Note that I am still a newbie, though...

JoeriHermans commented 7 years ago

I wouldn't use synchronous SGD since it implicitly increases the size of a mini-batch :) If you want, we can work together (I don't have multiple GPUs to test with at the moment); drop me an e-mail: joeri@joerihermans.com. Then we can make a Gist or something for everyone.

pengpaiSH commented 7 years ago

@JoeriHermans Hi, Hermans. I would be very interested if you could make a multi-GPU version without the Spark environment!

imranshaikmuma commented 7 years ago

I have been working on the same thing for a few weeks. The methods I have found so far for training a model on multiple GPUs on a single instance:

  1. Manually assign GPU devices with tf.device statements. The problem with this is that we have to manually calculate the average of the losses/weights etc. for each tower (GPU) and update those variables ourselves, which seemed tough to me (see the first sketch after this list).
  2. Use MXNet and pass a context list of all the GPUs available while training (see the second sketch below). The problem with this is that it has the most commonly used layers/cells, but not as many as Keras; still, I feel it will suit about 95 percent of our models. It averages the losses/weights/gradients across all GPUs automatically, so you don't need to code that by hand.
  3. Use Elephas to distribute training across multiple worker machines, i.e. many GPUs on different machines rather than on the same machine. This library is going to be extended to multiple GPUs on the same instance soon, so watch out for this! For now it extends Keras to Spark, which is awesome if we have multiple worker machines.
  4. Use Distributed Keras (dist-keras), which does the same as Elephas, but I haven't tried it myself.

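A minimal TensorFlow 1.x sketch of the tower pattern from item 1 above; the two-layer model, its sizes, and the layer names are placeholders:

```python
import tensorflow as tf

NUM_GPUS = 2  # number of towers / GPUs on the machine

def tower_loss(x, y):
    # Placeholder two-layer model; fixed layer names so weights are shared across towers.
    hidden = tf.layers.dense(x, 64, activation=tf.nn.relu, name='fc1')
    logits = tf.layers.dense(hidden, 10, name='fc2')
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
optimizer = tf.train.GradientDescentOptimizer(0.01)

# Split the batch, build one tower per GPU, and collect per-tower gradients.
x_parts, y_parts = tf.split(x, NUM_GPUS), tf.split(y, NUM_GPUS)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            loss = tower_loss(x_parts[i], y_parts[i])
            tower_grads.append(optimizer.compute_gradients(loss))
            tf.get_variable_scope().reuse_variables()  # reuse weights for later towers

# The manual step item 1 refers to: average each variable's gradient over the towers.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grads_and_vars])
    averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))

train_op = optimizer.apply_gradients(averaged)
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
```

This gives synchronous updates (the averaged gradient is applied once per step); the asynchronous schemes discussed above instead let each worker apply its own update to shared parameters.
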
Whichever of these libraries you use, don't forget to use the tensorflow-gpu package, not regular tensorflow, together with the cuDNN libraries. cuDNN is a must for training time; it reduces it substantially.

Let me know if I am misunderstanding any of this. So far I have used MXNet and Elephas. Please also let me know if you find other libraries. I am willing to contribute and help you guys to the best of my ability. Looking forward to hearing your opinions on this.
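
A short sketch of the MXNet route from item 2 above; the symbolic network and the data are placeholders:

```python
import mxnet as mx
import numpy as np

# Placeholder symbolic network (two-layer MLP).
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=64, name='fc1')
net = mx.sym.Activation(net, act_type='relu')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
net = mx.sym.SoftmaxOutput(net, name='softmax')

# Placeholder data; in practice use your own iterator.
x = np.random.rand(1000, 784).astype('float32')
y = np.random.randint(0, 10, size=1000)
train_iter = mx.io.NDArrayIter(x, y, batch_size=64, shuffle=True)

# The context list is the "CONTEXT variable" item 2 refers to: one entry per GPU.
# Mini-batches are split across the devices and gradients are averaged automatically.
module = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(2)])
module.fit(train_iter, num_epoch=5, optimizer='sgd',
           optimizer_params={'learning_rate': 0.01})
```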

mohaimenz commented 6 years ago

@imranshaikmuma Is it possible to use Elephas for distributed training on my local machine? It has 8 cores and a 4 GB NVIDIA GPU. I would like to do the training with multiple workers for testing purposes. Could you please tell me if this is possible? I am trying it, but it is giving me an exception.
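
For reference, a sketch of running Elephas against a local Spark master, adapted from the Elephas README; the exact SparkModel/fit signature has changed across releases, so the parameter names here are an assumption, and `model`, `x_train`, and `y_train` are placeholders:

```python
from pyspark import SparkConf, SparkContext
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

# "local[8]" runs Spark with 8 worker threads on the local machine.
conf = SparkConf().setAppName('elephas-local-test').setMaster('local[8]')
sc = SparkContext(conf=conf)

# x_train / y_train are assumed to be NumPy arrays; `model` a compiled Keras model.
rdd = to_simple_rdd(sc, x_train, y_train)

spark_model = SparkModel(model, frequency='epoch', mode='asynchronous', num_workers=4)
spark_model.fit(rdd, epochs=5, batch_size=32, verbose=0, validation_split=0.1)
```

Note that with a single GPU all local workers end up sharing the same device, so this setup is mainly useful for testing the pipeline rather than for speed.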