cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0
624 stars 169 forks

Is it possible to use keras data generator with dist-keras #12

Closed oak-tree closed 7 years ago

oak-tree commented 7 years ago

Hello, We have some models that use the Keras flow_from_directory generators, and some that use a custom-made generator. Is there a simple process to feed dist-keras with those generators?

Thanks!

JoeriHermans commented 7 years ago

Hi

In essence, no, since the custom data delivery is handled by Spark. I don't know the structure of your data, but let's take an arbitrary example.

  1. You put your data on HDFS (or a local directory, if the path is identical on all workers).
  2. You read the data into a Spark DataFrame.
  3. You preprocess the data in parallel using Spark (it will create a computation graph).
  4. You feed the data to the trainer you like, and then Spark will ship the partitions to the workers in a "streaming" manner.
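The steps above can be sketched as a PySpark skeleton (not runnable as-is, since it needs a Spark cluster and dist-keras installed; the paths, column names, and the undefined `model` variable are placeholders, and the trainer parameters are taken from the repository's examples, so double-check them against the version you install):

```python
# Skeleton only -- requires a running Spark cluster and dist-keras.
from pyspark.sql import SparkSession
from distkeras.trainers import ADAG

spark = SparkSession.builder.appName("dist-keras-example").getOrCreate()

# 1. + 2. Read the data (e.g. from HDFS) into a Spark DataFrame.
df = spark.read.parquet("hdfs:///data/training-set.parquet")

# 3. Preprocess in parallel; Spark transformations are lazy, so this
#    only extends the computation graph until training starts.
df = df.select("features", "label")  # plus any feature transformers

# 4. Hand the DataFrame to a trainer; Spark ships the partitions to
#    the workers, which iterate over them in mini-batches.
trainer = ADAG(keras_model=model, worker_optimizer="adam",
               loss="categorical_crossentropy", num_workers=2)
trained_model = trainer.train(df)
```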

As you can see, you do not really need the Keras generators, since I apply batch training at the worker level: https://github.com/cerndb/dist-keras/blob/master/distkeras/workers.py#L283
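The worker-level batching amounts to iterating over a partition in fixed-size chunks instead of pulling from a Keras generator. A minimal plain-Python sketch of that idea (no Spark required; `minibatches` is an illustrative name, not the actual function in workers.py):

```python
def minibatches(partition, batch_size):
    """Yield successive mini-batches from one worker's partition."""
    batch = []
    for record in partition:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

# Example: a partition of 10 records in batches of 4.
batches = list(minibatches(range(10), 4))
```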

Example: https://github.com/cerndb/dist-keras/blob/master/examples/mnist.py

To summarize, if you would implement the generator with Spark, you would achieve identical behavior, and benefit from the distributed preprocessing. I hope this answers your question.

Joeri

oak-tree commented 7 years ago

Thanks for the quick reply!

The reason I use a generator is that we have a large amount of data which does not fit into memory all at once. Moreover, we apply a random augmentation to each image before feeding it into the trainer.

I'm new to Spark, so it would be great if you could point me to the right place. What if all of our data is already accessible on each of our machines - let's say 2 machines, with a single GPU each?

What I would like to achieve is to tell each of the machines to train asynchronously and to aggregate their results on one of the machines.

So basically - please correct me if I'm wrong - what I need from Spark/dist-keras is the ability to tell the machines to start and to aggregate their results. There is no need to broadcast the data itself between the machines, as each machine already owns its copy.
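This matches the usual data-parallel setup: each machine computes updates on its local data copy, and only parameters (not data) travel over the network. A toy NumPy sketch of two workers taking local SGD steps and a parameter server averaging the results (the function names are illustrative, not dist-keras API):

```python
import numpy as np

def local_update(weights, grad, lr=0.1):
    # One SGD step on a worker's local data shard; only this
    # parameter delta needs to cross the network, never the data.
    return weights - lr * grad

def aggregate(worker_weights):
    # Parameter averaging on the machine acting as parameter server.
    return np.mean(worker_weights, axis=0)

w = np.zeros(3)
w1 = local_update(w, np.array([1.0, 0.0, 0.0]))  # worker 1
w2 = local_update(w, np.array([0.0, 1.0, 0.0]))  # worker 2
w = aggregate([w1, w2])
```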

I have two questions regarding this:

  1. What do you think would fit this task best?
  2. Will it also work on a single machine with multiple GPUs?

Thanks again! Alon


JoeriHermans commented 7 years ago

Hi Alon,

I made an MNIST example which includes all the preprocessing steps with Spark, and then feeds the result to a distributed optimizer: https://github.com/cerndb/dist-keras/blob/master/examples/mnist.ipynb

To be honest, I don't think this framework would be a good fit for you, since Spark is really beneficial when it is spread over multiple machines. However, it should work if you want to try it out. If you do, use the ADAG optimizer, which is, in my opinion, the most stable one. Also note that if you want to use the GPUs, you should make sure the keras.json file in your home directory is configured accordingly.
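For reference, a minimal ~/.keras/keras.json, assuming the TensorFlow backend (the values below are illustrative; the actual CPU/GPU device placement is handled by the backend itself, not by this file):

```json
{
    "backend": "tensorflow",
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf"
}
```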

Also note that asynchronous optimization exhibits something called "implicit momentum". I have described it here: https://db-blog.web.cern.ch/blog/joeri-hermans/2017-01-distributed-deep-learning-apache-spark-and-keras. Since you will only use 2 workers, this shouldn't be an issue. However, I'm currently experimenting with some techniques which are able to mitigate this (the experimental trainer in this framework; if you like, I can send you the slides?).

Kind regards,

Joeri

oak-tree commented 7 years ago

Hey Joeri, Thanks for the very detailed answer. I'll check out the links, and sure, I would love to get the slides. Meanwhile I'm looking for ways to accelerate my training, in the sense of reducing epoch time. I might have another server with 2-4 GPUs, but for now I have 2 servers with 1 GPU each. If you have any ideas I can try, I'd be happy to hear them :)

JoeriHermans commented 7 years ago

Hey Alon,

For my master's thesis I have to run some experiments which could bring the training time (your epoch time) down even further. I think if you check back in 1 or 2 weeks, I will have a new implementation which doesn't require Spark. As a result, you will be able to use your generators.

Joeri