cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0
624 stars 169 forks

multiple modifications #29

Closed CY-dev closed 7 years ago

CY-dev commented 7 years ago

Hi,

I have made several changes based on what we need in our use case. Please tell me if you have any questions, and merge them if you find them useful.

4 API changes to the train method of Trainer and its subclasses:

2 internal implementation changes:

JoeriHermans commented 7 years ago

Hi,

Looks very solid, thanks. However, I have some doubts about handling the epochs at the worker level, as I did this before in an initial prototype. Is the unionAll significantly impairing the training procedure?
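
For reference, a minimal hypothetical sketch of what per-worker epoch handling could look like (generic PySpark/Keras, not the prototype or dist-keras code; the column names, optimizer, and the assumption that the feature/label columns already hold NumPy-compatible arrays are all illustrative):

```python
import numpy as np

def make_worker(model_json, weights, num_epoch, batch_size):
    """Return a mapPartitions function that trains num_epoch epochs locally on each partition."""
    def train_partition(rows):
        from keras.models import model_from_json
        model = model_from_json(model_json)
        model.set_weights(weights)
        model.compile(optimizer="adam", loss="categorical_crossentropy")
        data = list(rows)                                   # read the partition once
        x = np.array([row["features"] for row in data])
        y = np.array([row["label"] for row in data])
        model.fit(x, y, epochs=num_epoch, batch_size=batch_size)  # epochs stay local to the worker
        yield model.get_weights()
    return train_partition

# Usage sketch:
# new_weights = training_df.rdd.mapPartitions(make_worker(json_str, init_weights, 5, 32)).collect()
```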

Also, thanks for fixing the docs :)

Joeri

JoeriHermans commented 7 years ago

Any updates? I see that a lot of commits have been pushed in the meantime.

CY-dev commented 7 years ago

Hi Joeri,

The unionAll makes copies of the data. So if the Spark DataFrame initially has 10 partitions, it will have 50 if we want to train 5 epochs using unionAll, which seems inefficient.
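
For illustration, a minimal PySpark sketch of the replication pattern being discussed (not code from this PR; the DataFrame, partition count, and epoch count are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unionall-epochs").getOrCreate()
df = spark.range(1000).repartition(10)    # stand-in for the training DataFrame

num_epoch = 5
replicated = df
for _ in range(num_epoch - 1):
    replicated = replicated.unionAll(df)  # each union appends a full copy (union in newer Spark)

print(replicated.rdd.getNumPartitions())  # 50 partitions
print(replicated.count())                 # 5000 rows, i.e. 5x the data
```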

Thanks, Congrui


mcmathja commented 7 years ago

@JoeriHermans The main issue is that calling cache on a DataFrame forces Spark to materialize its current state in memory before continuing. Because the call to unionAll comes before that, you end up actually copying out the entirety of the duplicated data. That overhead gets quite bad when running a large number of epochs.
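
A minimal sketch of the ordering described above (hypothetical, not the actual dist-keras code; the DataFrame and epoch count are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-after-union").getOrCreate()
df = spark.range(1000).repartition(10)      # stand-in training DataFrame

duplicated = df.unionAll(df).unionAll(df)   # e.g. 3 epochs worth of copies
duplicated.cache()                          # storage level applies to the copies, not the source
duplicated.count()                          # first action materializes 3x the rows in memory
```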

Would love to see this branch merged. 👍

JoeriHermans commented 7 years ago

Hi @mcmathja, that makes sense, I only realized it now :) Thanks. I'm going to review the final code now and merge it once I have no further issues.

JoeriHermans commented 7 years ago

Looks very good guys, thanks a lot! Could you check the final comment I made on the debug messages? Also, did the GPU allocation work?

Also, I did some research on distributed optimization in the meantime which includes some of the optimizers you have been using :) If interested: https://cds.cern.ch/record/2276711

JoeriHermans commented 7 years ago

Merging. Very good job guys!

Joeri