Closed CY-dev closed 7 years ago
Hi,
Looks very solid, thanks. However, I have some doubts about handling the epochs at the worker level, as I did this before in an initial prototype. Is the unionAll significantly impairing the training procedure?
Also, thanks for fixing the docs :)
Joeri
Any updates? I see that a lot of commits have been pushed in the meantime.
Hi Joeri,
The unionAll is making copies of the data. So if the Spark DataFrame initially has 10 partitions, it will have 50 after preparing 5 epochs of training via unionAll, which seems inefficient.
Thanks, Congrui
@JoeriHermans The main issue is that calling cache on a DataFrame forces Spark to materialize its current state in memory before continuing. Because the call to unionAll comes before that, you end up actually copying out the entirety of the duplicated data. That overhead gets quite bad when running a large number of epochs.
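As a minimal illustration of the duplication described above (plain-Python arithmetic with a hypothetical helper, not dist-keras code): replicating the DataFrame once per epoch via unionAll appends all of its partitions each time, so the materialized data grows linearly with the number of epochs.

```python
def partitions_after_union(num_partitions, num_epoch):
    # Unioning the DataFrame with itself once per epoch means the
    # cached result carries num_epoch full copies of every partition.
    return num_partitions * num_epoch

# The case from the thread: 10 initial partitions, 5 epochs -> 50 partitions.
assert partitions_after_union(10, 5) == 50
```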
Would love to see this branch merged. 👍
Hi @mcmathja That makes sense, I only realized it now :) Thanks. I'm going to review the final code now and merge it when I have no further issues.
Looks very good guys, thanks a lot! Could you check the final comment I've made on the debug-messages? Also, did the GPU allocation work?
Also, I did some research on distributed optimization in the meantime which includes some of the optimizers you have been using :) If interested: https://cds.cern.ch/record/2276711
Merging. Very good job guys!
Joeri
Hi,
I have made several changes based on what we need in our use case. Please tell me if you have any questions, and merge them if you find them useful.
4 API changes to the train method of Trainer and its subclasses:
- added support for multi-output models by passing a list of output column names to the label_col argument
- as a follow-up to multi-output support, also added loss_weights as an argument
- accept a Keras optimizer instance passed to the optimizer argument (using `from keras.optimizers import Optimizer, serialize, deserialize` to handle the object internally)
- in the notebook environment, parameter_server does not seem to be fully cleaned up when trainer.train is interrupted, even with the try-except approach described in issue #25; it still prints the "Address already in use" message. As a workaround, I added master_port as an argument to the train method of the Trainer classes
2 internal implementation changes:
- for num_epoch > 1, instead of copying the data itself, make copies of the partition iterator via the itertools.tee function. This avoids the time-consuming unionAll call and the data copying it entails
- simply coalesce instead of repartition when num_partitions == num_workers, since coalescing does not shuffle the data and requires much less time than repartitioning
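The itertools.tee change above can be sketched roughly as follows (a minimal standalone sketch with a hypothetical helper name, not the actual dist-keras worker code): each epoch gets an independent iterator over the same partition, so the worker can re-read the data without the driver duplicating the DataFrame.

```python
import itertools

def epoch_iterators(partition_iterator, num_epoch):
    # tee returns num_epoch independent iterators over the same
    # underlying stream, so each epoch can traverse the full partition.
    # Note: tee buffers items until all copies have consumed them.
    return itertools.tee(partition_iterator, num_epoch)

# Each "epoch" sees the full partition without copying the source data.
copies = epoch_iterators(iter([1, 2, 3]), 2)
assert [list(c) for c in copies] == [[1, 2, 3], [1, 2, 3]]
```

One caveat worth noting: if the epochs are consumed strictly one after another, tee must buffer the whole partition in memory, so this trades the unionAll copy on the cluster for a per-worker buffer.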