cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0

too much stderr on workers #54

Closed marty90 closed 6 years ago

marty90 commented 6 years ago

When using ADAG, the YARN logs fill up with verbose output. I guess it originates at https://github.com/cerndb/dist-keras/blob/04cf7767e636cf614ea1fdb98753fe79647f81db/distkeras/workers.py#L340

Is it possible to disable it? It is effectively a DoS on our cluster! In 20 minutes I got more than 3 million entries in my logs!

JoeriHermans commented 6 years ago

Pushed a fix in https://github.com/cerndb/dist-keras/commit/06c4e39954d9add3808042212a321febe21857b9. Could you check your YARN logs with the patch applied?

Joeri

marty90 commented 6 years ago

Yes, line 340 was the killer! Thank you! BTW: one could add a "verbose" parameter to control what executors write to standard output. For debugging, that kind of output might be useful, while in production it is harmful. Ciao!
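
A minimal sketch of what such a switch could look like. The `SketchWorker` class, its `verbose` and `stream` arguments, and the `train_on_batch_fn` callable are all hypothetical and are not part of the dist-keras API; the sketch only illustrates gating per-batch output behind a flag while still keeping the metrics in memory.

```python
import sys


class SketchWorker:
    """Hypothetical worker used for illustration; not the dist-keras Worker."""

    def __init__(self, train_on_batch_fn, verbose=False, stream=None):
        # train_on_batch_fn: assumed callable that takes a batch and returns a loss.
        self.train_on_batch_fn = train_on_batch_fn
        self.verbose = verbose
        # Default to stderr, which is where executor output ends up in the logs.
        self.stream = stream if stream is not None else sys.stderr
        self.history = []

    def train(self, batches):
        for iteration, batch in enumerate(batches):
            loss = self.train_on_batch_fn(batch)
            # Metrics are always recorded, regardless of verbosity.
            self.history.append((iteration, loss))
            # Per-batch output is written only when explicitly requested.
            if self.verbose:
                print("iteration %d loss %.4f" % (iteration, loss), file=self.stream)
        return self.history


# Quiet by default: nothing is written to the logs, but the history is kept.
worker = SketchWorker(lambda batch: sum(batch) / len(batch))
worker.train([[1.0, 2.0], [3.0, 5.0]])
```

With `verbose=False` as the default, production jobs stay quiet, and a debugging run can opt in with `verbose=True`.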

JoeriHermans commented 6 years ago

In principle, these metrics are collected in the history object, which you can retrieve with trainer.get_history() :) I'm closing this issue now; feel free to open another one whenever you run into problems.

Joeri
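
Once the metrics are available as a collected history rather than as log lines, they can be summarized on the driver. The sketch below assumes, purely for illustration, that each history record is a dict with `worker_id` and `loss` keys; the actual layout returned by `trainer.get_history()` is not shown in this thread and may differ. The helper name `mean_loss_per_worker` is hypothetical.

```python
from collections import defaultdict


def mean_loss_per_worker(records):
    """Average the 'loss' field per 'worker_id' (assumed record layout)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record["worker_id"]] += record["loss"]
        counts[record["worker_id"]] += 1
    return {worker_id: totals[worker_id] / counts[worker_id] for worker_id in totals}


# Stand-in data; in practice this would come from trainer.get_history(),
# adapted to whatever record layout the library actually returns.
example_history = [
    {"worker_id": 0, "loss": 1.0},
    {"worker_id": 0, "loss": 0.5},
    {"worker_id": 1, "loss": 0.25},
]

summary = mean_loss_per_worker(example_history)
```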

marty90 commented 6 years ago

Cool! Thank you!