garlicbulb-puzhuo opened this issue 7 years ago
With Adam or the default SGD optimizer, this issue is gone.
Do we still see the issue even when using Adam or SGD optimizers?
Both types of optimizers produced NaN. We should thoroughly re-investigate the issue. Let's discuss tomorrow.
Problem
elephas launches an HTTP server (using Flask) alongside the Spark driver to receive updated weights from each Spark worker. After the job runs for a while (a few hours to a few days), we start seeing NaN loss values in the per-epoch statistics.
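For context, here is a minimal sketch of this driver-side architecture. It is not elephas's actual code; the route name, payload format, and in-memory `weights` variable are illustrative assumptions.

```python
# Minimal sketch of a driver-side parameter server, assuming workers POST
# pickled weight deltas to an HTTP endpoint. Names are illustrative, not
# taken from elephas.
import pickle
from flask import Flask, request

app = Flask(__name__)
weights = []  # the driver's current model weights (numpy arrays in practice)


@app.route('/update', methods=['POST'])
def update():
    delta = pickle.loads(request.data)  # weight deltas sent by one worker
    # Note: nothing synchronizes concurrent requests here; with a
    # multi-threaded server, two workers can interleave inside this loop.
    for w, d in zip(weights, delta):
        w -= d
    return 'ok'


if __name__ == '__main__':
    # threaded=True lets the built-in server handle several workers at once
    app.run(host='0.0.0.0', port=5000, threaded=True)
```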
Test Setup
For each of the hypotheses below, we run a test with the following setup.
Test0
We also run a baseline test without any changes made.
Hypothesis 1
The embedded HTTP server is multi-threaded. When it receives updates from multiple workers simultaneously, it is possible that a race condition occurs and the weight data gets corrupted.
Test1
To test this, we configure the HTTP server to be single-threaded, as follows.
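A sketch of the Test1 configuration (the exact call site in the elephas code base is an assumption; Flask's built-in server serves one request at a time when `threaded=False`):

```python
# Single-threaded configuration for Test1: with threaded=False (and a single
# process), the built-in Werkzeug server handles one request at a time, so
# worker updates cannot interleave.
app.run(host='0.0.0.0', port=5000, threaded=False)
```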
Hypothesis 2
The elephas driver uses its own optimizer to apply the weight updates sent from each worker; the suggested one is Adagrad.
Test2
To test this, we change the driver-side optimizer from Adagrad to the default one, Adam.
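For reference, this is what the swap looks like at the plain Keras level; the actual change for Test2 is made in the elephas driver's optimizer setting, and the toy model and loss below are placeholders:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(1, input_dim=10, activation='sigmoid')])

# Hypothesis 2 baseline: updates are applied with Adagrad.
# model.compile(optimizer='adagrad', loss='binary_crossentropy')

# Test2: switch to Adam and check whether the per-epoch loss stays finite.
model.compile(optimizer='adam', loss='binary_crossentropy')
```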