garlicbulb-puzhuo opened this issue 7 years ago
With Adam or the default SGD optimizer, this issue is gone.
Do we still see the issue even when using Adam or SGD optimizers?
Both types of optimizers produced NaN. We should thoroughly re-investigate the issue. Let's discuss tomorrow.
Problem
elephas launches an HTTP server (using Flask) alongside the Spark driver to receive updated weights from each Spark worker. After the job runs for a while (a few hours to a few days), we start seeing NaN loss values in the per-epoch statistics.
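For context, here is a minimal sketch of this driver-side architecture. It is not elephas's actual code; the route name, payload format, and in-memory `weights` variable are illustrative assumptions.

```python
# Minimal sketch of a driver-side parameter server, assuming workers POST
# pickled weight deltas to an HTTP endpoint. Names are illustrative, not
# taken from elephas.
import pickle
from flask import Flask, request

app = Flask(__name__)
weights = []  # the driver's current model weights (numpy arrays in practice)


@app.route('/update', methods=['POST'])
def update():
    delta = pickle.loads(request.data)  # weight deltas sent by one worker
    # Note: nothing synchronizes concurrent requests here; with a
    # multi-threaded server, two workers can interleave inside this loop.
    for w, d in zip(weights, delta):
        w -= d
    return 'ok'


if __name__ == '__main__':
    # threaded=True lets the built-in server handle several workers at once
    app.run(host='0.0.0.0', port=5000, threaded=True)
```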
Test Setup
For each of the hypotheses below, we run a test with the following setup.
Test0
We also run a baseline test without any changes made.
Hypothesis 1
The embedded HTTP server is multi-threaded. When it receives updates from multiple workers simultaneously, it is possible that a race condition occurs and the weight data gets corrupted.
Test1
To test this, we configure the HTTP server to be single-threaded, as follows.
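A sketch of the Test1 configuration (the exact call site in the elephas code base is an assumption; Flask's built-in server serves one request at a time when `threaded=False`):

```python
# Single-threaded configuration for Test1: with threaded=False (and a single
# process), the built-in Werkzeug server handles one request at a time, so
# worker updates cannot interleave.
app.run(host='0.0.0.0', port=5000, threaded=False)
```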
Hypothesis 2
The elephas driver uses its own optimizer to apply the weight updates sent from each worker; the suggested one is Adagrad.
Test2
To test this, we change the driver-side optimizer from Adagrad to the default one, Adam.
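For reference, this is what the swap looks like at the plain Keras level; the actual change for Test2 is made in the elephas driver's optimizer setting, and the toy model and loss below are placeholders:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(1, input_dim=10, activation='sigmoid')])

# Hypothesis 2 baseline: updates are applied with Adagrad.
# model.compile(optimizer='adagrad', loss='binary_crossentropy')

# Test2: switch to Adam and check whether the per-epoch loss stays finite.
model.compile(optimizer='adam', loss='binary_crossentropy')
```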