The system hangs whenever a server process is killed, whether on the local machine or on a remote machine: the node where the scheduler is running stops displaying iteration outcomes on the terminal, and no logs are generated. Is this the expected behaviour? Shouldn't it continue to run, with gradients being updated on the backup/replicated server node, as described in the paper?
Here are the steps that I ran (from "parameter_server/example/linear" dir):
Then I killed a server process on one of the nodes, which stops the system. Killing a worker node, by contrast, does not stop it: SGD continues and eventually converges.
Any help in this regard will be highly appreciated.
Thanks, Danish