dmlc / parameter_server

moved to https://github.com/dmlc/ps-lite
Apache License 2.0

System hangs when a server is killed #32

Open DanishKhan14 opened 7 years ago

DanishKhan14 commented 7 years ago

The system hangs whenever a server process is killed, whether on the local machine or on a remote one: the node running the scheduler stops printing iteration results to the terminal, and no logs are generated. Is this the expected behaviour? Shouldn't it continue to run, with gradients being updated on the backup/replicated server node as described in the paper?

Here are the steps that I ran (from "parameter_server/example/linear" dir):

../../script/ps.sh start -nw 4 -ns 3 -hostfile hostfile ../../build/linear -app_file ctr/online_l1lr.conf -num_replicas 2 -report_interval 1

Then I killed a server process on one of the nodes, which stops the system. Killing a worker node, by contrast, does not: SGD continues and eventually converges.
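To tell a real hang apart from slow progress, a small watchdog around the launch command can help. The following is only a sketch in Python, not part of parameter_server: the function name run_until_stalled and its stall_seconds parameter are made up for illustration. It assumes the per-iteration reports reach the standard output of the process it launches, which may not hold if the launcher backgrounds its children or starts them over ssh.

```python
import queue
import subprocess
import sys
import threading


def run_until_stalled(cmd, stall_seconds):
    """Run cmd and echo its output.

    Returns "stalled" if no output line arrives for stall_seconds,
    or "exited" if the process closes its output and terminates.
    (Hypothetical helper, not part of parameter_server.)
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so log lines also count as progress
        text=True,
    )
    lines = queue.Queue()

    def pump():
        # Forward every output line to the queue from a background thread,
        # so the main loop can wait on it with a timeout.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: the output stream was closed

    threading.Thread(target=pump, daemon=True).start()

    while True:
        try:
            line = lines.get(timeout=stall_seconds)
        except queue.Empty:
            # Nothing was printed within the window: treat it as a hang.
            proc.kill()
            proc.wait()
            return "stalled"
        if line is None:
            proc.wait()
            return "exited"
        sys.stdout.write(line)
```

With -report_interval 1 as in the command above, passing that command as an argument list and a stall_seconds value well above the report interval should return "stalled" shortly after the server is killed, if the hang is real.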

Any help in this regard will be highly appreciated.

Thanks, Danish