Closed by CY-dev 7 years ago
When you use the framework in a notebook setting and interrupt the train method, the parameter server is not cleaned up, so its port stays bound. Hence the error. I'll look into it tomorrow to see whether I can make this more robust, because this is indeed an annoying issue.
Joeri
Hi,
I 'fixed' this issue in commit 55ae5e. However, the worker tasks still continue to run on the Spark executors. I'll try to kill the parameter server on the fly when a KeyboardInterrupt occurs, and I'll check whether this kills the executors as well, so that the tasks are cleaned up nicely.
Joeri
https://github.com/cerndb/dist-keras/commit/945bf2e14e915590c4a4f5dd24c13e8b25f81422 should fix it, including cleaning up all the active tasks on the cluster. Please re-open the issue if you still experience the same problem.
Please note that, in a notebook setting, the training task should be executed as follows:
try:
    trained_model = trainer.train(dataset)
except KeyboardInterrupt:
    # Stop the parameter server so that its port is released.
    trainer.parameter_server.stop()
However, we could potentially add a wrapper utility method that does this.
Joeri
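A minimal sketch of what such a wrapper could look like. The name `train_with_cleanup` is hypothetical and not part of dist-keras; the sketch only assumes the `trainer.train(dataset)` and `trainer.parameter_server.stop()` calls shown in the snippet above. The two small classes are stand-ins so the example runs without Spark or dist-keras installed.

```python
def train_with_cleanup(trainer, dataset):
    """Hypothetical helper: train, and stop the parameter server if interrupted."""
    try:
        return trainer.train(dataset)
    except KeyboardInterrupt:
        # Free the parameter server's port so the next run can bind it again.
        trainer.parameter_server.stop()
        # Like the snippet above, the interrupt is swallowed; no model is returned.
        return None


# Stand-ins (assumptions for illustration only, not dist-keras classes).
class _FakeParameterServer(object):
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class _InterruptedTrainer(object):
    """Mimics a trainer whose train() is interrupted from the notebook."""

    def __init__(self):
        self.parameter_server = _FakeParameterServer()

    def train(self, dataset):
        raise KeyboardInterrupt()


trainer = _InterruptedTrainer()
result = train_with_cleanup(trainer, dataset=None)
```

With a real trainer, the call would presumably be `trained_model = train_with_cleanup(trainer, dataset)`. Re-raising the `KeyboardInterrupt` after the cleanup instead of returning `None` is an equally reasonable design if the caller should see the interrupt.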
Sometimes, when training a model with a distributed learner implemented in distkeras, an error of the following form shows up:
Exception in thread Thread-22:
Traceback (most recent call last):
  File "/usr/lib/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/dist-keras/distkeras/trainers.py", line 458, in service
    self.parameter_server.initialize()
  File "/dist-keras/distkeras/parameter_servers.py", line 111, in initialize
    file_descriptor.bind(('0.0.0.0', self.master_port))
  File "/usr/lib/anaconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use
How could I avoid this error?
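Errno 98 means another socket still holds the port that the parameter server tries to bind, for example a parameter server left over from an interrupted run. The sketch below reproduces that situation with plain standard-library sockets and shows a way to check a port before training. The helper `port_is_free` is a hypothetical name and not part of dist-keras, and the listening socket is only a stand-in for a leftover parameter server.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except socket.error as e:
        if e.errno == errno.EADDRINUSE:
            return False
        raise
    finally:
        probe.close()


# Stand-in for a parameter server that was never stopped: a listening
# socket that still holds its port.
leftover = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
leftover.bind(("127.0.0.1", 0))  # port 0 lets the OS pick any free port
leftover.listen(1)
port = leftover.getsockname()[1]

busy_while_open = not port_is_free(port, host="127.0.0.1")

# Closing the socket is what stopping the server amounts to here.
leftover.close()
free_after_close = port_is_free(port, host="127.0.0.1")
```

In a notebook, a check like this on the trainer's master port could indicate whether an earlier parameter server is still running before `train` is called again.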