cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0

Address already in use #25

Closed: CY-dev closed this issue 7 years ago

CY-dev commented 7 years ago

Sometimes, when training a model with a distributed learner implemented in dist-keras, an error of the following form shows up:

Exception in thread Thread-22:
Traceback (most recent call last):
  File "/usr/lib/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/dist-keras/distkeras/trainers.py", line 458, in service
    self.parameter_server.initialize()
  File "/dist-keras/distkeras/parameter_servers.py", line 111, in initialize
    file_descriptor.bind(('0.0.0.0', self.master_port))
  File "/usr/lib/anaconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use

How could I avoid this error?

JoeriHermans commented 7 years ago

When you are using the framework in a notebook setting and you interrupt the train method, the parameter server is not cleaned up; hence the error. I'll look into it tomorrow to see if I can make this more robust, because this is indeed an annoying issue.
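
For context, here is a minimal reproduction of the underlying failure, independent of dist-keras: while a stale parameter server socket is still listening on a port, a second bind to that port raises the same error. The port number below is purely illustrative.

import socket

# First socket stands in for the stale parameter server still holding the port.
stale = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stale.bind(('0.0.0.0', 5000))  # illustrative port
stale.listen(5)

# A second bind to the same port, as parameter_servers.py attempts,
# fails while the first socket is still alive.
fresh = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
fresh.bind(('0.0.0.0', 5000))  # raises: error: [Errno 98] Address already in use

Note that SO_REUSEADDR would only help with ports stuck in TIME_WAIT; a socket that is still actively listening blocks the bind regardless, so the stale server really has to be stopped.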

Joeri

JoeriHermans commented 7 years ago

Hi,

I 'fixed' this issue in commit 55ae5e. However, the worker tasks still continue running on the Spark executors, so I'll try to kill the parameter server on the fly using a KeyboardInterrupt. I'll check whether this kills the executors as well, so the tasks are cleaned up nicely.
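
In rough pseudocode, the approach described above would look something like this; _run_workers is a hypothetical placeholder for the actual training loop in trainers.py, not the real dist-keras API:

# Sketch only: catch the interrupt inside train() and release the
# parameter server socket before propagating, so the next run can
# bind the same port.
def train(self, dataset):
    try:
        return self._run_workers(dataset)  # hypothetical internal helper
    except KeyboardInterrupt:
        self.parameter_server.stop()
        raise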

Joeri

JoeriHermans commented 7 years ago

https://github.com/cerndb/dist-keras/commit/945bf2e14e915590c4a4f5dd24c13e8b25f81422 should fix it, including cleaning up all the active tasks on the cluster. Please re-open the issue if you still experience the same problem.

Please note that, in a notebook setting, the training task should be executed as follows:

try:
    trained_model = trainer.train(dataset)
except KeyboardInterrupt:
    # Stop the parameter server explicitly so its port is freed.
    trainer.parameter_server.stop()

However, we could potentially add a wrapper utility method that does this automatically.
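
Such a wrapper might look roughly like this (an illustrative sketch, not part of the dist-keras API):

# Illustrative helper: run training and guarantee the parameter server
# is stopped if the user interrupts the cell.
def train_with_cleanup(trainer, dataset):
    try:
        return trainer.train(dataset)
    except KeyboardInterrupt:
        trainer.parameter_server.stop()
        raise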

Joeri