SpiNNakerManchester / SpiNNFrontEndCommon

Common support code for user-facing front end systems.
Apache License 2.0

"OSError: [Errno 98] Address already in use" when trying to run a second LiveEventConnection in the same program (after the first was closed properly) #646

Open jofas opened 4 years ago

jofas commented 4 years ago

Hi all,

I recently realized that I can't run my application twice within the same program, because the socket used for the LiveEventConnection apparently isn't closed properly (according to my OS). System resources like sockets are always tricky, I guess.

My program looks something like this:

import spinnaker_graph_front_end as gfe
from spinn_front_end_common.utilities.connections import LiveEventConnection
from spinn_front_end_common.utilities.constants import NOTIFY_PORT
from spinn_front_end_common.utilities.globals_variables import get_simulator
from spinn_utilities.socket_address import SocketAddress

def execute_application_on_spinnaker_with_live_io():
  # setup some graph
  # ...

  gfe.setup(...)

  database_socket = SocketAddress(
    listen_port=22222,
    notify_host_name="127.0.0.1",
    notify_port_no=NOTIFY_PORT # 19999
  )
  get_simulator().add_socket_address(database_socket)  

  conn = LiveEventConnection(...)
  gfe.run(10)
  gfe.stop()
  conn.close()

execute_application_on_spinnaker_with_live_io() # works fine
execute_application_on_spinnaker_with_live_io() # throws error

For the second execution of execute_application_on_spinnaker_with_live_io() I receive the following error (with the stack trace from the real program I tried running):

2020-08-13 16:42:21 ERROR: Shutdown on exception
Traceback (most recent call last):
  File "/home/masterusr/spinnaker/SpiNNMan/spinnman/connections/udp_packet_connections/utils.py", line 70, in bind_socket                                                                                                                    
    sock.bind((str(host), int(port)))
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test/test_inference.py", line 74, in <module>
    test_inference_conv1d()
  File "test/test_inference.py", line 65, in test_inference_conv1d
    p = model.predict(X)
  File "/home/masterusr/src/master_thesis/SpiDNN/spiDNN/model.py", line 58, in predict
    conn = self._setup_predict_live_event_connection(extractor, X, result)
  File "/home/masterusr/src/master_thesis/SpiDNN/spiDNN/model.py", line 191, in _setup_predict_live_event_connection
    machine_vertices=True
  File "/home/masterusr/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/utilities/connections/live_event_connection.py", line 98, in __init__                                                                                           
    local_host=local_host, local_port=local_port)
  File "/home/masterusr/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/utilities/database/database_connection.py", line 62, in __init__                                                                                                
    remote_host=None, remote_port=None)
  File "/home/masterusr/spinnaker/SpiNNMan/spinnman/connections/udp_packet_connections/udp_connection.py", line 68, in __init__                                                                                                              
    bind_socket(self._socket, local_bind_host, local_bind_port)
  File "/home/masterusr/spinnaker/SpiNNMan/spinnman/connections/udp_packet_connections/utils.py", line 74, in bind_socket                                                                                                                    
    host, port, exception)), exception)
  File "<string>", line 3, in raise_from
spinnman.exceptions.SpinnmanIOException: IO Error: Error binding socket to :19999: [Errno 98] Address already in use

I researched a little and found a way to tell the OS that sockets are reused. So I added

sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

to the get_socket() function, but unfortunately it did not work. With this line added I no longer get an error, but the second LiveEventConnection sometimes sends packets and sometimes doesn't.
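For reference, here is a minimal sketch of what SO_REUSEADDR changes for UDP sockets. It uses only the Python standard library, not the SpiNNaker tools, and the function names try_second_bind and rebind_after_close are made up for illustration. It assumes Linux, where Errno 98 is EADDRINUSE. The point is that the error only appears while another socket is still bound to the port, so the option hides a socket that is still open rather than closing it.

```python
import errno
import socket


def try_second_bind(reuse_addr):
    """Bind two UDP sockets to the same port while both are open.

    Returns True if the second bind succeeds, False on EADDRINUSE.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        if reuse_addr:
            for sock in (first, second):
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        first.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
            return False
        return True
    finally:
        first.close()
        second.close()


def rebind_after_close():
    """Bind, close, then bind the same port again without SO_REUSEADDR."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    first.bind(("127.0.0.1", 0))
    port = first.getsockname()[1]
    first.close()  # UDP has no TIME_WAIT, so the port is free right away

    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        second.bind(("127.0.0.1", port))
        return True
    finally:
        second.close()


if __name__ == "__main__":
    print("second bind, no SO_REUSEADDR:", try_second_bind(False))
    print("second bind, SO_REUSEADDR:   ", try_second_bind(True))
    print("rebind after close:          ", rebind_after_close())
```

If the second bind is allowed through SO_REUSEADDR, two sockets share the port and the application does not control which of them receives a given datagram. That would be consistent with packets arriving only some of the time, although I have not confirmed that this is what happens inside the tools.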

Maybe I'm making a mistake setting up the socket in the first place?

Kind regards, Jonas

rowleya commented 3 years ago

There are general issues with sockets sometimes not being closed. A better workaround is to ask the tools to bind to a random port each time.

An example of this is:

import spinnaker_graph_front_end as gfe
from spinn_front_end_common.utilities.connections import LiveEventConnection
from spinn_front_end_common.utilities.globals_variables import get_simulator
from spinn_utilities.socket_address import SocketAddress

def execute_application_on_spinnaker_with_live_io():
  # setup some graph
  # ...

  gfe.setup(...)

  conn = LiveEventConnection(..., local_port=None)

  database_socket = SocketAddress(
    listen_port=22222,
    notify_host_name="127.0.0.1",
    notify_port_no=conn.local_port
  )
  get_simulator().add_socket_address(database_socket)  

  gfe.run(10)
  gfe.stop()
  conn.close()

execute_application_on_spinnaker_with_live_io()
execute_application_on_spinnaker_with_live_io()

This reverses the order: the socket is created first, asking for a random port (local_port=None), and the port it actually bound to (conn.local_port) is then passed to the add_socket_address call. Note that you may also run into trouble with the listen_port, which I think can generally be set to None as well.
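The same "bind first, then read back the port" idea can be sketched with plain standard-library sockets. This is an illustration only: it assumes that local_port=None ends up asking the OS for a free port, which is what binding to port 0 does, and open_udp_on_free_port is a hypothetical helper, not part of the tools.

```python
import socket


def open_udp_on_free_port(host="127.0.0.1"):
    """Open a UDP socket on an OS-chosen free port.

    Returns the socket and the port number it was actually bound to.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    sock, port = open_udp_on_free_port()
    try:
        # The port is only known after the bind, so anything that has to
        # be told about it (here just a print) must come second.
        print("bound to port", port)
    finally:
        sock.close()
```

Because the OS chooses the port, a second run in the same process cannot collide with a socket left open by the first run; it simply gets a different port.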