fkie / multimaster_fkie

ROS stack with FKIE packages for multi-robot (discovering, synchronizing and management GUI)
BSD 3-Clause "New" or "Revised" License
268 stars 108 forks source link

Timeout causing sync/update failures. #29

Closed mikeodr closed 8 years ago

mikeodr commented 8 years ago

Hello,

I'm encountering an issue where network latency timeouts appears to be causing the following errors to occur.

[ERROR][1452869534.196188][/master_sync]: SyncThread[dev-vm-nimbus-02-gazebo.local] ERROR: Traceback (most recent call last):
File "/opt/ros/indigo/lib/python2.7/dist-packages/master_sync_fkie/sync_thread.py", line 233, in _request_remote_state
remote_state = remote_monitor.masterInfoFiltered(self._filter.to_list())
File "/usr/lib/python2.7/xmlrpclib.py", line 1233, in _call_
return self._send(self._name, args)
File "/usr/lib/python2.7/xmlrpclib.py", line 1587, in __request
verbose=self.__verbose
File "/usr/lib/python2.7/xmlrpclib.py", line 1273, in request
return self.single_request(host, handler, request_body, verbose)
File "/usr/lib/python2.7/xmlrpclib.py", line 1301, in single_request
self.send_content(h, request_body)
File "/usr/lib/python2.7/xmlrpclib.py", line 1448, in send_content
connection.endheaders(request_body)
File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 797, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 778, in connect
self.timeout, self.source_address)
File "/usr/lib/python2.7/socket.py", line 571, in create_connection
raise err
timeout: timed out

and

[ERROR] [WallTime: 1452806226.279808] [1452806216.276963] SyncThread[dev-vm-nimbus-02-robot2.local] ERROR: Traceback (most recent call last):
File "/opt/ros/indigo/lib/python2.7/dist-packages/master_sync_fkie/sync_thread.py", line 341, in _apply_remote_state
result = own_master_multi()
File "/usr/lib/python2.7/xmlrpclib.py", line 998, in _call_
return MultiCallIterator(self.__server.system.multicall(marshalled_list))
File "/usr/lib/python2.7/xmlrpclib.py", line 1233, in _call_
return self._send(self._name, args)
File "/usr/lib/python2.7/xmlrpclib.py", line 1587, in __request
verbose=self.__verbose
File "/usr/lib/python2.7/xmlrpclib.py", line 1273, in request
return self.single_request(host, handler, request_body, verbose)
File "/usr/lib/python2.7/xmlrpclib.py", line 1301, in single_request
self.send_content(h, request_body)
File "/usr/lib/python2.7/xmlrpclib.py", line 1448, in send_content
connection.endheaders(request_body)
File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 797, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 778, in connect
self.timeout, self.source_address)
File "/usr/lib/python2.7/socket.py", line 571, in create_connection
raise err
error: [Errno 111] Connection refused

After these errors occur it appears as though the syncing and multimaster communication falls over.

I have a suspicion that it is sync_thread.py#L396.

self._update_timer = threading.Timer(random.random() * 2., self._request_remote_state, args=(self._apply_remote_state,))
self._update_timer.start()

The exception bypasses the timer starting again and sync updates are no longer running.

I'm happy to provide a fix, but would like to confirm my suspicions with someone more knowledgeable. Any feedback would be appreciated.

atiderko commented 8 years ago

Hi mikeodr,

you are right, the sync will be canceled at this place. But it has to by retried at the latest after 30 seconds by https://github.com/fkie/multimaster_fkie/blob/master/master_sync_fkie/src/master_sync_fkie/master_sync.py#L149 Does the sync node try to "Update ROS master list..." after Connection refused or Timeout errors?

mikeodr commented 8 years ago

I did not, but will try again. Going to re-test under controlled conditions.

mikeodr commented 8 years ago
[ERROR][1453149383.954695][/master_sync]: SyncThread[cpr000373.local] ERROR: Traceback (most recent call last):
  File "/opt/ros/indigo/lib/python2.7/dist-packages/master_sync_fkie/sync_thread.py", line 233, in _request_remote_state
    remote_state = remote_monitor.masterInfoFiltered(self._filter.to_list())
  File "/usr/lib/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1587, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1301, in single_request
    self.send_content(h, request_body)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1448, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 797, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 778, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
timeout: timed out

[INFO][1453149402.852812][/master_sync]: Update ROS master list...
[INFO][1453149402.865279][/master_sync]: service 'list_masters' found on /master_discovery/list_masters

Yes following a timeout the timer seems to fire again. Should this occur and before the next sync would any current or new ROSTCP connections to already synced topics be effected in a negative way?

mikeodr commented 8 years ago

There seems to be different issues combined in network timeouts.

[WARN][1453151504.876332][/master_discovery]: socket error [http://172.20.9.10:11611]: Traceback (most recent call last):
  File "/opt/ros/indigo/lib/python2.7/dist-packages/master_discovery_fkie/master_discovery.py", line 320, in __retrieve_masterinfo
    timestamp, masteruri, mastername, nodename, monitoruri = remote_monitor.masterContacts()
  File "/usr/lib/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1587, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1301, in single_request
    self.send_content(h, request_body)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1448, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 797, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 778, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 110] Connection timed out

This error is with Master Discovery and not Sync thread. When this error occurred rostopic list shows that and synced topics disappeared.

atiderko commented 8 years ago
  1. Timeouts in master_sync: The timeouts in already synced topics are handled by ROS itself. After master_syncreceived topic information from remote host the topics are registered again if these topics were removed in meantime by ROS
  2. Timeout in master_discovery: This method is called after a remote master was discovered by multicast or unicast heartbeat messages AND BEFORE the discovered master is announced in ROS network. So no topics synchronization to the remote host should happen before if this timeout occurs. But master_discovery retries to get info after this timeout.

To clarify this issue "When this error occurred rostopic list shows that and synced topics disappeared." I need more log/scenario info

mikeodr commented 8 years ago

Sorry for leaving this open so long, these error prints were a bit misleading and just benign.