fkie / multimaster_fkie

ROS stack with FKIE packages for multi-robot (discovering, synchronizing and management GUI)
BSD 3-Clause "New" or "Revised" License
268 stars 108 forks source link

Detected master does not get added or synchronized #50

Closed deng02 closed 7 years ago

deng02 commented 7 years ago

We are running into an issue where a ROS master will get detected by another master but does not get added (or synchronized).

For example, sometimes we get the following in our master discovery logs (for example purposes, we'll say this is robot19): ` [rosout][INFO] 2016-10-14 14:10:44,147: Detected master discovery: http://192.168.66.20:11611

[rosout][INFO] 2016-10-14 14:10:44,159: Detected master discovery: http://192.168.66.21:11611

[rosout][INFO] 2016-10-14 14:10:44,161: Detected master discovery: http://192.168.66.22:11611

[rosout][INFO] 2016-10-14 14:10:44,162: Detected master discovery: http://192.168.66.23:11611

[rosout][INFO] 2016-10-14 14:10:44,433: Added master with ROS_MASTER_URI=http://robot21.local:11311/

[rosout][INFO] 2016-10-14 14:10:44,439: Added master with ROS_MASTER_URI=http://robot22.local:11311/

[rosout][INFO] 2016-10-14 14:10:44,439: Added master with ROS_MASTER_URI=http://robot23.local:11311/ ` As you can see robot20 (192.168.66.20) was detected but never added. Its topics are never pulled in and it doesn't shows up in the /master_discovery/list_masters output.

I believe I know what the issue is but this is just a theory based on code analysis.

When a remote ROS master is detected and a new DiscoveredMaster object is created, it will run DiscoveredMaster::__retrieve_masterinfo() which in turns will execute the monitor URI XML RPC call masterContacts, which in turn calls MasterMonitor::getMasterContacts(). If this happens to be run after the master monitor has started but before it has actually set it's state (self.__master_state), the master URI, name, discoverer name and monitor URI will be returned with a timestamp of 0. As a result, the code that sets the DiscoveredMaster mastername and masteruri isn't executed and they remain None (master_discovery.py:341-370). It also doesn't cause the "Added" message to be logged which explains why we see "Detected" but not "Added". Furthermore, since neither the mastername nor the masteruri is set, it is never included in the /master_discovery/list_masters output (master_discovery.py:1045), and when a heartbeat is added for that remote ROS master it never generates a state change (master_discovery.py:174), which also means it never gets synchronized. The end result is the remote ROS master is invisible.

Unfortunately we haven't been running logging at DEBUG level, so I don't have the output of that XML RPC call to confirm this theory. I will look at changing that but in the meantime, would like to get your feedback on whether this seems like the likely culprit.

If it is then I believe the solution is very simple: retry the __retrieve_master_info() function when the timestamp is 0 on the assumption that on the next go, the remote master monitor will have completed it's state update.

atiderko commented 7 years ago

Hi,

thank you for detailed description of the possible problem. I added some warnings and also reconnects. I hope it solves the problem or shows why the master was not added after MasterMonitor::getMasterContacts()

regards

deng02 commented 7 years ago

Awesome - I can confirm that fixed the issue. Thank you!

FYI I was able to repro by hacking master_discovery.py files on one of my robots so that the check ROS timer started after the heartbeat i.e. master_discovery.py:579 was changed to "self._timer_ros_changes = threading.Timer(2.0, self.checkROSMaster_loop)". Starting a ROS master with unchanged code, and then starting the ROS master with the hacked code triggered your fix.

atiderko commented 7 years ago

Thank you for testing of this fix and also for the hint with self._timer_ros_changes = threading.Timer(2.0, self.checkROSMaster_loop)!

I let this issue still open because it seems to be a problem on start of the master_discovery. It should not report a zero timestamp or send heartbeat while no local MasterState is available.

deng02 commented 7 years ago

No problem. Would it be possible to get a tag that includes this fix so we can get that into our build?

Admittedly my repro is a bit fake. It may be possible that the UDP heartbeat is arriving at a ROS master A from an earlier execution of ROS master B, rather than it's sending a heartbeat before the state is ready. In the rare times I have managed to catch this situation, it is usually couple with a network outage. I could potentially see a situation where ROS master B is running at 1:00:00, sends a heartbeat, at 1:00:01 terminates due to some other issue, then starts up again at 1:00:02. If the heartbeat at 1:00:01 happens to reach ROS master A at 1:00:2 and ROS master B XML RPC server is up but checkState hasn't run yet, it might be the case this is what is triggering the query before MasterState is ready. But that's theory. If I happen to find evidence of this happening again and can gather more information I will let you know.

deng02 commented 7 years ago

Nevermind what I said above! I realize now that what you're saying is my repro shouldn't have worked.

atiderko commented 7 years ago

ok, I added a new tag 0.6.0-1

deng02 commented 7 years ago

So I figured out what is triggering the heartbeat when no master state is available. In our setup the HEARTBEAT_HZ is 0.02 and as a result, there are 3 requests sent on startup. Those requests do not rely on master state so they are being sent, which in turn are resulting in the detection on the other end.

deng02 commented 7 years ago

Can we also get a version on the multimaster_fkie-release in indigo? I realize this is a minor but this bug is proving to be a blocker for us when we hit it.

atiderko commented 7 years ago

thank you for the debugging! I added a new fix; the requests are now sent and replied only if own state is available. In my tests it works and I hope in yours, too.

I am also created a new version 0.6.1