fkie / multimaster_fkie

ROS stack with FKIE packages for multi-robot setups (discovery, synchronization, and a management GUI)
BSD 3-Clause "New" or "Revised" License

Masters bouncing offline/online & resync on reconnect #46

Closed · deng02 closed this issue 7 years ago

deng02 commented 7 years ago

We've been doing some stress testing using the most recent version of multimaster and have identified an issue causing some of our masters to bounce between offline and online (which in turn triggers the resync-on-reconnect feature). After some investigation we believe we understand the cause and want to get some feedback on what we are seeing.

1) The bouncing behavior appears to be related to the code that removes offline masters. When a master has an unanswered heartbeat request, another request is then sent to that master plus all of the others. This has the effect of incrementing every other master's "unanswered count" by +1. Because the callback code that would reset that count to 0 uses the same lock as the removal code, the unanswered count can keep increasing by +1 on each iteration. When there are at least 5 more masters after the one with the unanswered request, a master's count can reach 5 unanswered requests by the time the loop reaches it, and it is set offline. When the removal code exits, the callback runs and sets the master back online (since it wasn't really offline). We see this occur repeatedly. The problem seems to be the request being sent to all other masters, not just the one with an unanswered request. Is there a particular reason this is done? Would you be open to having the request sent only to the master in question and not to the entire group in this case? (A simplified sketch of this feedback loop follows at the end of this comment.)

2) Because the master bounces offline/online, we see the resync-on-reconnect code running frequently, which we believe has caused some of the message drops we've been seeing. We addressed the problem by setting the 'resync_on_reconnect' param to False, and everything is running fine. However, we'd like to understand what we are losing by doing that. What particular problem does that feature address?
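
To make point 1) concrete, here is a minimal, self-contained sketch of the feedback loop we believe we are seeing. The names (Master, remove_offline_hosts, OFFLINE_THRESHOLD) are simplifications for illustration, not the actual master_discovery code:

    OFFLINE_THRESHOLD = 5  # a master is set offline after 5 unanswered requests

    class Master(object):
        def __init__(self, name):
            self.name = name
            self.unanswered = 0  # reset by a reply callback that needs the
                                 # same lock held by remove_offline_hosts

    def remove_offline_hosts(masters):
        # simplified model: the lock is held for the whole loop, so reply
        # callbacks cannot reset any counter until the loop finishes
        for master in masters:
            if master.unanswered > 0:
                # a follow-up request is sent to ALL masters, so every
                # counter grows by one on each iteration
                for other in masters:
                    other.unanswered += 1
            if master.unanswered >= OFFLINE_THRESHOLD:
                print('%s set offline (spuriously)' % master.name)

    masters = [Master('master-%d' % i) for i in range(6)]
    masters[0].unanswered = 1  # a single genuinely missed heartbeat
    remove_offline_hosts(masters)
    # the counters snowball until the masters near the end of the list
    # cross the threshold; once the lock is released, the reply callbacks
    # reset the counters and the masters flip back online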

Thanks.

atiderko commented 7 years ago

Thank you for your feedback!

I have some questions about 1):

  1. Did you use node_manager to start master_discovery? Or
  2. Did you start master_discovery from a launch file with your own parameters? (I ask because node_manager changes the default ~heartbeat_hz parameter of master_discovery. If ~heartbeat_hz is lower than 0.3, no quality calculation is done, and remote masters are therefore set offline only after 5 unanswered requests. Requests are sent:
    • for each host in the ~robot_hosts parameter
    • to all hosts (multicast), three times during initialization
    • to all hosts (multicast) if two or more master_discovery nodes are running on the same host (this should normally not happen)
    • for each host from which no heartbeats have been received within 60 seconds

"The problem seems to be the request being sent to all other masters, not just the one with an unanswered request." - If this occurs in cases other than those described above, it is a bug :-/ ) These four cases are restated in the sketch after this list.

If, however, ~heartbeat_hz is greater than 0.3 (i.e. master_discovery is started by node_manager with default values): change ~timeout_factor to 7, e.g. rosparam set /master_discovery/timeout_factor 7, and test master_discovery again in your scenario, please.
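
For illustration, the four request cases above could be sketched like this (hypothetical function and field names, not the real master_discovery API):

    import time

    def should_send_request(host, robot_hosts, initializing,
                            discoveries_on_this_host):
        # mirrors the four cases listed above (illustrative only)
        return (host.name in robot_hosts                      # ~robot_hosts entry
                or initializing                               # 3x multicast at startup
                or discoveries_on_this_host > 1               # >1 master_discovery on this host
                or time.time() - host.last_heartbeat > 60.0)  # silent for 60 seconds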

Some comments on 2): The Python implementation in ros_comm currently does not support reconnects after timeouts and some other connection problems. If, for example, you have connection problems lasting longer than 2.5 minutes, the connections of Python subscribers will be dropped. The resync_on_reconnect parameter addresses this problem and forces a reconnection (it does not check the duration of the interruption). If you do not have such long connection interruptions, you can set this parameter to False.
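
Conceptually, what resync_on_reconnect does could be sketched like this (hypothetical names; this is not the actual sync code):

    def on_remote_master_state(master, came_back_online, resync_on_reconnect):
        # ros_comm's Python subscribers whose connections stay broken for
        # roughly 2.5 minutes are dropped and never re-established, so a
        # returning master triggers a full re-registration instead of
        # trusting the old connections to recover
        if came_back_online and resync_on_reconnect:
            master.resync()  # hypothetical: re-register the synchronized
                             # topics, forcing fresh connections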

Please let me know whether ~timeout_factor solves the problem. If not, set

rosparam set /master_discovery/log_level DEBUG

restart master_discovery, and upload the log of the master_discovery instance that bounces between offline and online.

Thanks!

deng02 commented 7 years ago

Thanks for the response! The clarification on the "resync on reconnect" helps a great deal.

Re the bouncing: to answer your questions, 1) we're starting master_discovery using our own launch file, and 2) ~heartbeat_hz is actually below 0.3, so no quality calculation is being done. In our scenario, hosts are only being set offline because we exceed 5 unanswered heartbeat requests.

Based on what you said above, I realize I forgot to mention that we are in fact running more than one master_discovery on the same host. For our stress testing we need to simulate a large number of robots, and to do that we spin up N robots per machine, each of which has its own master_discovery node. So we are hitting the case where requests are sent "for all hosts (multicast) if two or more master_discovery are running on the same host".

Further to that, the trigger causing the requests to be sent is that a host has at least one unanswered request. I did not see that in your list above, so maybe this is a bug? It appears in master_discovery.py line 806. Eventually, when this is called enough times in a loop, we hit 5 unanswered requests on multiple hosts and they are set offline.

I've attached the master_discovery log to illustrate this behavior (the lines that say "TRACE" are extra log lines we added to diagnose the issue). If you look at line 47 you can see the beginning of a cycle in "remove_offline_hosts" where each master has 1 unanswered request. Each iteration through the loop sends requests to the master as well as all the others, so the count increases by 1. When we hit dev-vm-ros-01.local_12309 at line 96, it gets set offline.

master_discovery-08.log.gz

1) If multimaster isn't supposed to be sending requests when there is at least 1 unanswered request, then removing this solves our problem.

2) If it is supposed to, then what we're wondering is why, when a specific master has an unanswered heartbeat request or hasn't sent one in the last 60 seconds, heartbeat requests are sent to all masters running on the same host. When sending out regular heartbeat requests it makes sense to do so for all hosts, but when dealing with the unanswered request of a specific master, shouldn't we only talk to that one? Otherwise all the others are penalized with an increment to their unanswered requests and can be set offline.

atiderko commented 7 years ago

Thank you for the log! It helped me reproduce the behaviour. I added a fix to the master branch: it reduces the multicast requests and fixes the request counter. It would be great if you could test this in your scenario!
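
For reference, the intended behaviour after the fix could be sketched like this (hypothetical names; the actual change is in the master branch commit):

    def send_unicast_request(address):
        # placeholder for a heartbeat request sent to a single master
        pass

    def request_silent_masters(masters):
        for master in masters:
            if master.unanswered > 0:
                # the follow-up request goes only to the silent master
                # (no multicast), and only its own counter is incremented
                send_unicast_request(master.address)
                master.unanswered += 1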

Thanks!

deng02 commented 7 years ago

That fixed it: the bouncing no longer occurs in my test scenario with the current master branch. Thank you! If we could get a tag with this commit in it, we'll rev our version.

atiderko commented 7 years ago

Thank you for testing and reporting!

I created release tag 0.5.8.