fkie / multimaster_fkie

ROS stack with FKIE packages for multi-robot (discovering, synchronizing and management GUI)
BSD 3-Clause "New" or "Revised" License
272 stars 106 forks source link

Master_Sync cannot find Master_discovery node #106

Closed jbhoffman613 closed 4 years ago

jbhoffman613 commented 5 years ago

Describe the bug After running the master_discovery node on the fetch robot, we expected the master_sync node to automatically see that master_discovery node is running, but it does not. We are also running the master_discovery node and the master_sync on the workstation and other robots, and the master_sync node discovers the master_discovery node without problem.

To Reproduce We are following the official technical guide to set up the system. Also, when we bring up the master_discovery node other machines, the fetch robots says ROS masters obtained from /master_discovery/list_masters. The master_discovery node on the workstation says the same thing.

When we then start the sync node (as stated in the guide), it says the Master_discovery node appear not to running. Wait for topic with type 'MasterState. It continually throws this error. This does not happen on any other robot/workstation.

Expected behavior We expect the sync node to automatically discover the master_discovery node like the workstation.

Screenshots Screenshot from 2019-06-06 11_51_55

catkin workspace (please complete the following information):

Additional context On the workstation and on the other bots, we noticed that they have the service /master_sync/get_sync_info The fetch robot does not have this service. (I don't know if this is relevant, but I thought I may add it.)

It is related to: https://github.com/fkie/multimaster_fkie

atiderko commented 5 years ago

On startup master_sync node waits for topic /master_discovery/changes with type MasterState is availabe. The publisher of this topic is created by master_discovery. Only if this topic was found the service /master_sync/get_sync_info will be created. At this point it is an expected behavior of master_sync. The question is why master_discovery does not create the expected topic!

Can you check the output of master_discovery, please. master_discovery should also advertises following topics:

$rostopic list | grep master_discovery
/master_discovery/changes
/master_discovery/linkstats

You can also run master_discovery with

rosrun master_discovery_fkie master_discovery _log_level:=DEBUG

to see debug output if needed (Note: start master_discovery twice if you see no debug output after first run)

jbhoffman613 commented 5 years ago

Thank you so much for the response! I have shared the output of the master_discovery and the the topic list. Is there any other debug info I can share with you?

Master discover is advertising these topics (screenshot): image

And here is the output of the master_discovery when we run it: image

(TEXT VERSION): tian@fetch32:~$ rosrun master_discovery_fkie master_discovery _mcast_group:=224.0.0.251 _log_level:=DEBUG [DEBUG] [WallTime: 1559919730.388528] init_node, name[/master_discovery], pid[3208] [DEBUG] [WallTime: 1559919730.388708] binding to 0.0.0.0 0 [DEBUG] [WallTime: 1559919730.388885] bound to 0.0.0.0 50463 [DEBUG] [WallTime: 1559919730.389143] ... service URL is rosrpc://192.168.0.113:50463 [DEBUG] [WallTime: 1559919730.389271] [/master_discovery/get_loggers]: new Service instance [DEBUG] [WallTime: 1559919730.390337] ... service URL is rosrpc://192.168.0.113:50463 [DEBUG] [WallTime: 1559919730.390468] [/master_discovery/set_logger_level]: new Service instance [INFO] [WallTime: 1559919730.394991] ROS Master URI: http://192.168.0.113:11311

[INFO] [WallTime: 1559919730.401620] Robot hosts: [] [INFO] [WallTime: 1559919730.401706] Approx. mininum avg. network load: 1.36 bytes/s [INFO] [WallTime: 1559919730.405354] Start RPC-XML Server at ('0.0.0.0', 11611) [INFO] [WallTime: 1559919730.405604] Subscribe to parameter /roslaunch/uris [INFO] [WallTime: 1559919730.408792] + Bind to specified unicast socket @(192.168.0.113:11511) [DEBUG] [WallTime: 1559919730.408961] Ucast bind to: (192.168.0.113:11511) [DEBUG] [WallTime: 1559919730.409156] mgroup: 224.0.0.251 [DEBUG] [WallTime: 1559919730.409265] interface : 192.168.0.113 [DEBUG] [WallTime: 1559919730.409372] inet: (2, 1, 6, '', ('192.168.0.113', 0)) [INFO] [WallTime: 1559919730.409533] Create multicast socket at ('224.0.0.251', 11511) [DEBUG] [WallTime: 1559919730.410580] node[/master_discovery, http://192.168.0.113:39861/] entering spin(), pid[3208] [DEBUG] [WallTime: 1559919730.559197] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919730.627435] Received a NEW heartbeat from ('localhost', 11511) via MULTICAST socket [INFO] [WallTime: 1559919730.627656] Detected master discovery: http://localhost:11611 [DEBUG] [WallTime: 1559919730.728058] Get additional connection info from http://localhost:11611 [DEBUG] [WallTime: 1559919730.729164] Got [1559919730.510572910, http://fetch32:11311/, fetch32, /master_discovery] from http://192.168.0.113:11611/ [INFO] [WallTime: 1559919730.729554] Added master with ROS_MASTER_URI=http://fetch32:11311/ [DEBUG] [WallTime: 1559919730.729792] ... service URL is rosrpc://192.168.0.113:50463 [DEBUG] [WallTime: 1559919730.729938] [/master_discovery/list_masters]: new Service instance [DEBUG] [WallTime: 1559919730.730995] ... service URL is rosrpc://192.168.0.113:50463 [DEBUG] [WallTime: 1559919730.731236] [/master_discovery/refresh]: new Service instance [DEBUG] [WallTime: 1559919730.741510] Received a NEW heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [INFO] [WallTime: 1559919730.741702] Detected master discovery: http://192.168.0.110:11611 [DEBUG] [WallTime: 1559919730.742142] Received a multicast request for a state update from 192.168.0.110 [DEBUG] [WallTime: 1559919730.742349] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919730.757811] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919730.789484] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919730.842181] Get additional connection info from http://192.168.0.110:11611 [DEBUG] [WallTime: 1559919730.850891] Got [1559919682.869565964, http://192.168.0.110:11311/, 192.168.0.110, /master_discovery] from http://192.168.0.110:11611/ [INFO] [WallTime: 1559919730.851155] Added master with ROS_MASTER_URI=http://192.168.0.110:11311/ [DEBUG] [WallTime: 1559919731.410899] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919731.411220] Send requests while init 1/3 [DEBUG] [WallTime: 1559919731.411503] Skipped send request to mcast group 224.0.0.251:11511. Last send was 0.67sec ago, allowed 1.00 [DEBUG] [WallTime: 1559919731.411731] Set timer to send heartbeat in 50.00 sec [DEBUG] [WallTime: 1559919731.453897] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919731.577225] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919731.618539] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919731.832573] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919747.892098] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919747.951971] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919751.990338] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919752.017024] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919757.104603] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919757.131462] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919779.541315] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919779.597963] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919780.762059] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919780.847693] Received a multicast request for a state update from 192.168.0.110 [DEBUG] [WallTime: 1559919780.848006] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919780.848307] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919780.912002] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919781.412166] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919781.412550] Send requests while init 2/3 [DEBUG] [WallTime: 1559919781.412897] Skipped send request to mcast group 224.0.0.251:11511. Last send was 0.56sec ago, allowed 1.00 [DEBUG] [WallTime: 1559919781.413181] Set timer to send heartbeat in 50.00 sec [DEBUG] [WallTime: 1559919781.476384] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919830.744708] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919831.413637] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919831.413910] Send requests while init 3/3 [DEBUG] [WallTime: 1559919831.414028] Send request to mcast group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919831.414217] Set timer to send heartbeat in 50.00 sec [DEBUG] [WallTime: 1559919831.459658] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919831.459913] Received a multicast request for a state update from localhost [DEBUG] [WallTime: 1559919831.460191] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919831.467630] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919831.484292] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919880.759044] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919881.414789] Send current state to group 224.0.0.251:11511 [DEBUG] [WallTime: 1559919881.415188] Set timer to send heartbeat in 50.00 sec [DEBUG] [WallTime: 1559919881.474006] Received a heartbeat from ('localhost', 11511) via MULTICAST socket [DEBUG] [WallTime: 1559919930.743513] Received a heartbeat from ('192.168.0.110', 11511) via MULTICAST socket

atiderko commented 5 years ago

hmm... perhaps it is something with host resolving in https://github.com/fkie/multimaster_fkie/blob/indigo-devel/master_discovery_fkie/src/master_discovery_fkie/interface_finder.py#L121

As hotfix (1) you can try to comment out and get_hostname(val) == get_hostname(masteruri)

or (2) replace the line https://github.com/fkie/multimaster_fkie/blob/indigo-devel/master_sync_fkie/src/master_sync_fkie/master_sync.py#L71 by topic_names = [/master_discovery/changes]

If you can fix it by hotfix (1) we can try to find the reason why it fails.

jbhoffman613 commented 5 years ago

Thank you! We have done hot fix (1) and it has removed the warning message. We also now see that the fetch robot has the service:

$ rosservice list | grep sync
/master_sync/get_loggers
/master_sync/get_sync_info
/master_sync/set_logger_level

Everything looks good except on other machines we see the INFO message: ROS masters obtained from '/masters_discovery/list_masters': ['fetch', 'machine1', 'machine2']. On the fetch robot, we never see this message in the sync node. Is that message just for aesthetics or does that tell us that there is another problem?

Here is a screenshot. The middle screen is the sync node of the fetch robot: image

atiderko commented 5 years ago

sorry, there is a second line with get_hostname(val) == get_hostname(masteruri) in https://github.com/fkie/multimaster_fkie/blob/indigo-devel/master_discovery_fkie/src/master_discovery_fkie/interface_finder.py#L209

I wonder why ROS Master and ROS node report URI with different hostname. It would be great if you add these two lines

print("nodeuri: %s" % val)
print("masteruri: %s" % masteruri)

before get_hostname(val) == get_hostname(masteruri).

Thank you for help debugging!

tx0619s commented 5 years ago

I have added the print statement in the two spots, and I have commented out get_hostname(val) == get_hostname(masteruri) in https://github.com/fkie/multimaster_fkie/blob/indigo-devel/master_discovery_fkie/src/master_discovery_fkie/interface_finder.py#L209

Here is the new code in /src/master_discovery_fkie/interface_finder.py. The code in lines 117 to 124 is:

# get the URI of the publisher node
for n in l:
  code, msg, val = master.lookupNode(rospy.get_name(), n)
  # only local publisher will be tacked
  print("nodeuri: %s" % val)
  print("masteruri: %s" % masteruri)
  if code == 1: # and get_hostname(val) == get_hostname(masteruri):
    result.append(topic)

The code in lines 206 to 214 is:

# search for a service
for srv, providers in srvs:
  if srv.endswith(name):
    # only local service will be tacked
    code, msg, val = master.lookupService(rospy.get_name(), srv)
    print("nodeuri: %s" % val)
    print("masteruri: %s" % masteruri)
    if code == 1: # and get_hostname(val) == get_hostname(masteruri):
      result.append(srv)](url)

Here is a screenshot of the sync node output: image

I hope this helps you debug it! Thank you!

atiderko commented 5 years ago

I can not reproduce this behavior on my machines. So I added a new parameter check_host to the master branch. I can be a work around for all who uses the latest multimaster version with same problems.