fkie / multimaster_fkie

ROS stack with FKIE packages for multi-robot (discovering, synchronizing and management GUI)
BSD 3-Clause "New" or "Revised" License

Debugging syncing issues #173

Closed by akifh 1 year ago

akifh commented 2 years ago

Hi @atiderko, thank you for your great work on multimaster. I have a problem and would like to ask how to debug it.

I have two machines synced via multimaster (master_discovery and master_sync) in unicast mode. Everything works as expected for a while, but at some point one machine stops receiving messages on a topic from the other machine. This may be caused by a WiFi reconnection (due to poor signal strength). I'm not sure about that, but synchronization is lost.

The machine only resumes receiving messages once we restart the node publishing to the topic. After the restart, messages arrive on the other machine again.

It looks like a reconnection issue to me, but I'm not sure whether it is related to multimaster (master_sync in particular) or to ROS itself.

Can you guide me in debugging this issue?

Thanks for your help, Best

atiderko commented 2 years ago

Hi @akifh, thank you for using multimaster.

I have also noticed that Python nodes/topics do not reconnect. I'm also not sure whether the connection loss appears in the ROS log. To address this issue, master_sync triggers a reconnect if the other host was offline.

To detect short disconnects, you can try setting the master_discovery parameter "heartbeat_hz" to, e.g., 2 Hz: heartbeat_hz:=2

akifh commented 2 years ago

Thanks for the reply.

I ran some more tests. With heartbeat_hz increased to 2, the problem still occurs. The nodes are C++ nodes, so I'm not sure what the problem is related to.

Let me explain the situation better with some naming and further information. On Machine A, I have Node 1; on Machine B, I have Node 2.

The machines are syncing in unicast mode. Node 1 constantly publishes messages to Node 2. At some point, Node 2 reports that it is no longer receiving messages from Node 1.

At this time,

I hope this helps to diagnose the issue, if it is related to multimaster. Best

atiderko commented 2 years ago

Next, I would check whether master_sync does anything before the connection between the topics disappears. Launch each master_sync in a terminal and set the log level to debug. (You can use the _log_level:=DEBUG parameter and start the sync node twice.)

If the connection between the nodes disappears and master_sync shows no activity, then I would look for the problem in ROS; otherwise, you have to look at what master_sync did.
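
The decision rule above can be sketched in a few lines of Python. This is only an illustration, not part of multimaster: the function name `triage`, the 10-second window, and the idea of feeding it timestamps extracted by hand from the master_sync DEBUG output are all assumptions. It also assumes the caller has already filtered out routine log lines, so that each timestamp marks a real action by master_sync (for example, registering or unregistering a topic).

```python
from bisect import bisect_left, bisect_right


def triage(sync_action_times, dropout_time, window_s=10.0):
    """Suggest where to look first after a topic connection disappears.

    sync_action_times: sorted timestamps (seconds) at which master_sync
        logged an action (hypothetical input, collected by hand from
        the DEBUG output).
    dropout_time: timestamp (seconds) at which the subscriber stopped
        receiving messages.
    window_s: how far back before the dropout to look (assumed value).
    """
    # Index range of actions inside [dropout_time - window_s, dropout_time].
    first = bisect_left(sync_action_times, dropout_time - window_s)
    last = bisect_right(sync_action_times, dropout_time)

    if last > first:
        # master_sync did something shortly before the dropout:
        # inspect what it did.
        return "master_sync"
    # master_sync was quiet: look for the problem in ROS itself.
    return "ros"


if __name__ == "__main__":
    # master_sync acted at t=50 s, messages stopped at t=55 s.
    print(triage([1.0, 50.0], dropout_time=55.0))
    # master_sync was last active at t=20 s, long before the dropout.
    print(triage([1.0, 20.0], dropout_time=55.0))
```

The window length is a judgment call. It should be long enough to cover the delay between a master_sync action and the moment the subscriber notices missing messages.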

akifh commented 2 years ago

Thanks for the guidance. I've started to believe that this is a ROS issue, not a multimaster one, but I will check thoroughly. I will report back here once we have run our tests.

akifh commented 1 year ago

It was resolved as an issue in roscpp. I'm closing this. Thanks.