RethinkRobotics / baxter

Baxter Research Robot SDK
www.rethinkrobotics.com/sdk
BSD 3-Clause "New" or "Revised" License

Network issues with the Baxter Research Robot #177

Open alecive opened 7 years ago

alecive commented 7 years ago

Hello, for the past three days we have been experiencing some difficult-to-reproduce, difficult-to-track network issues with our Baxter.

What happens in practice is that after a while some topics stop being published, and the robot stops responding to even the simplest command (e.g. tuck). Rebooting both the Baxter and the machine connected to it does not help much: after 5 minutes the issue reappears.

After hours of trying to understand the cause, we discovered that the issue might be related to rosmaster not closing sockets properly and leaving them in the CLOSE_WAIT state. When this happens, rosmaster also starts to use more than 100% of one core on the Baxter machine, which seems strange.

The issue seems to be well known in the ROS community, but although I have seen many reports about it, I don't really understand how to fix the problem.

Some useful links:

Please tell me if I can help you debug it in any way.

@rethink-imcmahon sorry for tagging you directly, but I saw your comment in one of the issues I linked above, and I thought you might have a quick way to debug the issue.

IanTheEngineer commented 7 years ago

No worries on tagging me. I have not dug into any of the issues you've called out here, but I have helped many, many Baxter customers properly network their computers to their robots. I would first make sure the networking layer is rock solid - make sure to follow our Networking tutorial, with key points being:

alecive commented 7 years ago

Thank you @IanTheEngineer for being so quick in your reply. So what we basically had before was a direct connection between the development workstation and the Baxter robot through the second network card the workstation is equipped with.

We proceeded to unplug the Baxter from the workstation and plug it into the router, changed the baxter.sh parameters, and rebooted both the Baxter and the workstation.

It seems that the issue has disappeared, at least from our quick testing. The problem we have now is that the incoming network speed is capped at 11MB/s, which renders most of our code unusable. When we were connected directly to the robot, the speed was always higher than 50MB/s. While I am aware that a better/faster/newer router would increase the network speed in such a configuration, this does not explain (at least to me) the cause of the problem, or why it appeared only now. Where do you think the issue originates? From here, it seems that rebooting the machines helps anyway because it clears the sockets stuck in CLOSE_WAIT state.

Anyway, we will try to keep debugging the issue in both configurations (with or without the router in between the two machines), and we'll let you know. It takes time to reproduce the issue and I am still not sure if the "router fix" helped or not.

alecive commented 7 years ago

Further investigation: we went back to the "direct connection" configuration.

Again, rosmaster CPU usage climbs back to more than 100%, and the issue shows up again even though the number of sockets in CLOSE_WAIT state seems low (about 10). We do see a large number of sockets in TIME_WAIT state, though.

After we close our launch files, rosmaster CPU usage stays high for ~5 minutes, until all these sockets leave the TIME_WAIT state. When this happens, rosmaster usage goes back down to 0.3%, and we regain control over the network and the Baxter.

alecive commented 7 years ago

The number of sockets in TIME_WAIT state keeps increasing over time after we start our launch file. After 30 seconds of use, it fluctuates around 3000 and stays there. Once we close the launch files, the count drops gradually until it reaches 0 after ~5 minutes.
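
For anyone who wants to track these counts over time, here is a minimal sketch (not part of the original report) that tallies TCP socket states on Linux by parsing /proc/net/tcp. The function names are hypothetical, and the default path only covers IPv4 sockets; /proc/net/tcp6 uses the same column layout for IPv6.

```python
from collections import Counter

# Linux kernel TCP state codes as they appear in the "st" column
# of /proc/net/tcp (hexadecimal, two digits).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        # Fields: slot, local address, remote address, state, ...
        state = TCP_STATES.get(fields[3].upper(), "UNKNOWN")
        counts[state] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp"):
    """Read the kernel table at `path` and return per-state counts."""
    with open(path) as f:
        return count_tcp_states(f.read())
```

Calling read_tcp_states() every few seconds and printing the "TIME_WAIT" and "CLOSE_WAIT" entries on the machine running rosmaster should show whether the counts plateau or keep growing.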

IanTheEngineer commented 7 years ago

It is entirely possible that this CLOSE_WAIT issue is affecting Baxter's roscore. This is really useful debugging info you're collecting here, and I'd recommend adding it to the ticket you've linked so that the ros_comm maintainers have more context for the bug. In the meantime, I'd recommend getting a solid router for around $50 to mitigate the issue.

alecive commented 7 years ago

We'll do that. In the meantime, I am wondering whether it's worth upgrading the whole system to kinetic. ROS support for older versions does not seem that great, and an upgrade might help.

I am following the issue here: when do you think the QA team will be able to test the Baxter with kinetic? Is there an ETA for that? I would like to stick with the official channels for the baxter robot.

alecive commented 7 years ago

@IanTheEngineer do you have any suggestions about the best router we could buy to satisfy our bandwidth hunger? I can obviously look for a router myself, but maybe Rethink has a list of suggested/recommended hardware in this regard.

alecive commented 7 years ago

@IanTheEngineer we finally bought a new router for our setup.

After some quick testing, the maximum read/write speeds allowed by our system now hover around 100MB/s, which is much more than we need and, importantly, much better than the 11MB/s allowed by our previous router (the bottleneck is now probably the hard drive).
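
As a quick sanity check on these numbers (a sketch, not a measurement from this thread): converting MB/s to Mbit/s shows that 11MB/s is 88 Mbit/s, close to what a 100 Mbit/s link can carry, while 100MB/s is 800 Mbit/s, which fits a gigabit link. That the old router had 100 Mbit/s ports is only an assumption here; the helper names are hypothetical.

```python
def mbytes_to_mbits(mbytes_per_s):
    """Convert a throughput in MB/s to Mbit/s (1 byte = 8 bits)."""
    return mbytes_per_s * 8


def link_utilisation(mbytes_per_s, link_mbits):
    """Fraction of a link's nominal capacity used by the given throughput."""
    return mbytes_to_mbits(mbytes_per_s) / link_mbits


# Throughputs reported in the thread, compared with nominal link speeds.
# Which link speed each router actually had is an assumption.
for label, rate, link in [("old router", 11, 100), ("new router", 100, 1000)]:
    print(label, mbytes_to_mbits(rate), "Mbit/s,",
          round(100 * link_utilisation(rate, link)), "% of", link, "Mbit/s")
```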

Above all, the problem seems to be gone now, so I am going to close the issue. We will keep testing the new setup in the following weeks, and we'll re-open this issue if needed.

For future reference, here is a link for purchasing the exact model we bought: https://www.amazon.com/dp/B00QGOQ2BA/ref=psdc_300189_t1_B00HEX851C?th=1

Thank you for the support! Cc @omangin

alecive commented 7 years ago

@IanTheEngineer reopening because what we believe are network issues are still present. We replaced the router, I fixed a bug in ros_comm that was causing some issues (see here), and now I no longer get any ROS error that could help me understand what is going on.

The behavior we have right now is that everything is fine, until at some point one of the following happens:

The only way I have to fix them is to reboot the robot altogether. FYI:

[baxter - http://baxter.local:11311] scazlab@baxterserver:~/ros_devel_ws$ rosparam get /rethink/software_version 
1.2.0.57
[baxter - http://baxter.local:11311] scazlab@baxterserver:~/ros_baxter_ws/src/baxter (master)$ git describe
v1.2.0

Also, I just reinstalled the Baxter workstation from scratch with indigo, and my network setup is the recommended one, i.e. the one shown in the attached image.