frankaemika / libfranka

C++ library for Franka research robots
https://frankaemika.github.io
Apache License 2.0
221 stars 147 forks

Communication Constraints Violation after some time into robot operation #123

Closed: ruturajsambhusvt closed this issue 1 year ago

ruturajsambhusvt commented 1 year ago

Hello, I am running reinforcement learning on the robot, which requires long hours of continuous operation. Some time into a run, I get a communication constraints violation error at random and the robot stops. I confirmed that communication works well according to all the tests mentioned in the documentation. I am attaching the ping results and the output of the libfranka communication test.

--- 172.16.0.2 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 10007ms
rtt min/avg/max/mdev = 0.115/0.169/3.754/0.061 ms

#######################################################
The control loop did not get executed 43 times in the last 10000 milliseconds! (lost 43 robot states)

Control command success rate of 9957 samples:
Max: 1.00
Avg: 0.98
Min: 0.91
#######################################################

I am using a socket to communicate between the Python-based code and the libfranka code. Could you advise me on how to resolve this issue?
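To put the test output in perspective: assuming one robot state per millisecond (a 1 kHz control loop), a 10000 ms window holds 10000 cycles, so 43 lost states leave the 9957 samples the test reports. A minimal sketch of that arithmetic follows; the helper names are hypothetical and not part of libfranka.

```python
def executed_cycles(window_ms, lost_states, rate_hz=1000):
    """Number of control cycles that ran in the window, assuming a fixed loop rate."""
    expected = window_ms * rate_hz // 1000
    return expected - lost_states


def executed_fraction(window_ms, lost_states, rate_hz=1000):
    """Fraction of the expected control cycles that actually ran."""
    expected = window_ms * rate_hz // 1000
    return executed_cycles(window_ms, lost_states, rate_hz) / expected


if __name__ == "__main__":
    # Numbers taken from the communication test output above.
    print(executed_cycles(10000, 43))
    print(f"{executed_fraction(10000, 43):.2%}")
```

With the numbers above this gives 9957 executed cycles, or 99.57% of the window. This is the share of cycles that ran, which is not the same quantity as the reported control command success rate.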

marcbone commented 1 year ago

Your connection looks fine. You will never be able to achieve 100%. In runs lasting hours, communication constraints violation errors probably cannot always be avoided. However, you can reduce the chance of one happening by using the performance governor on your computer. https://frankaemika.github.io/docs/troubleshooting.html#disabling-cpu-frequency-scaling
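To check that the governor change actually took effect on every core, a small script can read the per-CPU `scaling_governor` files that Linux exposes under sysfs. This is a minimal sketch, assuming a Linux machine with the cpufreq subsystem available; the function names `read_governors` and `not_in_performance` are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU information (assumes cpufreq is available).
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_governors(cpu_root=CPU_ROOT):
    """Return {cpu_name: governor} for every CPU that exposes a cpufreq governor."""
    governors = {}
    # cpu[0-9]* matches cpu0, cpu1, ... but skips directories such as cpufreq and cpuidle.
    for gov_file in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def not_in_performance(governors):
    """Return the sorted names of CPUs whose governor is not 'performance'."""
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")


if __name__ == "__main__":
    current = read_governors()
    for cpu, gov in current.items():
        print(f"{cpu}: {gov}")
    offenders = not_in_performance(current)
    if offenders:
        print("Not in performance mode:", ", ".join(offenders))
```

Running it before a long experiment makes it easy to spot a core that has fallen back to another governor, for example after a reboot.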

ruturajsambhusvt commented 1 year ago

Hello, thank you for the response. I set the governor to performance so that the CPU runs at maximum frequency. It typically runs at 3.2 GHz (the maximum is 3.9 GHz); I could not get it any higher. The system is an i7 with 8 cores and 16 GB RAM. The failure is random: sometimes it runs for 6 hours, and sometimes it throws an error within minutes. Is there no solution to this issue?

marcbone commented 1 year ago

I don't normally run a control loop for hours, but I guess a crash after 6 hours can perhaps be attributed to the robot itself (we are sorry). However, a failure within minutes should not happen on a well-configured setup. My guess would be that Python is the limiting factor. I would try switching the experiment to C++ only (make sure to compile in Release mode!). Before you start, you could modify the communication test to run indefinitely to see whether switching to C++ will solve your problem.

Another thing to try is a different network adapter. An easy way is to test some USB 3 to Ethernet dongles. Most of them work quite well, but some are really terrible. You could also try a different computer.

The last thing to try would be to look into the various energy-saving mechanisms of Intel CPUs and disable all of them. IIRC, setting the governor to performance does not disable all energy-saving features. I think there are C-states and P-states where you can tweak some kernel settings (some of them require recompiling the kernel). But don't expect big differences from this. It is also a huge time sink that one could spend weeks on.
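One runtime knob for C-states that does not need a kernel rebuild is the Linux PM QoS device `/dev/cpu_dma_latency`: a process writes a maximum wake-up latency in microseconds as a raw 32-bit integer, and the kernel honors the request for as long as the file stays open. This is a different technique from the kernel settings mentioned above and was not tested in this thread, so treat it as a sketch under those assumptions. The helper name `hold_cpu_latency` is hypothetical, and opening the device usually requires root.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def hold_cpu_latency(max_latency_us=0, device="/dev/cpu_dma_latency"):
    """Request a CPU wake-up latency limit for as long as the block is active.

    The request only lasts while the file descriptor stays open, so the
    descriptor is closed (and the request dropped) when the block exits.
    """
    fd = os.open(device, os.O_WRONLY)  # usually needs root on the real device
    try:
        # The interface expects the value as a raw 32-bit signed integer.
        os.write(fd, struct.pack("i", max_latency_us))
        yield fd
    finally:
        os.close(fd)
```

Because the request is system-wide, a small launcher could wrap the experiment in `with hold_cpu_latency(0):` and start the control process inside the block. Whether this measurably reduces communication errors on a given machine would have to be checked, for instance with the communication test.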

ruturajsambhusvt commented 1 year ago

Hello, thank you for the detailed response. It looks like the CPU was the problem. I am using Arch Linux and switched to the LXDE desktop environment. I also made sure to run only the low-level code from the terminal and not to use the PC for anything else. I have not gotten the error since then.