Closed olivier-stasse closed 1 year ago
Reminded me of https://github.com/frankaemika/libfranka/issues/120.
Thanks @gavanderhoorn for pointing out this issue. I got the same graphs than frankaemika/libfranka#120 with another tool, and a student from Prague found the same with yet another test.
I suspect that the underlying protocol with the robot is the root of the problem.
Follow up: one of our colleague, and this is aligned with some feedback on other issues, said that the network card we are using may not be handling the UDP packets properly. We may need to have a NIC which is compatible with Ethercat.
I will have a look and provide feedback.
Unless FCI communication is based on ethercat, I'm unsure how "ethercat compatible" NICs (not sure what that means exactly) would help.
IIRC and IIUC, FCI uses a custom UDP protocol.
As far as I understood it this link gives a possible explanation: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-nic-offloads "Offloads should be used on high speed systems that transmit or receive large amounts of data and favor throughput over latency. Because using offloads greatly increases the capacity of the driver queue, latency can become an issue. An example of this would be a system transferring large amounts of data using large packet sizes, but is also running lots of interactive applications. Because interactive applications send small packets at timed intervals there is a very real risk that those packets may become 'trapped' in the buffer while larger packets in front of them are processed, causing unacceptable latency. "
And the implication is "NICs compatible with ethercat" don't use off-loading?
Tuning your network stack could certainly help, provided the other side (ie: the control box) uses a similarly tuned setup of course. That I wouldn't know.
But the general assertion that "ethercat compatible NICs" would be better is not necessarily one you can make, I believe.
Ok guys before you start buying a new networks card: For panda it can be expected to have some missed cycles. This should usually not be a huge problem. But we fixed this issue for our new FR3 robot. Basically if you get 98% success rate in the communication test on a Panda, everything is correct on your side. On the FR3 you should get 100%.
Isn't this more about the fact the timing of network packets doesn't appear to be deterministic (or at least: show a larger jitter than expected)?
Would the FR3 improvements be "back-portable" to the Panda?
Some feedback:
We changed the network cards connected to the robot from a Broadcomm to an Intel. The later is having a better driver than the former under Linux. But this changed nothing in terms of behaviors. (Thus I did not provide the exact NIC references)
We tried the following settings for the NIC offloads:
ethtool -K eth4 tso off
ethtool -K eth4 gro off
ethtool -K eth4 gso off
ethtool -K eth4 lro off
ethtool -K eth4 ufo off
ethtool -K eth4 sg off
with eth4
the device to which the robot is connected. No impact.
@marcbone, we finally changed our controller to take into account the time variation and that works. Good to know that this is fixed on your new robot, but for now we have no plan to buy another one. I will therefore close the issue.
@gavanderhoorn I agree that the "ethercat compatible NICs" term was probably nothing more than a hint from my colleague to have a look to package management and that the "ethercat compatible" part is probably not a necessity.
@gavanderhoorn I am sorry, these improvements will probably never be backported. Even though from a customers view there is not much difference between Panda and FR3, there are lots of changes under the hood, so we cannot easily offer an update.
@olivier-stasse I am glad that you were still able to make it work. I would also be interested in what your application is (if it is not confidential), that this was a problem. It helps us to better understand what you guys need. Also I am always interested on what you work on, especially if you push the robot to its limits :sweat_smile:
@marcbone In this case we want to deploy a framework that plan and execute tasks by exploring a graph of constraints Mirabel, TRO, 2022. We applied it to the humanoid robot TALOS (https://www.youtube.com/watch?v=lWKJoFPZVLY) at 1KHz, the mobile manipulator Tiago (https://www.youtube.com/watch?v=yew7CZMcsyc) at 250 KHz, and a UR-10 (https://peertube.laas.fr/w/hYitAMe96DP9iPuBjSYCc8) at 250 Hz. ros_control is key to lower the transfer from one platform to another. Each of them is using a RT_PREEMPT patch and we did not run into such time differences.
Dear Franka Emika Team, First thanks a lot for having your library open-source, and for providing a ros_control interface. This is very much appreciated.
We tested our control architecture in one Panda robot acquired in 2019. We found some important time variation when being called by the
update
function of thecontroller_manager
in your franka_ros_control node. In average we found a very good 0.001 ms but with numerous times where the time difference was 0.02 ms or 5ms.Reading your code, i did a fork and set the scheduler to SCHED_FIFO to make sure that
franka_control_node
was running in real-time with the max priority but it changed nothing.Could you confirm that this is supposed to be normal and we need to write our controller to take into account this time variation ?
You can find the tests I used here.
It takes a sliding windows of 2000 time diff measurements and average them. In addition every 2000 iterations it detects the min and max over this window. The output is the following:
To avoid any issue the test was done using moveit and the JTC controller to avoid mixing problems with our control architecture.