@eric1221bday The test you are running is using neither UDP nor Shared Memory.
We've been running this for several hours, and the results showed high dispersion: sometimes UDP was better, sometimes Shared Memory was. It was after a while that we realized the test runs in a single process.
Fast RTPS has an intra-process mechanism that short-circuits the transports. We have other tests that measure latency between several processes on the same machine, and those show considerably better results for Shared Memory than for UDP.
@MiguelCompany That's kind of weird then. Something I did that I didn't mention here was running this test with --use_ipc 1, which enables intra-process transport on the ROS end; performance is 10x better in that scenario, which is why I assumed that --use_ipc 0 does not use intra-process transport. In that case I'm not sure where the latency disparity between --use_ipc 1 and --use_ipc 0 is coming from if they're both using intra-process transport, unless the ROS mechanism truly adds as much as a 10x latency overhead.
EDIT: FYI, here are the results when running ros2 run performance_test_factory simple_pub_sub_main --subs 5 --pubs 1 --duration 10 --use_ipc 1 --msg_type stamped8mb for me:
node topic size[b] received[#] late[#] too_late[#] lost[#] mean[us] sd[us] min[us] max[us] freq[hz] duration[s]
node_0 topic_5 8388608 100 0 0 0 110 26 49 355 10 10
node_1 topic_5 8388608 100 0 0 0 120 28 55 391 10 10
node_2 topic_5 8388608 100 0 0 0 124 29 59 401 10 10
node_3 topic_5 8388608 100 0 0 0 128 29 62 405 10 10
node_4 topic_5 8388608 100 0 0 0 132 29 66 409 10 10
System total:
received[#] mean[us] late[#] late[%] too_late[#] too_late[%] lost[#] lost[%]
500 123 0 0 0 0 0 0
EDIT 2: With the results below, it seems like the latency difference is indeed just the overhead of the handoff from ROS to the RMW.
Alright, I ran a couple more experiments, this time separating the runs so the pubs and subs are in different processes. Here are the results.
For each test, ros2 run performance_test_factory publisher_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb is run first in another terminal.
With UDP: ros2 run performance_test_factory subscriber_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb
node topic size[b] received[#] late[#] too_late[#] lost[#] mean[us] sd[us] min[us] max[us] freq[hz] duration[s]
node_0 topic_5 8388608 27 2 25 66 44769 448 44321 45216 10 15
node_1 topic_5 8388608 27 2 25 66 46660 661 45999 47320 10 15
node_2 topic_5 8388608 27 2 25 66 47464 662 46802 48125 10 15
node_3 topic_5 8388608 27 2 25 66 44363 799 43564 45161 10 15
node_4 topic_5 8388608 27 2 25 66 45149 780 44369 45929 10 15
System total:
received[#] mean[us] late[#] late[%] too_late[#] too_late[%] lost[#] lost[%]
135 45681 10 7.407 125 92.59 330 70.97
And with Shared Memory: ros2 run performance_test_factory subscriber_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb
node topic size[b] received[#] late[#] too_late[#] lost[#] mean[us] sd[us] min[us] max[us] freq[hz] duration[s]
node_0 topic_5 8388608 101 101 0 0 8874 1342 6976 18751 10 15
node_1 topic_5 8388608 101 101 0 0 10481 1358 8653 19645 10 15
node_2 topic_5 8388608 101 101 0 0 11262 1331 9367 20374 10 15
node_3 topic_5 8388608 101 101 0 0 9339 1385 7020 17185 10 15
node_4 topic_5 8388608 101 101 0 0 6508 705 5805 12845 10 15
System total:
received[#] mean[us] late[#] late[%] too_late[#] too_late[%] lost[#] lost[%]
505 9293 505 100 0 0 0 0
For good measure, I also ran the test with Eclipse Cyclone DDS, which uses UDP, for comparison:
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 run performance_test_factory subscriber_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb
node topic size[b] received[#] late[#] too_late[#] lost[#] mean[us] sd[us] min[us] max[us] freq[hz] duration[s]
node_0 topic_5 8388608 101 98 3 0 16405 3216 13170 36686 10 15
node_1 topic_5 8388608 101 98 3 0 17203 3275 13924 37441 10 15
node_2 topic_5 8388608 101 98 3 0 17886 3277 14582 38153 10 15
node_3 topic_5 8388608 101 98 3 0 18571 3281 15211 38856 10 15
node_4 topic_5 8388608 101 98 3 0 19266 3295 15843 39557 10 15
System total:
received[#] mean[us] late[#] late[%] too_late[#] too_late[%] lost[#] lost[%]
505 17866 490 97.03 15 2.97 0 0
So it seems I was mistaken: there are indeed large latency improvements from the shared memory implementation, which outperforms Cyclone DDS by a significant margin. The rate of lost messages when using UDP is somewhat concerning, but that is not the point of this issue.
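As an aside I have not tested: an 8 MB sample fragments into thousands of UDP datagrams, so the kernel's default receive-buffer limits are a common cause of exactly this kind of loss; the usual DDS tuning advice (it appears in both the Fast DDS and Cyclone DDS tuning guides) is to raise them, along the lines of:
sudo sysctl -w net.core.rmem_max=2147483647       # max UDP receive buffer; value illustrative
sudo sysctl -w net.core.rmem_default=2147483647   # default UDP receive buffer; value illustrative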
I'll close it out.
Hi @eric1221bday,
I would like us to introduce ourselves and learn the details of your use case. Could you please drop me an email at JaimeMartin@eProsima.com?
Expected Behavior
It is expected that the shared memory implementation offers message latency comparable to UDP for small message sizes, and superior (lower) message latency than UDP for large message sizes.
Current Behavior
As far as can be determined, message transport times over UDP and SHM are either equivalent, or SHM is slightly slower, for both small and large message sizes.
Steps to Reproduce
mkdir -p ros2_profiling_ws/src
cd ros2_profiling_ws
git clone https://github.com/irobot-ros/ros2-performance.git
colcon build
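Presumably the workspace also has to be sourced before the ros2 run invocations below can resolve the packages; assuming a standard colcon layout (this step is not in the original list):
source install/setup.bash   # overlay built by colcon build above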
Put DEFAULT_FASTRTPS_PROFILES.xml in the workspace directory. Execute all of the commands below in the workspace directory so the profile is used:
cd ros2_profiling_ws
Edit DEFAULT_FASTRTPS_PROFILES.xml and make sure the transport_id of the userTransports of the default participant profile is set to shm.
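For reference, a minimal sketch of what such a profile could look like (the profile name is illustrative, not taken from this issue; the SHM transport descriptor and participant layout follow the Fast RTPS XML schema):
<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <!-- Declare a shared-memory transport with transport_id "shm" -->
        <transport_descriptor>
            <transport_id>shm</transport_id>
            <type>SHM</type>
        </transport_descriptor>
    </transport_descriptors>
    <!-- Default participant profile that uses only the SHM transport -->
    <participant profile_name="shm_participant_profile" is_default_profile="true">
        <rtps>
            <userTransports>
                <transport_id>shm</transport_id>
            </userTransports>
            <!-- Disable the builtin UDP transport so only SHM is used -->
            <useBuiltinTransports>false</useBuiltinTransports>
        </rtps>
    </participant>
</profiles>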
ros2 run performance_test_factory simple_pub_sub_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb
System total:
received[#] mean[us] late[#] late[%] too_late[#] too_late[%] lost[#] lost[%]
500 5690 308 61.6 0 0 0 0
node topic size[b] received[#] late[#] too_late[#] lost[#] mean[us] sd[us] min[us] max[us] freq[hz] duration[s]
node_0 topic_5 8388608 100 1 0 0 2881 619 2501 8821 10 10
node_1 topic_5 8388608 100 1 0 0 3869 600 3494 9604 10 10
node_2 topic_5 8388608 100 8 0 0 4853 685 4469 11453 10 10
node_3 topic_5 8388608 100 100 0 0 5695 786 5270 13321 10 10
node_4 topic_5 8388608 100 100 0 0 6610 853 6122 14768 10 10
System total:
received[#] mean[us] late[#] late[%] too_late[#] too_late[%] lost[#] lost[%]
500 4782 210 42 0 0 0 0