eProsima / Fast-DDS


Shared memory transport slower than TCP or UDP #1189

Closed oceanusxiv closed 4 years ago

oceanusxiv commented 4 years ago

Expected Behavior

It is expected that the shared memory implementation offers message latency comparable to UDP for small message sizes, and superior (lower) latency than UDP for large message sizes.

Current Behavior

As far as can be determined, message transport times between UDP and SHM are either equivalent, or SHM is slightly slower, for both small and large message sizes.

Steps to Reproduce

  1. Set up the machine for ROS 2: steps here
  2. mkdir -p ros2_profiling_ws/src
  3. cd ros2_profiling_ws
  4. git clone https://github.com/irobot-ros/ros2-performance.git
  5. colcon build
  6. Copy the attached DEFAULT_FASTRTPS_PROFILES.xml into the workspace directory

Execute all of the following in the workspace directory so the profile is used:

  1. cd ros2_profiling_ws
  2. Open DEFAULT_FASTRTPS_PROFILES.xml and make sure the transport_id of the userTransports of the default participant profile is set to shm
  3. ros2 run performance_test_factory simple_pub_sub_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb
  4. Terminal output should give a report for the shared memory run
    
    node           topic          size[b]   received[#]    late[#]   too_late[#]    lost[#]   mean[us]  sd[us]    min[us]   max[us]   freq[hz]  duration[s]
    node_0         topic_5        8388608   100            1         0              0         3218      712       2340      8223      10        10        
    node_1         topic_5        8388608   100            55        0              0         4822      1185      3213      9423      10        10        
    node_2         topic_5        8388608   100            55        0              0         6054      1416      4080      10578     10        10        
    node_3         topic_5        8388608   100            97        0              0         6803      1440      4872      13242     10        10        
    node_4         topic_5        8388608   100            100       0              0         7554      1441      5599      15064     10        10        

    System total:
    received[#]    mean[us]  late[#]   late[%]   too_late[#]    too_late[%]    lost[#]   lost[%]
    500            5690      308       61.6      0              0              0         0

  5. Open `DEFAULT_FASTRTPS_PROFILES.xml` and change the `transport_id` of the `userTransports` of the default participant profile to `udp`
  6. `ros2 run performance_test_factory simple_pub_sub_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb`
  7. Terminal output should give a report for the UDP run

node           topic          size[b]   received[#]    late[#]   too_late[#]    lost[#]   mean[us]  sd[us]    min[us]   max[us]   freq[hz]  duration[s]
node_0         topic_5        8388608   100            1         0              0         2881      619       2501      8821      10        10
node_1         topic_5        8388608   100            1         0              0         3869      600       3494      9604      10        10
node_2         topic_5        8388608   100            8         0              0         4853      685       4469      11453     10        10
node_3         topic_5        8388608   100            100       0              0         5695      786       5270      13321     10        10
node_4         topic_5        8388608   100            100       0              0         6610      853       6122      14768     10        10

System total:
received[#]    mean[us]  late[#]   late[%]   too_late[#]    too_late[%]    lost[#]   lost[%]
500            4782      210       42        0              0              0         0



As can be seen above, the UDP run had a mean delay of 4782 us, whereas the shared memory run had a mean delay of 5690 us, a difference of almost 1000 us, even though shared memory is supposed to _reduce_ message latency.

## System information

- Fast-RTPS version: 2.0.x branch 41aaa74264edc65a1e5a8a68781a331635019d31
- OS: Ubuntu 20.04
- Network interfaces: lo (127.0.0.1)
- ROS2: ROS2 master (Foxy) https://github.com/ros2/ros2/commit/c7c115a7f9d78985903bac05d7dd0d886b808b61

## Additional context

## Additional resources
- Wireshark capture
- XML profiles file [DEFAULT_FASTRTPS_PROFILES.xml.zip](https://github.com/eProsima/Fast-RTPS/files/4554555/DEFAULT_FASTRTPS_PROFILES.xml.zip)
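
For reference, a minimal sketch of what such a profiles file might look like is shown below. This is not the attached file; the element names follow the Fast DDS XML profiles schema, and the profile name and segment size are illustrative assumptions. Switching the participant's `transport_id` between `shm` and `udp` selects the transport under test:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <!-- Shared memory transport; the segment must be large enough for 8 MB samples -->
        <transport_descriptor>
            <transport_id>shm</transport_id>
            <type>SHM</type>
            <segment_size>16777216</segment_size>
        </transport_descriptor>
        <!-- Plain UDPv4 transport used for the comparison runs -->
        <transport_descriptor>
            <transport_id>udp</transport_id>
            <type>UDPv4</type>
        </transport_descriptor>
    </transport_descriptors>
    <participant profile_name="default_participant" is_default_profile="true">
        <rtps>
            <!-- Switch this between shm and udp as described in the steps above -->
            <userTransports>
                <transport_id>shm</transport_id>
            </userTransports>
            <useBuiltinTransports>false</useBuiltinTransports>
        </rtps>
    </participant>
</profiles>
```
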
MiguelCompany commented 4 years ago

@eric1221bday The test you are running is using neither UDP nor Shared Memory.

We've been running this for several hours, and the results showed high dispersion. Sometimes UDP was better, sometimes Shared Memory was better. It was only after a while that we realized the test runs in a single process.

Fast RTPS has an intra-process mechanism that short-circuits the transports. We have other tests that measure latency between several processes on the same machine, and those show considerably better results for Shared Memory than for UDP.
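
As a side note, Fast DDS also exposes a switch for this intra-process short-circuit in the XML library settings, which would force even same-process endpoints to exchange data through the configured transport. A minimal sketch, assuming the `<dds>` root element of recent Fast DDS XML files (the exact placement of `library_settings` may differ on the 2.0.x branch):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <library_settings>
        <!-- OFF routes even same-process traffic through the user transports;
             other accepted values are FULL and USER_DATA_ONLY -->
        <intraprocess_delivery>OFF</intraprocess_delivery>
    </library_settings>
</dds>
```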

oceanusxiv commented 4 years ago

@MiguelCompany That's kind of weird then. Something I did that I didn't mention here was running this test with --use_ipc 1, which enables intra-process transport on the ROS end; performance is 10x better in that scenario, which is why I assumed that --use_ipc 0 does not use intra-process transport. In that case I'm not sure where the latency disparity between --use_ipc 1 and --use_ipc 0 comes from if they're both using intra-process transport, unless the ROS mechanism really adds as much overhead as a 10x latency increase.

EDIT: FYI here are the results when running ros2 run performance_test_factory simple_pub_sub_main --subs 5 --pubs 1 --duration 10 --use_ipc 1 --msg_type stamped8mb for me:

node           topic          size[b]   received[#]    late[#]   too_late[#]    lost[#]   mean[us]  sd[us]    min[us]   max[us]   freq[hz]  duration[s]
node_0         topic_5        8388608   100            0         0              0         110       26        49        355       10        10        
node_1         topic_5        8388608   100            0         0              0         120       28        55        391       10        10        
node_2         topic_5        8388608   100            0         0              0         124       29        59        401       10        10        
node_3         topic_5        8388608   100            0         0              0         128       29        62        405       10        10        
node_4         topic_5        8388608   100            0         0              0         132       29        66        409       10        10        

System total:
received[#]    mean[us]  late[#]   late[%]   too_late[#]    too_late[%]    lost[#]   lost[%]   
500            123       0         0         0              0              0         0         

EDIT 2: With the results below it seems like the latency difference is indeed just the overhead of the handoff from ROS to the RMW.

oceanusxiv commented 4 years ago

Alright, I ran a couple more experiments, this time separating the runs so that the pubs and subs are in different processes. Here are the results.

For each of the tests, ros2 run performance_test_factory publisher_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb is run first in another terminal.

With UDP: ros2 run performance_test_factory subscriber_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb

node           topic          size[b]   received[#]    late[#]   too_late[#]    lost[#]   mean[us]  sd[us]    min[us]   max[us]   freq[hz]  duration[s]
node_0         topic_5        8388608   27             2         25             66        44769     448       44321     45216     10        15        
node_1         topic_5        8388608   27             2         25             66        46660     661       45999     47320     10        15        
node_2         topic_5        8388608   27             2         25             66        47464     662       46802     48125     10        15        
node_3         topic_5        8388608   27             2         25             66        44363     799       43564     45161     10        15        
node_4         topic_5        8388608   27             2         25             66        45149     780       44369     45929     10        15        

System total:
received[#]    mean[us]  late[#]   late[%]   too_late[#]    too_late[%]    lost[#]   lost[%]   
135            45681     10        7.407     125            92.59          330       70.97     

And with shared memory: ros2 run performance_test_factory subscriber_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb

node           topic          size[b]   received[#]    late[#]   too_late[#]    lost[#]   mean[us]  sd[us]    min[us]   max[us]   freq[hz]  duration[s]
node_0         topic_5        8388608   101            101       0              0         8874      1342      6976      18751     10        15        
node_1         topic_5        8388608   101            101       0              0         10481     1358      8653      19645     10        15        
node_2         topic_5        8388608   101            101       0              0         11262     1331      9367      20374     10        15        
node_3         topic_5        8388608   101            101       0              0         9339      1385      7020      17185     10        15        
node_4         topic_5        8388608   101            101       0              0         6508      705       5805      12845     10        15        

System total:
received[#]    mean[us]  late[#]   late[%]   too_late[#]    too_late[%]    lost[#]   lost[%]   
505            9293      505       100       0              0              0         0         

For good measure, I also ran the test with Eclipse Cyclone DDS, which uses UDP, for comparison.

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 run performance_test_factory subscriber_nodes_main --subs 5 --pubs 1 --duration 10 --use_ipc 0 --msg_type stamped8mb

node           topic          size[b]   received[#]    late[#]   too_late[#]    lost[#]   mean[us]  sd[us]    min[us]   max[us]   freq[hz]  duration[s]
node_0         topic_5        8388608   101            98        3              0         16405     3216      13170     36686     10        15        
node_1         topic_5        8388608   101            98        3              0         17203     3275      13924     37441     10        15        
node_2         topic_5        8388608   101            98        3              0         17886     3277      14582     38153     10        15        
node_3         topic_5        8388608   101            98        3              0         18571     3281      15211     38856     10        15        
node_4         topic_5        8388608   101            98        3              0         19266     3295      15843     39557     10        15        

System total:
received[#]    mean[us]  late[#]   late[%]   too_late[#]    too_late[%]    lost[#]   lost[%]   
505            17866     490       97.03     15             2.97           0         0         

So it seems I was mistaken, and there are indeed large latency improvements with the shared memory implementation, which outperforms Cyclone DDS by a significant amount. The message loss rate when using UDP is kind of concerning, but that is not the point of this issue.

I'll close it out.

JaimeMartin commented 4 years ago

Hi @eric1221bday ,

I would like for us to get in touch so I can learn the details of your use case. Could you please drop me an email at JaimeMartin@eProsima.com?