Victorlouisdg commented 6 months ago

Thoughts on multiprocessing (and networking)

I'm creating this issue to collect some thoughts on multiprocessing in general, our options and their pros and cons. Before I commit too much more time to our multiprocessing code, I want to be sure we’re implementing the right solution.

I’ve split this post into several chapters:

The problem with using a single process for everything
The airo-mono philosophy
Options for multiprocessing
Conclusion and Action Points

The problem with using a single process for everything

For some use cases, e.g. retrieving images & point clouds at full resolution and fps (possibly from multiple cameras), servoing at high frequency, recording videos, logging and saving data, it’s very difficult to keep everything running smoothly in a single (Python) process.

Concretely: it's very hard to record videos of your experiments with the same camera that you are using for making robot control decisions, which is a shame.

The `airo-mono` philosophy

Airo-mono has the (implicit) motto: "Keep simple things simple", which has made it a great tool for research and prototyping. In practice, this means keeping everything pip installable (except maybe camera SDKs) and providing the functionality as Python functions, or Python classes with intuitive and standardized interfaces (and a few CLI tools and simple OpenCV “apps”).

The (ideal) standard airo-mono-based project getting-started workflow would be like this:

pip install airo-camera-toolkit
pip install airo-robots

from airo_camera_toolkit import Zed2i
from airo_robots import URRtde

camera = Zed2i()
robot = URRtde()

image = camera.get_rgb_image()
grasp_pose = select_grasp_pose(image)  # example: user provides grasp pose
robot.moveL(grasp_pose).wait()

I think we all agree this has been a great success, and is not something we want to compromise on. So that is important to keep in mind when considering the multiprocess options.

Options for multiprocessing

Multiprocessing (or process-based parallelism) has been around for a long time, and a central topic is inter-process communication. I believe our main options are:

Shared memory
CycloneDDS
ROS2

In the following subsections, I will explain each of these briefly and the pros/cons I believe they have.

Shared memory

Almost all operating systems support the concept of shared memory. Shared memory is simply a part of main memory (RAM) where multiple processes can write/read to (normally, a process has its private part of main memory). Reading and writing to shared memory can thus be very fast.

Python has a built-in package multiprocessing that makes it easy to create blocks of shared memory. You just provide a name and an amount of bytes. Additionally, it integrates with numpy pretty well (however you have to communicate the shape of numpy arrays to receivers yourself over shared memory, which is a bit clunky but it works). This is what I’ve currently used in the MultiprocessRGBPublisher classes.

Pros:

Easy install: no additional installation required as it’s built-in
Fast: we write directly to shared memory with almost no overhead, so this is probably the method with the lowest latency and highest throughput
Flexible: we have full access and can write whatever data/message we want

Cons:

Synchronization: we are responsible for synchronization between writer and reader. For example, I also write a timestamp to shared memory that the reader checks periodically to detect new data.
Data shape: we are responsible for communicating data/array shape (and dtype) to receivers, either by hardcoding or also passing it over shared memory. Maybe using pickle could make this easier.
Clean up: shared memory is not owned by any single process, so even when the process that created it dies, it remains until the PC is rebooted. I’ve had issues where I want to restart my publisher process (e.g. with a larger camera resolution), but the (too small) shared memory files from my previous publisher were not properly cleaned up. This is fixable, but again something we are responsible for.
Single machine: we can only use this to pass data between processes on the same machine that share main memory physically.

The first three cons come down to we have to manage the shared memory ourselves (and maintain that code).

Cyclone DDS

I consider only Cyclone DDS here because as far as I know, it’s the only open-source pip installable DDS ().

DDS is short for Data Distribution Service, and it is a form of inter-process communication. DDS generally also supports passing data between computers connected via a network. For this reason, they often pass data with the IP. CycloneDDS uses UDP by default but can also be configured to use TCP. However, when operating in this mode, throughput is much lower than over shared memory (and likely can’t handle our full camera data streams). To fix this, CycloneDDS also has support for using shared memory, but I’m unsure how easy this is to install and configure.

CycloneDDS Python support seems pretty nice. Defining messages seems to be not much harder than defining a dataclass, see the Github readme for an example (Chatter).

Questions:

Is shared memory support included in the pip install or do we need to build from source?
Pros:
Management: clean up, data shape, types, synchronization etc. are managed by CycloneDSS
Network communication
Pip installable: the base version is pip-installable, but unsure what is included

Python / C++ interoperability Cons:

Custom messages: Still need to define many messages, might be better to use ROS messages

To be honest, I find that CycloneDDS fall into a somewhat undesirable gray zone between doing it ourselves and using ROS2. CycloneDDS is one of the middleware options for ROS2, so maybe if we go this route, we should just bite the bullet use ROS2. It seems silly to me to define custom CycloneDDS messages instead of using many of the existing ROS2 message types.

ROS2

The “old” ROS was in some sense similar to a custom DDS. However, with ROS2 they chose not to implement the communication middleware themselves anymore, but instead rely on several different DDS options. So basically ROS2 has message types that are not specific to any DDS, and it converts these to the message types of the specific DDSs. There are several reasons we have currently opted out of the ROS2 ecosystem.

The first is that system-wide installation seems to be the most supported option (and a sudo apt upgrade can break everyone’s projects). Dependency management for ROS packages in general can also be difficult.
The second problem is that it doesn’t keep simple things simple. Even for creating a simple script, you need a workspace, package, node, etc. that you have to build with Colcon. Then you have to make sure everything you need is launched.

Problem 1 might be solvable, e.g. by running ROS2 in a Docker container. The caveat is that performance will likely not be great. We would probably need to configure the ROS2 DDS to use shared memory, and then also mount the host's shared memory into the Docker container, but that seems doable.

Correction: I’ve just realized that running ROS2 in a Docker container does not solve our problem, as it would require moving our airo-mono Python scripts into the container as well. A better solution might be to explore Robostack, which is a young project that allows installing ROS into conda environments.

Problem 2 is mostly a “dev problem”. If problem 1 can be solved, airo-mono users don’t even need to be aware that ROS2 is being used. For example, an airo-mono user could create a Zed2i(multiprocess=True), which could behind the scenes start a docker container, and run a publisher/receiver that uses the zed_ros_wrapper. Additionally, this could be completely opt-in, e.g. we could raise a RunTimeError if a user enables multiprocess without having Docker installed.

Pros:

Standard message types
Network communication
Python / C++ interoperability
Saving messages to disk
Less reinventing the wheel: opens door to using more ROS2 packages?
Conda installation: if RobotStack works

Cons:

Feasibility: I’m not sure it is possible to use non-system-wide ROS and keep running our Python scripts simply on the host machine (i.e. not in a Docker container).
Experimental: RoboStack is still a very young project and will likely still have issues.
Building: even if we use Robostack, can we (devs) build the required ROS nodes/packages beforehand so that airo-mono users don’t have to?

Conclusion and Action Points

In conclusion, I believe long-term the best solution would be to revisit ROS2, especially if we get it working within conda through RoboStack (paper). However, for the time being, our multiprocessing-based code works well for me and allows me to record videos of my data collection, which is my primary use case for wanting multiprocessing.

Action Points:

Try RoboStack installation
Check ROS2 performance (e.g. can we publish/receive Zed2i at max resolution/fps)
Investigate whether we can prebuild ROS packages for airo-mono users

Victorlouisdg commented 6 months ago

I did some quick testing and the ROS installation through RoboStack went great. Took <5 min and there were no issues, rviz2 worked and the "topic" examples in ros2 examples also worked.

Not all ROS packages are currently supported in RoboStack's conda packages, notably moveit and zed_ros_wraper are missing, but the realsense packages are available. In total 613/1441 packages are supported, I assume they only count packages listed on the ros index.

Given that the installation process seems to be very smooth, the most important remaining issue is performance. The DDS that is provided/default is Fast-DDS. However, it seems to be using UDP for communication (seen in Fast-DDS monitor), even for two processes running on the same computer. This is also probably the reason why I can't publish more than about 1M points smoothly at 10 Hz. Which is about 160 MB/s (each point is 16 bytes in the example). For the full-resolution Zed2i point cloud at 15 fps, we need about 500 MB/s, so it's still quite far off.

Luckily Fast-DDS supports shared memory transport. I hope it's not too difficult to enable that for ROS2. Here are two sources I'm looking into:

Victorlouisdg commented 6 months ago

Enabling shared memory seems fairly simple, I first created this XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">

<!-- Default publisher profile -->
<publisher profile_name="default publisher profile" is_default_profile="true">
    <qos>
    <data_sharing>
        <kind>AUTOMATIC</kind>
    </data_sharing>
    </qos>
    <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>

<!-- Default subscription profile -->
<subscriber profile_name="default subscription profile" is_default_profile="true">
    <qos>
    <data_sharing>
        <kind>AUTOMATIC</kind>
    </data_sharing>
    </qos>
    <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>

And then set these environment variables:

export RMW_FASTRTPS_USE_QOS_FROM_XML=1
export FASTRTPS_DEFAULT_PROFILES_FILE=/home/idlab185/ros2_examples/rclpy/topics/pointcloud_publisher/examples_rclpy_pointcloud_publisher/fastdds_profile.xml

However this did not seem to use shared memory. So I tried forcing it by changing AUTOMATIC to ON. That led to this error:

[DATA_WRITER Error] Data sharing cannot be used with unbounded data types -> Function check_datasharing_compatible

The ROS message my node is trying to publish is:

from sensor_msgs.msg import PointCloud2

It seems like ROS is thus publishing that as an unbounded Fast-DDS data type. I hope that can be configured, or that we can define custom ROS messages that are bounded.

https://github.com/ros2/rclcpp/issues/2201
https://www.apex.ai/post/zero-copy-strings-and-bounded-vectors

Apex.OS* has a specialized transport mechanism that allows the developer to publish messages through shared memory without requiring copies. However, until now, many data types for which this would be most useful, such as large images and point clouds, were incompatible with this zero-copy transport.

Victorlouisdg commented 6 months ago

The docs in rmw_cyclonedds acknowledge this issue as well:

To actually use Shared Memory the talker/listener example needs to be slightly rewritten to use a fixed size data type such as an unsigned integer. Adapting the publisher and subscription to use messages of type std_msgs::msg::Uint32 instead leads to an example which uses Shared Memory to transport the data.

So as far as I know it's a known ROS2 limitation that most of the messages (those with unbounded types) in common_interfaces cannot at the moment be passed over shared memory. However, the fix seems pretty straightforward. We make copies of the messages we need and add an upper bound to the amount of elements.

It's a bit of a pity that we can't make use of the standard interfaces (for now and if we need high performance), and I hope we can still visualize our customized (bounded) point cloud messages in rviz2 etc.

Victorlouisdg commented 6 months ago

After reading this comment I'm afraid that making the types bounded is not sufficient, as ROS still also uses std::vector for these, which means DDSs probably won't be able to pass them over shared memory. So the constraint is even more restrictive: we need to use fixed size data types.

What this means in practice is that we will need custom messages for each camera resolution we want to be able to pass over shared memory with ROS2, e.g:

Image_2208x1242.msg
Image_1920x1080.msg
... References:
sensor_msgs/Image.msg

tlpss commented 6 months ago

huh, I didn't know that. A little annoying we would need to create different messages for each type, but I can live with it.

What is the throughput with shared memory that you can get with FastDDS?

And for the network communication, did you tune the configuration (best effort,max throughput?)

Victorlouisdg commented 6 months ago

Shared memory is RAM memory, so theoretically we could get up to ~40 GB/s on Gorilla. I assume the DDS implementations are quite optimized and won't add too much overhead for large arrays (e.g. images). However we still have to test this in practice and see it we can get this configured.

For the network communication I didn't change any of the default ROS2 or FastDDS settings. A very rough estimate for the throughput I got with UDP was 160 MB/s. For CycloneDDS there is some tuning adivce in this ROS2 How-to guide. For FastDDS (formerly Fast RTPS) I haven't found instructions. Another thing to check is whether we need to explicitly configure the usage of the "loopback interface" when transferring data between processes on the same host. Maybe this is enabled when we set the ROS_LOCALHOST_ONLY environment variable.

Victorlouisdg commented 5 months ago

Concretely what I'm proposing:

We want to keep our simple camera API e.g. that MultiprocessRGBReceiver behaves just like a regular Zed2i with camera.get_rgb_image()
However, we would like not to have to manage the shared memory ourselves (see MultiprocessRGbPublisher). So we would like to replace e.g. the memory copies in our run() function with publish(ros_message)

We can start from this ros2 example: pointcloud_publisher.py

I would create a airo-ros2 sister repo for this and that will replace the multiprocess subpackage of airo-camera-toolkit. In the sister repo we should also document our recommended ROS2 installation method, which is in conda through RoboStack. (However, for the MultiprocessRGBPublisher it should not matter how ROS2 is installed, we just need to be able to import rclpy)

adverley commented 5 months ago

Thanks! Let's evaluate this on (1) complexity for end-user and developers and (2) throughput performance.

m-decoster commented 2 months ago

I may have an easy to install and use alternative for RoboStack (which, unlike Victor-Louis' experience above, I got very annoyed with during the installation process).

I was planning to benchmark a couple of libraries & frameworks for IPC, but after 0MQ worked pretty much out of the box, I will stop here for now and continue to investigate this instead.

0MQ is pip installable (pip install pyzmq) and supports IPC over shared memory (or TCP, or some other protocols, as desired) with a socket-like interface.

The code (below) supports publishing RGB images, depth images, and colored point clouds from one process and subscribing from another. Since the publishing process can be launched as a child process, the code is as easy to use as Victor-Louis' current solution in airo-camera-toolkit.

So far, it looks like I can achieve a throughput of about 600MBps. There's no need to manage shared memory ourselves, since 0MQ does it all for free. Though we do still need to handle serialization ourselves. As long as you just send NumPy arrays, it's easy to do (just use np.load and np.save with a byte array as "file"). Strings are trivial to send, and arbitrary Python objects can be automatically pickled (with a performance overhead, so don't do this for things like point clouds).

The code itself is not very complex either: it's only about 100 lines for published and subscriber. See https://gist.github.com/m-decoster/2eea84ad5fb4d364724af54aca70a1d4

To be continued

Update: without depth images I get a throughput of 1261MB/s which is sufficient to send over point clouds and RGB images at 15 FPS! Possibly this line is the culprit causing get_depth_image to slow things down

tlpss commented 1 month ago

thanks for digging into this @m-decoster!

I'm dumping a few links that were on my ' to read' list on this topic:

Looking forward to your findings!

tlpss commented 1 month ago

Update: without depth images I get a throughput of 1261MB/s which is sufficient to send over point clouds and RGB images at 15 FPS! Possibly this line is the culprit causing get_depth_image to slow things down

Ah I think @Victorlouisdg had also identified this line as a huge performance hit. We since have some tooling to benchmark the code here. I'm surprised that this line is still in the main branch though, maybe @Victorlouisdg remembers what we decided to do

Victorlouisdg commented 1 month ago

Be sure to check out my last comment here about the multiprocess branch. The branch contains the bug fixes and performance improvements I needed for the Cloth Competition. It worked perfectly for the entire competition (many hours of stress testing), allowing me to pass all data from a single ZED2i over shared memory at 2K and 15 fps, while also recording the video of the left RGB view. So apart from some code quality checks, I believe it can just be merged into main. (I see this as our final attempt at managing shared memory ourselves.)

Then for the future, I agree we should look to outsource our multiprocess communication. The speeds of ZeroMQ seem promising, and it seems like you can define data shape/size at runtime (as opposed to compile-time for ros2). So it's definitely worth considering. However, I'm honestly still a fan of exploring the ros2 option first, because it is more standard in the robotics community. As a lab, I think we could save a lot of time if we embraced ros (e.g. also for schunk drivers and navigation), instead of avoiding it at all costs.

airo-ugent / airo-mono