Pointcloud Container Not Working with More than 5 Lidars

NilaySener commented 1 year ago

Checklist

[X] I've read the contribution guidelines.
[X] I've searched other issues and no duplicate issues were found.
[X] I'm convinced that this is not my fault but a bug.

Description

Hi, i'm having a problem with pointcloud_container running all the lidars on my system. I have

 5 velodyne VLP16
 1 velodyne VLS128

If I run all lidars with the use_poincloud_container parameter ture value, one of the lidars does not work, rarely all lidars work.
If I run pointcloud container outside the autoware.launch with the use_poincloud_container parameter ture value, all lidars work fine.
Also, If I run all lidars with use_poincloud_container parameter value false, all lidars work fine. In order to make the situation I mentioned more understandable, I drew the diagram below:

Expected behavior

Running all lidars in the system with containers

Actual behavior

One of the lidars not working when using the container

Steps to reproduce

Running lidars with pointcloud container in a system with more than 5 lidars

Versions

OS: Ubuntu 22.04
ROS 2 : Humble

Possible causes

No response

Additional context

No response

mitsudome-r commented 1 year ago

Further description in Software WG:

When it fails, random lidar fails to publish poitncloud
We still see velodyne_packets, but no pointcloud2 topic
LeoDrive will check if they can see all nodes are launched when there is a failure.

mitsudome-r commented 1 year ago

Possible issue that is relevant to this one: https://github.com/autowarefoundation/autoware.universe/issues/1009

kaancolak commented 1 year ago

I tried to reproduce the problem with pcap file on my local machine. But it works without any problem each time. I will look into the vehicle's computer.

xmfcx commented 1 year ago

@kaancolak are there any updates on this task?

kaancolak commented 1 year ago

I think this issue is not about autoware, probably it related to DDS or ROS2 settings. When I created a basic component container and load simple composable nodes into it, like this ROS2 tutorial :
https://docs.ros.org/en/foxy/Tutorials/Intermediate/Composition.html

ros2 component list command does not return the correct answer. DDS looks properly configured. Maybe a fresh installation could solve the problem.

stale[bot] commented 1 year ago

This pull request has been automatically marked as stale because it has not had recent activity.

VRichardJP commented 12 months ago

I am facing this very issue on a brand new setup with 6 lidars (and the new nebula interface). 1 to 2 random lidar sensors are always missing. The missing lidars simply don't appear in the logs, as if they were never started or stuck while starting. The missing lidars don't show up in ros2 component list.

In the past I have used this trick to solve a similar issue, however this time it seems to have no effect.

I have tried to split the sensors into 2 different containers localization_pointcloud_container and perception_pointcloud_container, containing respectively 2 and 4 sensors. However in the later there is still always 1 component missing.

The problem does not appear if use_pointcloud_container is set to false, so it rather looks like a ROS2 container/component/launch problem (or DDS).

Has anyone come up with a robust solution?

xmfcx commented 12 months ago

Just to see if this issue could be fixed if we used only launch.xml files in the entire launch system, could you test converting the launch structure to a single launch.xml file with this tool:

https://github.com/xmfcx/launch_unifier_ws.git

It will generate the launch file with absolute paths and finalized topic names. So, this is to see if an .xml based launch config could solve the issue.

Also make sure you modify the robot_state_publisher part manually.

xmfcx commented 12 months ago

If this solves the issue, then at least all the component based launch files should be converted to .xml format.

xmfcx commented 12 months ago

I believe, in our vehicles, a bash script launches the perception container first, then another one launches the rest of the Autoware.

Changing the overall launch structure is one of my goals in the Autoware project but it's a lot of work as you might presume.

I've opened up https://github.com/xmfcx/one_launch to replace the current launch structure but it's WIP.

To make my life easier working on this task, I've created the launch_unifier tool otherwise I was going to lose my mind trying to figure out what is launching what in the complex web of launch files.

Also since the workaround to decouple the container from the rest of the launch files exist, no resource is assigned to resolving this task just yet.

VRichardJP commented 12 months ago

Thank you for the very detailed feedback! I'll test that!

VRichardJP commented 12 months ago

@xmfcx I have tested with the generated launch file and... it performs even worse than before! In most cases only 1 driver is loaded (sometimes even 0).

For example, here are 2 attempts with the autoware launch file (I use different colors for each sensor):

Screenshot from 2023-11-08 10-02-18

In both cases there are 3 sensors missing. Yesterday it was typically 2, today it looks like 3 is the lucky number.

Here are 2 other attempts with the generated launch file:

Screenshot from 2023-11-08 10-14-16

Just to be sure, I have tried to split the single launch xml into 2: one for the containers and one for the rest. Unfortunately, it does not change anything.

If the problem is related to some timing issue, I am not surprised it get worse with the single xml, as it should be processed faster.

As I tried to run the ros2 launch command with --debug, I have observed something strange though: the ros2 launch calls the pointcloud_container/_container/load_node service 41 times, but only 20 responses are received:

Screenshot from 2023-11-08 11-03-53

Screenshot from 2023-11-08 11-04-35

So it looks like the load component requests may be dropped randomly when many are sent at the same time...?

xmfcx commented 12 months ago

Wait just to be sure, you

generate a single launch file.
split containers out
run containers first
wait until containers are launched
run the rest
some components fail to load due to some random component load service calls failing

Is this correct?

This is interesting and we should be able to create a simpler reproducable setup with this and send a bug report to launch_ros repo.

I will try to replicate it today outside of the Autoware context and try to fix it in the launch_ros itself If I can.

Edit:

Actually this should be fixed of course in the component container service insread since it will omit some load calls over some queue...

VRichardJP commented 12 months ago

Found the problem... and I feel very stupid about how obvious the cause was.

So, as I pointed out earlier, components are loaded in the container using a dedicated service and some of the service calls seem to be dropped. What is the number 1 reason a message may be dropped with ROS ? => if the message buffer is full!

As you can see below, components are managed by the ComponentManager node, which creates the load_node service with the default QoS profile:

https://github.com/ros2/rclcpp/blob/0f331f90a99da40f7b7a69b4f55b30b7245d294f/rclcpp_components/src/component_manager.cpp#L40

  loadNode_srv_ = create_service<LoadNode>(
    "~/_container/load_node",
    std::bind(&ComponentManager::on_load_node, this, _1, _2, _3));
  unloadNode_srv_ = create_service<UnloadNode>(
    "~/_container/unload_node",
    std::bind(&ComponentManager::on_unload_node, this, _1, _2, _3));
  listNodes_srv_ = create_service<ListNodes>(
    "~/_container/list_nodes",
    std::bind(&ComponentManager::on_list_nodes, this, _1, _2, _3));

The default QoS profile has a depth of 10, so if we try to load 41 components into the container at the same time... yes, a few may be dropped.

I cloned rclcpp_components module into the autoware workspace, added a QoS profile with a queue size of 100, and now it works just perfectly!

Ping @mitsudome-r Although a fix should be proposed upstream, there is a serious issue with the current code base: there is currently no guarantee autoware containers are properly loaded if they are supposed to contain more than 10 components!

xmfcx commented 12 months ago

I've created the following basic launch file:

import launch
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode
from launch_ros.actions import LoadComposableNodes

def generate_launch_description():
    container = ComposableNodeContainer(
            name='my_container',
            package='rclcpp_components',
            executable='component_container',
            namespace='',
            output='log',
    )

    composable_nodes = []
    for i in range(100):

        composable_nodes.append(
            ComposableNode(
                package='composition',
                plugin='composition::Talker',
                name='talker_' + str(i),
            )
        )

    loader = LoadComposableNodes(
        composable_node_descriptions=composable_nodes,
        target_container="my_container",
    )
    return launch.LaunchDescription([container, loader])

This never fails to run any amount of components, even when set to 1000 (although it takes it time).

So maybe the launching nodes should have heavier constructors to replicate this.

Also I've realized that output='log' vs output='screen' creates a huge difference in loading times. On an unrelated note, we might want to switch all the outputs to log for faster loading speeds.

VRichardJP commented 12 months ago

It might be because nebula driver has to load "heavy" shared libraries, while the Ros talker has no dependency. On my machine, pointclouds appear one by one with a few seconds in between. I guess the problem is also more likely to happen if the CPU is stressed. One way to highlight the issue would be to add a sleep inside the load_node callback.

xmfcx commented 12 months ago

Following is the best course of action I can think of:

We fork https://github.com/ros2/rclcpp/tree/humble into this organization
Remove all other packages in the repository other than rclcpp_components
- Or add COLCON_IGNORE, to the unwanted packages
Modify the rclcpp_components package to suit our needs.
Propose the changes to the rolling
Once merged, depending on the complexity and the API/ABI compatibility, propose a backport for humble
Remove the forked repository.

As for how best to add this functionality to the component_manager.cpp, we could do one the following:

Just increase the queue size with rclcpp::QoS(100) to another constant value
Make the Queue size parameterized:
- Add a declared parameter with default value of 10?
- It is possible to run component manager with --ros-args -p queue_size:=100

Maybe there are other methods too but these are the first solutions I could come up with.

What do you think @VRichardJP @mitsudome-r @isamu-takagi @wep21 @esteve ?

esteve commented 12 months ago

@xmfcx I'd prefer not to fork rclcpp_components and instead inherit/wrap the component manager, and at the same time raise this issue upstream. However, it seems to me that all the components are being loaded at the same time, hence the queue being too short. Would it be possible to load the components sequentially instead? Additionally, we could use event handlers to ensure that a component has been fully loaded before loading the next.

xmfcx commented 12 months ago

@esteve Thank you for your suggestions.

I'd prefer not to fork rclcpp_components and instead inherit/wrap the component manager, and at the same time raise this issue upstream.

Wrapping would require us to change the package names / node executables that we use in the launch files right? If so this would require more changes in our end than using a fork.

Overriding a package and keeping the same name
vs Creating a wrapper package and editing all container calls in the launch files

However, it seems to me that all the components are being loaded at the same time, hence the queue being too short. Would it be possible to load the components sequentially instead? Additionally, we could use event handlers to ensure that a component has been fully loaded before loading the next.

Even in its current state, Autoware loads slow already. I feel that adding some sort of event handler to wait for each component to load before sending the next would require 2 ways synchronization and increase this delay. I would prefer increasing the queue size and let the service server handle it as fast as possible. The service calls are not big messages anyway.

Also this would end up being a simpler solution too, just increasing the queue size.

VRichardJP commented 12 months ago

@xmfcx I think in your example the single LoadComposableNodes will load components one by one, while in autoware case, the components are loaded simultaneously from multiple launch files. I can reproduce the issue with the following script:

import launch
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode
from launch_ros.actions import LoadComposableNodes

def generate_launch_description():
    container = ComposableNodeContainer(
            name='my_container',
            package='rclcpp_components',
            executable='component_container',
            namespace='',
            output='log',
    )

    loaders = []
    for i in range(20):

        node = ComposableNode(
            package='composition',
            plugin='composition::Talker',
            name='talker_' + str(i),
        )

        loaders.append(
          LoadComposableNodes(
                composable_node_descriptions=[node],
                target_container="my_container",
            )
        )

    return launch.LaunchDescription([container, *loaders])

With 10 components, no issue:

$ ros2 component list /my_container | wc -l
10

With 20 components, one is often missing:

$ ros2 component list /my_container | wc -l
19

esteve commented 12 months ago

Wrapping would require us to change the package names / node executables that we use in the launch files right? If so this would require more changes in our end than using a fork.

The problem with forks is that we start using them "as a temporary measure", and then eventually we end up with a huge collection of software that we have to maintain ourselves. It's not just this issue, but if rclcpp_components gets updates upstream and fixes, we would have to keep our fork in sync.

Overriding a package and keeping the same name

This does not include the cost of maintaining our own copy and having to synchronize it periodically with upstream.

Creating a wrapper package and editing all container calls in the launch files

This can probably be automated.

VRichardJP commented 12 months ago

@esteve

I am not a github superuser, but wouldn't it possible to have a clone of rclcpp that automatically tracks humble branch and merge latests upstream commits? As @xmfcx proposed, the fork would only need a bunch of COLCON_IGNORE and to modify 3 lines in a file that is modified once every 2 years. So it would be very unlikely we would run into some merge conflict. Once a fix is available the rclcpp fork would be removed.

xmfcx commented 11 months ago

With:

https://github.com/ros2/rclcpp/pull/2380#event-11126891120 merged, this should be solved with the next humble apt upgrade. Thanks for everyone's efforts on this!

autowarefoundation / autoware.universe