Closed NilaySener closed 11 months ago
Further description in Software WG:
Possible issue that is relevant to this one: https://github.com/autowarefoundation/autoware.universe/issues/1009
I tried to reproduce the problem with pcap file on my local machine. But it works without any problem each time. I will look into the vehicle's computer.
@kaancolak are there any updates on this task?
I think this issue is not about autoware, probably it related to DDS or ROS2 settings. When I created a basic component container and load simple composable nodes into it, like this ROS2 tutorial :
https://docs.ros.org/en/foxy/Tutorials/Intermediate/Composition.html
ros2 component list
command does not return the correct answer. DDS looks properly configured. Maybe a fresh installation could solve the problem.
This pull request has been automatically marked as stale because it has not had recent activity.
I am facing this very issue on a brand new setup with 6 lidars (and the new nebula interface). 1 to 2 random lidar sensors are always missing. The missing lidars simply don't appear in the logs, as if they were never started or stuck while starting. The missing lidars don't show up in ros2 component list
.
In the past I have used this trick to solve a similar issue, however this time it seems to have no effect.
I have tried to split the sensors into 2 different containers localization_pointcloud_container
and perception_pointcloud_container
, containing respectively 2 and 4 sensors. However in the later there is still always 1 component missing.
The problem does not appear if use_pointcloud_container
is set to false
, so it rather looks like a ROS2 container/component/launch problem (or DDS).
Has anyone come up with a robust solution?
Just to see if this issue could be fixed if we used only launch.xml
files in the entire launch system, could you test converting the launch structure to a single launch.xml
file with this tool:
It will generate the launch file with absolute paths and finalized topic names. So, this is to see if an .xml
based launch config could solve the issue.
Also make sure you modify the robot_state_publisher part manually.
If this solves the issue, then at least all the component based launch files should be converted to .xml
format.
I believe, in our vehicles, a bash script launches the perception container first, then another one launches the rest of the Autoware.
Changing the overall launch structure is one of my goals in the Autoware project but it's a lot of work as you might presume.
I've opened up https://github.com/xmfcx/one_launch to replace the current launch structure but it's WIP.
To make my life easier working on this task, I've created the launch_unifier tool otherwise I was going to lose my mind trying to figure out what is launching what in the complex web of launch files.
Also since the workaround to decouple the container from the rest of the launch files exist, no resource is assigned to resolving this task just yet.
Thank you for the very detailed feedback! I'll test that!
@xmfcx I have tested with the generated launch file and... it performs even worse than before! In most cases only 1 driver is loaded (sometimes even 0).
For example, here are 2 attempts with the autoware launch file (I use different colors for each sensor):
In both cases there are 3 sensors missing. Yesterday it was typically 2, today it looks like 3 is the lucky number.
Here are 2 other attempts with the generated launch file:
Just to be sure, I have tried to split the single launch xml into 2: one for the containers and one for the rest. Unfortunately, it does not change anything.
If the problem is related to some timing issue, I am not surprised it get worse with the single xml, as it should be processed faster.
As I tried to run the ros2 launch
command with --debug
, I have observed something strange though: the ros2 launch calls the pointcloud_container/_container/load_node
service 41 times, but only 20 responses are received:
So it looks like the load component requests may be dropped randomly when many are sent at the same time...?
Wait just to be sure, you
Is this correct?
This is interesting and we should be able to create a simpler reproducable setup with this and send a bug report to launch_ros repo.
I will try to replicate it today outside of the Autoware context and try to fix it in the launch_ros itself If I can.
Edit:
Actually this should be fixed of course in the component container service insread since it will omit some load calls over some queue...
Found the problem... and I feel very stupid about how obvious the cause was.
So, as I pointed out earlier, components are loaded in the container using a dedicated service and some of the service calls seem to be dropped. What is the number 1 reason a message may be dropped with ROS ? => if the message buffer is full!
As you can see below, components are managed by the ComponentManager
node, which creates the load_node
service with the default QoS profile:
loadNode_srv_ = create_service<LoadNode>(
"~/_container/load_node",
std::bind(&ComponentManager::on_load_node, this, _1, _2, _3));
unloadNode_srv_ = create_service<UnloadNode>(
"~/_container/unload_node",
std::bind(&ComponentManager::on_unload_node, this, _1, _2, _3));
listNodes_srv_ = create_service<ListNodes>(
"~/_container/list_nodes",
std::bind(&ComponentManager::on_list_nodes, this, _1, _2, _3));
The default QoS profile has a depth of 10, so if we try to load 41 components into the container at the same time... yes, a few may be dropped.
I cloned rclcpp_components
module into the autoware workspace, added a QoS profile with a queue size of 100, and now it works just perfectly!
Ping @mitsudome-r Although a fix should be proposed upstream, there is a serious issue with the current code base: there is currently no guarantee autoware containers are properly loaded if they are supposed to contain more than 10 components!
I've created the following basic launch file:
import launch
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode
from launch_ros.actions import LoadComposableNodes
def generate_launch_description():
container = ComposableNodeContainer(
name='my_container',
package='rclcpp_components',
executable='component_container',
namespace='',
output='log',
)
composable_nodes = []
for i in range(100):
composable_nodes.append(
ComposableNode(
package='composition',
plugin='composition::Talker',
name='talker_' + str(i),
)
)
loader = LoadComposableNodes(
composable_node_descriptions=composable_nodes,
target_container="my_container",
)
return launch.LaunchDescription([container, loader])
This never fails to run any amount of components, even when set to 1000 (although it takes it time).
So maybe the launching nodes should have heavier constructors to replicate this.
Also I've realized that output='log'
vs output='screen'
creates a huge difference in loading times.
On an unrelated note, we might want to switch all the outputs to log
for faster loading speeds.
It might be because nebula driver has to load "heavy" shared libraries, while the Ros talker has no dependency. On my machine, pointclouds appear one by one with a few seconds in between. I guess the problem is also more likely to happen if the CPU is stressed. One way to highlight the issue would be to add a sleep inside the load_node
callback.
Following is the best course of action I can think of:
rclcpp_components
COLCON_IGNORE
, to the unwanted packagesrclcpp_components
package to suit our needs.humble
As for how best to add this functionality to the component_manager.cpp, we could do one the following:
rclcpp::QoS(100)
to another constant value--ros-args -p queue_size:=100
Maybe there are other methods too but these are the first solutions I could come up with.
What do you think @VRichardJP @mitsudome-r @isamu-takagi @wep21 @esteve ?
@xmfcx I'd prefer not to fork rclcpp_components
and instead inherit/wrap the component manager, and at the same time raise this issue upstream. However, it seems to me that all the components are being loaded at the same time, hence the queue being too short. Would it be possible to load the components sequentially instead? Additionally, we could use event handlers to ensure that a component has been fully loaded before loading the next.
@esteve Thank you for your suggestions.
I'd prefer not to fork
rclcpp_components
and instead inherit/wrap the component manager, and at the same time raise this issue upstream.
Wrapping would require us to change the package names / node executables that we use in the launch files right? If so this would require more changes in our end than using a fork.
However, it seems to me that all the components are being loaded at the same time, hence the queue being too short. Would it be possible to load the components sequentially instead? Additionally, we could use event handlers to ensure that a component has been fully loaded before loading the next.
Even in its current state, Autoware loads slow already. I feel that adding some sort of event handler to wait for each component to load before sending the next would require 2 ways synchronization and increase this delay. I would prefer increasing the queue size and let the service server handle it as fast as possible. The service calls are not big messages anyway.
Also this would end up being a simpler solution too, just increasing the queue size.
@xmfcx I think in your example the single LoadComposableNodes
will load components one by one, while in autoware case, the components are loaded simultaneously from multiple launch files. I can reproduce the issue with the following script:
import launch
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode
from launch_ros.actions import LoadComposableNodes
def generate_launch_description():
container = ComposableNodeContainer(
name='my_container',
package='rclcpp_components',
executable='component_container',
namespace='',
output='log',
)
loaders = []
for i in range(20):
node = ComposableNode(
package='composition',
plugin='composition::Talker',
name='talker_' + str(i),
)
loaders.append(
LoadComposableNodes(
composable_node_descriptions=[node],
target_container="my_container",
)
)
return launch.LaunchDescription([container, *loaders])
With 10 components, no issue:
$ ros2 component list /my_container | wc -l
10
With 20 components, one is often missing:
$ ros2 component list /my_container | wc -l
19
Wrapping would require us to change the package names / node executables that we use in the launch files right? If so this would require more changes in our end than using a fork.
The problem with forks is that we start using them "as a temporary measure", and then eventually we end up with a huge collection of software that we have to maintain ourselves. It's not just this issue, but if rclcpp_components
gets updates upstream and fixes, we would have to keep our fork in sync.
- Overriding a package and keeping the same name
This does not include the cost of maintaining our own copy and having to synchronize it periodically with upstream.
- Creating a wrapper package and editing all container calls in the launch files
This can probably be automated.
@esteve
I am not a github superuser, but wouldn't it possible to have a clone of rclcpp
that automatically tracks humble
branch and merge latests upstream commits? As @xmfcx proposed, the fork would only need a bunch of COLCON_IGNORE
and to modify 3 lines in a file that is modified once every 2 years. So it would be very unlikely we would run into some merge conflict. Once a fix is available the rclcpp
fork would be removed.
With:
Checklist
Description
Hi, i'm having a problem with pointcloud_container running all the lidars on my system. I have
Expected behavior
Running all lidars in the system with containers
Actual behavior
One of the lidars not working when using the container
Steps to reproduce
Running lidars with pointcloud container in a system with more than 5 lidars
Versions
Possible causes
No response
Additional context
No response