Make it possible to detect publishing times of the messages in the pipelines

brkay54 commented 5 months ago

Checklist

[X] I've read the contribution guidelines.
[X] I've searched other issues and no duplicate issues were found.
[X] I've agreed with the maintainers that I can plan this task.

Description

It is mainly needed for:

https://github.com/autowarefoundation/autoware.universe/issues/6547

cc. @xmfcx @mitsudome-r

In the current Autoware, all the pipelines are designed to use their input's timestamp while publishing output as shown below.

pipeline drawio

However, it causes we can not know the exact timing of the messages to analyze their reaction times. It makes it hard to find bottlenecks inside the pipelines.

While we were developing the tool named reaction_analyzer, we made some changes to the perception, sensing, planning, and control pipeline header times to check nodes' reaction times. (Here you can see the results.) However, it is not possible to change the header times of the output message of the nodes as we discussed in the Perception Sensing WG. However, this situation makes it hard to find the exact timing of the outputs inside the pipeline.

Purpose

Finding new methods to see outputs' exact timings to be able to analyze the nodes' reaction times inside the pipeline.

Possible approaches

Adding a new low-sized message to show nodes' input origin times and output times.

Definition of done

Merge these PRs in order:

VRichardJP commented 5 months ago

Long time ago I wanted to solve the same issue and wrote a simple python script to watch predefined pipelines (= list of topics). For example, if I want to analyze the following pipeline:

/input -> (nodeA) -> /after_A -> (nodeB) -> /after_B -> (nodeC) -> /output

I just subscribe all topics /input, /after_A, /after_B and /output (python is great for that because you can easily subscribe any topic type) and keep track of the time I receive each message. Since data header is not modified over the pipeline, I can use the header timestamp as an ID to associate messages from multiple topics. You can also detect frame drops by finding IDs that never reached the end of the pipeline (e.g. if a later ID reached some topic before the current one). Unless data transfer delay is very unbalanced between input and output, by comparing the time you receive 2 messages with the same ID you get a fairly accurate measurement of the node processing time.

Wouldn't this do the job?

brkay54 commented 5 months ago

@VRichardJP Thanks for the answer, it is a very sensible method. However, what I am trying to do is catch the first messages of each node (predefined checkpoints) to understand which node or pipeline is late to react. (you can see the sample test video here.

Using the origin time of the message as ID is a great idea, however, when I catch the reacted message, also I need its published time. Maybe adding a new small-sized message that includes the origin time and the publishing time to each node would be a solution. We can use the origin times as ID as you said and after catching the reacted message, we can find the publishing time from the other small-sized message by using the header time of the reacted message.

VRichardJP commented 5 months ago

I am not sure I understand the issue. Is the point of collecting publish timestamps to precisely measure ROS2/DDS performance? For example, to measure the blue arrows in the diagram below (based on the method presented above)?

Edit: the "delay" calculation in the diagram above is not correct. It is: delay = data A transfer time - data B transfer time + data A interprocess transfer time Which is roughly delay = data A interprocess transfer time if input and output data have similar size (unless there is a lag at DDS level).

brkay54 commented 5 months ago

@VRichardJP The purpose is not to measure the DDS or ROS 2 performance, the purpose is to measure total system performance and we want to be able to find the bottlenecks in that system.

What we are doing is we are recording two pointcloud messages: pointcloud without object (let say pointcloud A) and pointcloud with an object(pointcloud B). We are sending the Pointcloud A until all Autoware nodes and stacks are ready, after that, we start to publish pointcloud B. The time we started to publish pointcloud B is recorded as spawn_time of the object.

After the object is spawned, we check each node's outputs to see how much time it takes to show us the object as output. By using this method, we can see all system performance, and also we can check each node's reaction to analyze the bottlenecks in that system.

I hope I can explain the purpose, please let me know if you have any questions.

VRichardJP commented 5 months ago

@brkay54 thank you I understand now.

brkay54 commented 5 months ago

We had a meeting with @xmfcx and @mitsudome-r, We discussed about the possible solutions:

Possible Solutions: Option - 1: Modify each node to add a new debug topic publisher. The topic will contain:

Original header timestamp
timestamp of published timing

Reaction analyzer will subscribe to the main topic and the debug topic with exact time filter to overwrite published time internally to the main topic. This topic can be turned on/off by a parameter.

Option - 2 Modify each node to do the condition checks that’s being done in reaction analyzer e.g., if the node detects an object in trajectory, it will publish “true” in the debug topic with the timestamp.

Reaction analyzer subscribes to the debug topic to calculate the time difference between the timestamp in the debug topic.

We decided to implement the Option - 1. We are going to add a new debug publisher in each node with a turn-on/off flag. This option is going to be present in Perception/Sensing WG.

autowarefoundation / autoware.universe