feat: add autoware_node and autoware_control_center

xmfcx commented 1 year ago

Signed-off-by: M. Fatih Cırıt mfc@leodrive.ai

Description

This PR is for creating the base class for all the future Autoware Core nodes to inherit from.

This node will be inheriting from the rclcpp_lifecycle::LifecycleNode with (LifeCycle Readme).

I'll be taking ros2 nav2 stack as reference too while combining Lifecycle and Composable node concepts.

While developing this PR, I am going to explore ideas mentioned in previous conversations in the Autoware community.

Communication architecture

### Registration Service **Service server:** `ACC` **Service client:** `AutowareNode` **Roles:** - Register the `AutowareNode` in `ACC` - Receive the UUID for the `AutowareNode` - If already registered, return error ```mermaid sequenceDiagram participant acc as Autoware Control Center participant node as node_a Note right of node: cli_reg node ->>+ acc: name: node_a Note left of acc: srv_reg acc ->> acc: generate a UUID for the node_a acc ->> acc: create a bond between acc and node_a Note left of acc: srv_reg acc -->>- node: UUID and registration success bool Note right of node: cli_reg ``` ### Heart beat bond ~~https://github.com/ros/bond_core/blob/ros2/ros2_migration_readme.md~~ Slow https://github.com/ros-safety/software_watchdogs https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html https://docs.ros.org/en/rolling/Concepts/About-Quality-of-Service-Settings.html#qos-events **Roles:** - As long as the `AutowareNode` is alive, it reports with this bond - If the `AutowareNode` dies, it triggers the `on_broken` callback ### Reporting service **Service server:** `ACC` **Service client:** `AutowareNode` **Roles:** - Reporting state of the `AutowareNode` to the `ACC` - Warning - Error ### Control service **Service server:** `AutowareNode` **Service client:** `ACC` Roles: - Shutting down the `AutowareNode` - Toggle `AutowareNode` between active-inactive?

Tasks

[x] Implement registration service
- [x] Define the interface
- [x] Create the client
- [x] Create the service
- [x] Make them communicate
- [x] Generate UUID
- [x] Send UUID
[ ] Implement control service
[x] Implement reporting service
[x] Implement watchdog - heartbeat

Tests performed

PR Review Items

## Notes for reviewers ## Pre-review checklist for the PR author The PR author **must** check the checkboxes below when creating the PR. - [x] I've confirmed the [contribution guidelines]. - [x] The PR follows the [pull request guidelines]. ## In-review checklist for the PR reviewers The PR reviewers **must** check the checkboxes below before approval. - [ ] The PR follows the [pull request guidelines]. - [ ] The PR has been properly tested. - [ ] The PR has been reviewed by the code owners. ## Post-review checklist for the PR author The PR author **must** check the checkboxes below before merging. - [ ] There are no open discussions or they are tracked via tickets. - [ ] The PR is ready for merge. After all checkboxes are checked, anyone who has write access can merge the PR. [contribution guidelines]: https://autowarefoundation.github.io/autoware-documentation/main/contributing/ [pull request guidelines]: https://autowarefoundation.github.io/autoware-documentation/main/contributing/pull-request-guidelines/

codecov[bot] commented 1 year ago

Codecov Report

Attention: Patch coverage is 13.97059% with 351 lines in your changes are missing coverage. Please review.

Please upload report for BASE (main@9989140). Learn more about missing BASE report.

:exclamation: Current head e4f5fca differs from pull request most recent head ed2efdb

Please upload reports for the commit ed2efdb to get more accurate results.

Files	Patch %	Lines
...are_control_center/src/autoware_control_center.cpp	19.25%	65 Missing and 44 partials :warning:
common/autoware_node/src/autoware_node.cpp	0.00%	94 Missing :warning:
...ntrol_center/test/test_autoware_control_center.cpp	14.66%	6 Missing and 58 partials :warning:
...utoware_control_center/test/test_node_registry.cpp	18.42%	0 Missing and 31 partials :warning:
common/test_node/src/test_pub.cpp	0.00%	22 Missing :warning:
...mmon/autoware_control_center/src/node_registry.cpp	56.52%	3 Missing and 7 partials :warning:
common/test_node/src/test_node.cpp	0.00%	10 Missing :warning:
...ontrol_center/src/autoware_control_center_node.cpp	0.00%	9 Missing :warning:
...utoware_control_center/autoware_control_center.hpp	0.00%	1 Missing :warning:
.../include/autoware_control_center/node_registry.hpp	0.00%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #73 +/- ## ======================================= Coverage ? 13.97% ======================================= Files ? 10 Lines ? 408 Branches ? 189 ======================================= Hits ? 57 Misses ? 211 Partials ? 140 ``` | [Flag](https://app.codecov.io/gh/autowarefoundation/autoware.core/pull/73/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=autowarefoundation) | Coverage Δ | | |---|---|---| | [differential](https://app.codecov.io/gh/autowarefoundation/autoware.core/pull/73/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=autowarefoundation) | `13.97% <13.97%> (?)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

stale[bot] commented 1 year ago

This pull request has been automatically marked as stale because it has not had recent activity.

xmfcx commented 1 year ago

@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.

JianKangEgon commented 1 year ago

@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.

OK, thanks @xmfcx

xmfcx commented 1 year ago

@Tom-Li-Lee to be assigned too.

stale[bot] commented 10 months ago

This pull request has been automatically marked as stale because it has not had recent activity.

xmfcx commented 9 months ago

Priority list:

Registering the node to ACC
1. De-register / Re-register
2. Handle different orders of launching
Heartbeat/liveliness for the nodes, reports in last call milliseconds
Receive error states in ACC (through a service call) (Node can report something is wrong with itself)
Monitored Sub
1. Creates a deadline callback which can either send (info/warning/error) through the previously implemented error handling
Lifecycle
1. delayed/controlled activate
2. shutdown
3. activate/deactivate cycles

lexavtanke commented 9 months ago

@xmfcx I wrote some thoughts about how to handle different orders of launching. Could you please give some feedback on it?

Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.

Case ACC starts first. Nodes are not registered, so they send calls to register service. Get return from service and set Register true.

Case ACC starts after Nodes. Nodes are not registered, service calls in timer callback are skipped until ACC starts and service is available. As ACC is available, all happens as in Case ACC starts first.

Case ACC dies after All Nodes have been registered. ACC has a timeout after launching. If none node has registered during this time ACC will list all nodes with re-registered service and will send calls. So all nodes will be unregistered and all will work as in Case ACC starts first.

xmfcx commented 9 months ago

Each node has a re-register service (server) which will set Registered to false after service call.

I see, this makes sense :+1:. It seems we already have the unregister_node() (should rename to deregister_node()) function in the node_registry.hpp. The AutowareNode instances can have deregister() function too.

UUID for the ACC

There is also is_registered() function which returns a boolean. (Maybe it's better if it returned a std::optional<UUID> ?)

This brings up a new topic: what do they register to? So it could make sense to have a UUID for each ACC too. So when a node registers, it will remember who it registered to. And when someone asks to an AutowareNode, it can respond with its corresponding ACC instance.

When faulty ACC dies and new ACC joins, the new ACC can list the nodes with bad ACC. It would deregister them and those nodes would register automatically to the new ACC.

Heartbeat from ACC

Additionally, ACC can have a heartbeat and all the AutowareNode instances would subscribe to this. And if it passes a deadline, they would automatically deregister themselves.

Registration async loop in the AutowareNode

Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.

We shouldn't need a re-register service. Just a register service should be enough. I might also be missing something, what do you think?

Also it would be important to have this registration checker loop to be an async operation. Like a background worker. So that it wouldn't affect the normal operation of the AutowareNode.

Thanks for working on this task! :heart_decoration:

lexavtanke commented 9 months ago

@xmfcx Thank you for your thoughts.

It is a really good idea to have

UUID for ACC

so it will be easy to distinguish nodes which were registered to an old ACC.

Here is another case to consider. What if node has some issues during start-up and it will not be able to register? If so ACC will not know that node should even exist. Maybe ACC should have access to list of launching nodes?

As I remember you did some work in this direction, didn't you?

We shouldn't need a re-register service. Just a register service should be enough. I might also be missing something, what do you think?

It seems so for me. At least for now I think all cases can be reduced to register if node can receive information that ACC has died.

Do you think that heartbeat to all all AutowareNodes is a better solution than to have service call to AutowareNodes from new ACC? Re-registering should not be a frequent case. It is kinda emergency. We will not have any additional overhead but it will be significant during the re-registration process in normal situation.

In the case of ACC heartbeat we will always have overhead.

xmfcx commented 9 months ago

Expected nodes list in ACC

In the future, we can implement a feature like this. But for now, ACC doesn't need to know what will launch.

As I remember you did some work in this direction, didn't you?

I thought about this potential but didn't do anything so far.

Deregistration conditions (from AutowareNode perspective)

Let's fix the terminology first (I also edited my previous posts): https://english.stackexchange.com/questions/25931/unregister-vs-deregister

state=unregistered -> 'register' -> state=registered -> 'deregister' -> state=unregistered`

We could have multiple conditions for deregistration.

deregister service
timeout from the ACC heartbeat deadline

We should definitely have the deregister service. And we can keep the ACC heartbeat implementation on hold for now.

lexavtanke commented 8 months ago

Latest updates.

Heartbeat publishing was added to Autoware node class. It works as regular ros publisher on node_name/heartbeat topic with with additional QoS settings. So subscriber will be able to receive signals if that settings will be violated.

Autoware Control Center will subscribe to Autoware node heartbeat topic during node registration and will store pointers to subscriber and qos settings for particular Autoware node in node registry along with node_name and uuid. So registration callback in Autoware Control Center node register service should be updated.

lexavtanke commented 8 months ago

@xmfcx Could you please elaborate?

2. reports in last call milliseconds

As for now there is heartbeat topic for each instance of Autoware node. ACC subscribes to this topic after successful node registration. For now callback of the sub does nothing just print heartbeat timestamp for debugging. https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L209

Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.

Also each of heartbeat subscribers has callback on the event of violation QoS (liveliness_callback). https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L191

We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.

xmfcx commented 8 months ago

reports in last call milliseconds

I've meant "time passed since last heartbeat" here.

Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.

The ACC(Autoware Control Center) should keep track of the last heartbeat received for each AN(AutowareNode) and periodically report the time passed since last beat received from the AN.

Maybe we could have a report message (autoware_control_center_msgs::AutowareNodeReport) in the ACC that lists the:

Node Identifier: Unique ID or name for each AN
Connectivity State:
- Connected: Active communication link.
- Disconnected: Communication lost.
- Never Connected: No record of past connection.
Health State:
- Unknown: Status unclear due to communication issues or lack of updates.
- Healthy: Operating within normal parameters.
- Warning: Operating under suboptimal conditions.
- Error: Critical issues requiring immediate attention.
Time Since Last Heartbeat

And we would have an array of these messages in a single message called autoware_control_center_msgs::AutowareNodeReports which is periodically published.

We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.

This is a good idea but this callback will be probably called very often once a node dies. So it might be better to avoid adding a liveliness callback to the heartbeat messages. Instead, when we receive a heartbeat, just record the stamp. And when generating a report, we can calculate the time passed since last communication for each AN.

lexavtanke commented 8 months ago

@xmfcx Thank you for your thoughts!

Maybe we could have a report message (autoware_control_center_msgs::AutowareNodeReport) in the ACC that lists the:

Yes, I have the same idea. And I have started to implement it.

this callback will be probably called very often once a node dies

Actually it called only on state change as I tested it before. So It called ones then node died. But I will think how it will be better to implement it.

lexavtanke commented 7 months ago

Fixed double calling deregister service during ACC start-up procedure. AutowareNodeReports publishing was implemented in ACC. Error state receiver (service) in ACC is a wip.

lexavtanke commented 7 months ago

Implemented AutowareNodeError service in ACC. Implemented AutowareNodeError client in AutowareNode.

For now AutowareNode is able to send it state to ACC in form of AutowareNodeState msg with 4 main states:

UNKNOWN
NORMAL
WARNING
ERROR

lexavtanke commented 7 months ago

Implemented monitored subscription in AutowareNode https://github.com/autowarefoundation/autoware.core/pull/73/commits/b59c925846e450742d54397ed3aa25acead5cf57 but It produce error during run time in test_node

/home/alex/autoware.core/install/test_node/lib/test_node/test_node: symbol lookup error: /home/alex/autoware.core/install/test_node/lib/libtest_node_core.so: undefined symbol: _ZN13autoware_node12AutowareNode24test_create_subscriptionIN8std_msgs3msg7String_ISaIvEEESt5_BindIFMN9test_node8TestNodeEFvSt10shared_ptrIS6_EEPS9_St12_PlaceholderILi1EEEES5_N6rclcpp12SubscriptionIS6_S5_S6_S6_NSJ_23message_memory_strategy21MessageMemoryStrategyIS6_S5_EEEESN_EESA_IT2_ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNSJ_3QoSEOT0_RKNSJ_32SubscriptionOptionsWithAllocatorIT1_EENT3_9SharedPtrE
[ros2run]: Process exited with failure 127

Test_node is not able to call function from AutowareNode. It seem connected to linking library error.

lexavtanke commented 7 months ago

With @xmfcx we identified the root cause of the issue to be the separation of the declaration and definition of a templated function across .cpp and .hpp files.

This approach led to linkage problems. By consolidating both the declaration and definition within the .hpp file and eliminating the definition from the .cpp file, we were able to resolve the issue effectively.

I use some utility functions for work with uuid from tier4_autoware_utility package. There was a build error because of tier4_autoware_utils as it is the part of autoware.universe and it is not presented in autoware.core.

We talked with @xmfcx about the issue and decided to port the uuid functions to autoware_common as it autoware.core already depend on it. I put uuid helper functions to autoware_utils. It was done in the PR .

For now the PR is ready for review but there are some drawbacks:

I didn't switch RCLCPP_INFO messages to RCLCPP_DEBUG for more easy testing.
There is the boost dependency issue. It seems like there is no boost lib installed in the container.
I think it is better to introduce Lifecycle control in a next PR after this one will be merged and we try to use it to implement some nodes to see how it works with a real package.

lexavtanke commented 7 months ago

Instruction how-to launch autoware_control_center and autoware_node (test_node).

There is no dedicated launch file for autoware_control_center. So you need to run it with command: ros2 run autoware_control_center autoware_control_center

After you run it you need to start single test autoware node by command: ros2 run test_node test_node or 4 instances with different names and namespace by launch file: ros2 launch test_node autoware_nodes.launch Also you can start node first.

There is startup timer (for 10 s) in ACC which will send deregistration call (only once) to all autoware_nodes presented in the system in the case if previous ACC has died and no one node will registered during this period. If there is no nodes in the system nothing will happen. ACC will just stay alive and wait until some node will registered.

ACC publish topic /autoware_control_center/autoware_node_reports with current state of registered nodes.

There are simple publisher node (test_pub) with proper Qos profile to test monitored subscription function. By default it publish to /topic with proper frequency of 10 Hz and it is possible to change it with param hz. To run test_pub use command: ros2 run test_node test_pub --ros-args -p hz:=10

Each test_node has monitored subscriber configured to receive messages on /topic with 10 hz if this condition is violated node will call service in ACC and will report error state.

JianKangEgon commented 7 months ago

hi @lexavtanke , I just followed your instructions and tried the test node. Everything works fine and the nodes defined in "autoware_nodes.launch" are registered or de-registered when I start or shut down the launch file. However, when I want to run a single test node using

ros2 run test_node test_node

It looks okay, but after I shut down the node and restarted it, the log shows below

The left window represents the ACC node, and the right represents the test node. There is always an error log even though the ACC looks normal.

Based on this attempt, I comment out the "test_node1", "test_node2" and "test_node3" in "autoware_nodes.launch" and only remain the "test_node", then perform start & shutdown & restart. This time, both the ACC node and the test node look good. There is nothing special in the launch file, so I'm curious about the difference.

lexavtanke commented 7 months ago

Hi @JianKangEgon, thank you for your interest in this work and your feedback.

First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.

As in my experience I usually relaunch a whole system during a test.

For the difference between using the launch file and ros2 run test_node test_node there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.

If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.

If I didn't understand you correctly could you please elaborate.

JianKangEgon commented 7 months ago

Hi @JianKangEgon, thank you for your interest in this work and your feedback.

First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.

As in my experience I usually relaunch a whole system during a test.

For the difference between using the launch file and ros2 run test_node test_node there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.

If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.

If I didn't understand you correctly could you please elaborate.

OK, thanks for your quick reply.

Further update: in this case, is it possible to display the info log instead of the error log? Since the node has been registered, the error log may cause unexpected misunderstandings.

lexavtanke commented 6 months ago

I’ve added uuid_node last_node_state and node_report fields to AutowareNodeReport message.

I’ve added logic to manage relaunch autoware_node without relaunching autoware_control_center. If ANode is relaunched and sends register request ACC updated information (uuid) in register and increase counter field (num_registered) by one. ANode receives success code from ACC. ACC warn log message for such case changed to info message.

I’ve added call AutowareNodeError service in ANode constructor to send NORMAL state to ACC after node creation.

xmfcx commented 6 months ago

@ahmeddesokyebrahim could you also review this?

brkay54 commented 6 months ago

@lexavtanke Could you fix the CI check errors?

lexavtanke commented 6 months ago

brkay54 Hi, which one? If you are talking about uncrustify. It works a bit weird here as all errors are introduced by style(pre-commit): autofix

brkay54 commented 6 months ago

@lexavtanke I mean uncrustify and DCO. I saw the problem for uncrustify. Thank you.

lexavtanke commented 6 months ago

brkay54 DCO errors belong to @xmfcx commits.

ahmeddesokyebrahim commented 5 months ago

Everything look very fine.

I have only 1 question. Is there a reason we are not using doxygen comments like that

// For methods/functions
/**
 * @brief Brief description of the method.
 *
 * Detailed description (if needed) goes here.
 * @param[in] param1 Description of parameter 1.
 * @param[out] param2 Description of parameter 2 (if any).
 * @return Description of return value (if any).
 */

// For variables
/**
 * @brief Brief description of the variable.
 *
 * Detailed description (if needed) goes here.
 */

This style is already used in Autoware BTW, and here is one example.

Finally, thanks a lot for the amazing efforts and great work done in this PR :+1:

lexavtanke commented 5 months ago

@ahmeddesokyebrahim I think there is no particular reason as there are several different styles of comments in Autoware.Universe. There is a lack of consistency)

mitsudome-r commented 5 months ago

Please fix the failing CI with uncrustify and we can merge the PR.

xmfcx commented 5 months ago

@lexavtanke Why is there a startup timer in the ACC?

In the readme:

ACC has startup timer and waits for 10 sec for any node to be registered. If at least one node is registered ACC starts normal work. If not ACC will think that it was relaunched after crash and will start re-register procedure.

Does this mean it won't register new nodes after the initial startup timer?

Registration and deregistration should work regardless of this timer. I think we can remove this initialization timer.

xmfcx commented 5 months ago

Could you also fix the clang-tidy warnings? https://github.com/autowarefoundation/autoware.core/actions/runs/8692780358/job/23838263634?pr=73

lexavtanke commented 5 months ago

@xmfcx Thank you for your review.

Why is there a startup timer in the ACC?

The purpose of the startup timer is to manage the case then ACC was some how re-launched without re-launch the rest of the system. So nodes don't know that there is a new ACC and they need to re-register themself. ACC don't know about it as well. And timer start re-registering in that case.

During the regular startup (the whole system) registration and de-registration work fine and the timer doesn't affect them.

xmfcx commented 5 months ago

@lexavtanke I understand. But we should avoid these kind of timing based alignments and instead solve it with an event-driven architecture.

In case of:

New ACC emits a deregister signal to all ANs (as a startup event)
ANs launched
ANs are registering to the ACC
ACC died
ACC quickly relaunched
New ACC emits a deregister signal to all ANs (as a startup event)
ANs should register to the new ACC

This way, regardless of timing, when a new ACC comes out, it makes sure everyone registers from scratch.

lexavtanke commented 5 months ago

@xmfcx It seems like we have an issue with cpplint related to https://github.com/cpplint/cpplint/issues/184 after we suppress clang-tidy with // NOLINTNEXTLINE(performance-unnecessary-value-param)

xmfcx commented 5 months ago

It seems they have added clang-analyzer- to the list of ignored prefixes for cpplint.

https://github.com/cpplint/cpplint/blob/e3a1938af916506c1cd658c2f464b2837597b81c/cpplint.py#L382-L408

But ours is has a performance- prefix.

Which is not in the error categories listed here:

https://github.com/cpplint/cpplint/blob/e3a1938af916506c1cd658c2f464b2837597b81c/cpplint.py#L295-L364

I couldn't find a way to add our check to this list. Easiest solution is to just add it as // NOLINTNEXTLINE without the name.

Another could be to disable cpplint altogether but I won't do that for now.

Edit:

Oh it seems they have added performance- to the list too with https://github.com/cpplint/cpplint/commit/8eeaf2352ad474c011339342022df7d278ac43c5