Closed xmfcx closed 4 months ago
Attention: Patch coverage is 13.97059%
with 351 lines
in your changes are missing coverage. Please review.
Please upload report for BASE (
main@9989140
). Learn more about missing BASE report.:exclamation: Current head e4f5fca differs from pull request most recent head ed2efdb
Please upload reports for the commit ed2efdb to get more accurate results.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This pull request has been automatically marked as stale because it has not had recent activity.
@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.
@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.
OK, thanks @xmfcx
@Tom-Li-Lee to be assigned too.
This pull request has been automatically marked as stale because it has not had recent activity.
Priority list:
@xmfcx I wrote some thoughts about how to handle different orders of launching. Could you please give some feedback on it?
Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.
Case ACC starts first. Nodes are not registered, so they send calls to register service. Get return from service and set Register true.
Case ACC starts after Nodes. Nodes are not registered, service calls in timer callback are skipped until ACC starts and service is available. As ACC is available, all happens as in Case ACC starts first.
Case ACC dies after All Nodes have been registered. ACC has a timeout after launching. If none node has registered during this time ACC will list all nodes with re-registered service and will send calls. So all nodes will be unregistered and all will work as in Case ACC starts first.
Each node has a re-register service (server) which will set Registered to false after service call.
I see, this makes sense :+1:. It seems we already have the unregister_node()
(should rename to deregister_node()
) function in the node_registry.hpp
. The AutowareNode instances can have deregister()
function too.
There is also is_registered()
function which returns a boolean. (Maybe it's better if it returned a std::optional<UUID>
?)
This brings up a new topic: what do they register to? So it could make sense to have a UUID for each ACC too. So when a node registers, it will remember who it registered to. And when someone asks to an AutowareNode, it can respond with its corresponding ACC instance.
When faulty ACC dies and new ACC joins, the new ACC can list the nodes with bad ACC. It would deregister them and those nodes would register automatically to the new ACC.
Additionally, ACC can have a heartbeat and all the AutowareNode instances would subscribe to this. And if it passes a deadline, they would automatically deregister themselves.
Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.
We shouldn't need a re-register
service. Just a register
service should be enough. I might also be missing something, what do you think?
Also it would be important to have this registration checker loop to be an async operation. Like a background worker. So that it wouldn't affect the normal operation of the AutowareNode.
Thanks for working on this task! :heart_decoration:
@xmfcx Thank you for your thoughts.
It is a really good idea to have
UUID for ACC
so it will be easy to distinguish nodes which were registered to an old ACC.
Here is another case to consider. What if node has some issues during start-up and it will not be able to register? If so ACC will not know that node should even exist. Maybe ACC should have access to list of launching nodes?
As I remember you did some work in this direction, didn't you?
We shouldn't need a re-register service. Just a register service should be enough. I might also be missing something, what do you think?
It seems so for me. At least for now I think all cases can be reduced to register
if node can receive information that ACC has died.
Do you think that heartbeat to all all AutowareNodes is a better solution than to have service call to AutowareNodes from new ACC? Re-registering should not be a frequent case. It is kinda emergency. We will not have any additional overhead but it will be significant during the re-registration process in normal situation.
In the case of ACC heartbeat we will always have overhead.
In the future, we can implement a feature like this. But for now, ACC doesn't need to know what will launch.
As I remember you did some work in this direction, didn't you?
I thought about this potential but didn't do anything so far.
Let's fix the terminology first (I also edited my previous posts): https://english.stackexchange.com/questions/25931/unregister-vs-deregister
state=unregistered -> 'register' -> state=registered -> 'deregister' -> state=unregistered`
We could have multiple conditions for deregistration.
deregister
serviceWe should definitely have the deregister
service. And we can keep the ACC heartbeat implementation on hold for now.
Latest updates.
Heartbeat publishing was added to Autoware node class. It works as regular ros publisher on node_name/heartbeat topic with with additional QoS settings. So subscriber will be able to receive signals if that settings will be violated.
Autoware Control Center will subscribe to Autoware node heartbeat topic during node registration and will store pointers to subscriber and qos settings for particular Autoware node in node registry along with node_name and uuid. So registration callback in Autoware Control Center node register service should be updated.
@xmfcx Could you please elaborate?
2. reports in last call milliseconds
As for now there is heartbeat topic for each instance of Autoware node. ACC subscribes to this topic after successful node registration. For now callback of the sub does nothing just print heartbeat timestamp for debugging. https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L209
Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.
Also each of heartbeat subscribers has callback on the event of violation QoS (liveliness_callback). https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L191
We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.
- reports in last call milliseconds
I've meant "time passed since last heartbeat" here.
Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.
The ACC(Autoware Control Center) should keep track of the last heartbeat received for each AN(AutowareNode) and periodically report the time passed since last beat received from the AN.
Maybe we could have a report message (autoware_control_center_msgs::AutowareNodeReport
) in the ACC that lists the:
And we would have an array of these messages in a single message called autoware_control_center_msgs::AutowareNodeReports
which is periodically published.
We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.
This is a good idea but this callback will be probably called very often once a node dies. So it might be better to avoid adding a liveliness callback to the heartbeat messages. Instead, when we receive a heartbeat, just record the stamp. And when generating a report, we can calculate the time passed since last communication for each AN.
@xmfcx Thank you for your thoughts!
Maybe we could have a report message (
autoware_control_center_msgs::AutowareNodeReport
) in the ACC that lists the:
Yes, I have the same idea. And I have started to implement it.
this callback will be probably called very often once a node dies
Actually it called only on state change as I tested it before. So It called ones then node died. But I will think how it will be better to implement it.
Fixed double calling deregister service during ACC start-up procedure. AutowareNodeReports publishing was implemented in ACC. Error state receiver (service) in ACC is a wip.
Implemented AutowareNodeError service in ACC. Implemented AutowareNodeError client in AutowareNode.
For now AutowareNode is able to send it state to ACC in form of AutowareNodeState msg with 4 main states:
Implemented monitored subscription in AutowareNode https://github.com/autowarefoundation/autoware.core/pull/73/commits/b59c925846e450742d54397ed3aa25acead5cf57 but It produce error during run time in test_node
/home/alex/autoware.core/install/test_node/lib/test_node/test_node: symbol lookup error: /home/alex/autoware.core/install/test_node/lib/libtest_node_core.so: undefined symbol: _ZN13autoware_node12AutowareNode24test_create_subscriptionIN8std_msgs3msg7String_ISaIvEEESt5_BindIFMN9test_node8TestNodeEFvSt10shared_ptrIS6_EEPS9_St12_PlaceholderILi1EEEES5_N6rclcpp12SubscriptionIS6_S5_S6_S6_NSJ_23message_memory_strategy21MessageMemoryStrategyIS6_S5_EEEESN_EESA_IT2_ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNSJ_3QoSEOT0_RKNSJ_32SubscriptionOptionsWithAllocatorIT1_EENT3_9SharedPtrE
[ros2run]: Process exited with failure 127
Test_node is not able to call function from AutowareNode. It seem connected to linking library error.
With @xmfcx we identified the root cause of the issue to be the separation of the declaration and definition of a templated function across .cpp and .hpp files.
This approach led to linkage problems. By consolidating both the declaration and definition within the .hpp file and eliminating the definition from the .cpp file, we were able to resolve the issue effectively.
I use some utility functions for work with uuid from tier4_autoware_utility package. There was a build error because of tier4_autoware_utils as it is the part of autoware.universe and it is not presented in autoware.core.
We talked with @xmfcx about the issue and decided to port the uuid functions to autoware_common as it autoware.core already depend on it. I put uuid helper functions to autoware_utils. It was done in the PR .
For now the PR is ready for review but there are some drawbacks:
Instruction how-to launch autoware_control_center and autoware_node (test_node).
There is no dedicated launch file for autoware_control_center. So you need to run it with command:
ros2 run autoware_control_center autoware_control_center
After you run it you need to start single test autoware node by command:
ros2 run test_node test_node
or 4 instances with different names and namespace by launch file:
ros2 launch test_node autoware_nodes.launch
Also you can start node first.
There is startup timer (for 10 s) in ACC which will send deregistration call (only once) to all autoware_nodes presented in the system in the case if previous ACC has died and no one node will registered during this period. If there is no nodes in the system nothing will happen. ACC will just stay alive and wait until some node will registered.
ACC publish topic /autoware_control_center/autoware_node_reports
with current state of registered nodes.
There are simple publisher node (test_pub) with proper Qos profile to test monitored subscription function. By default it publish to /topic
with proper frequency of 10 Hz and it is possible to change it with param hz.
To run test_pub use command:
ros2 run test_node test_pub --ros-args -p hz:=10
Each test_node has monitored subscriber configured to receive messages on /topic
with 10 hz if this condition is violated node will call service in ACC and will report error state.
hi @lexavtanke , I just followed your instructions and tried the test node. Everything works fine and the nodes defined in "autoware_nodes.launch" are registered or de-registered when I start or shut down the launch file. However, when I want to run a single test node using
ros2 run test_node test_node
It looks okay, but after I shut down the node and restarted it, the log shows below
The left window represents the ACC node, and the right represents the test node. There is always an error log even though the ACC looks normal.
Based on this attempt, I comment out the "test_node1", "test_node2" and "test_node3" in "autoware_nodes.launch" and only remain the "test_node", then perform start & shutdown & restart. This time, both the ACC node and the test node look good. There is nothing special in the launch file, so I'm curious about the difference.
Hi @JianKangEgon, thank you for your interest in this work and your feedback.
First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.
As in my experience I usually relaunch a whole system during a test.
For the difference between using the launch file and ros2 run test_node test_node
there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.
If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.
If I didn't understand you correctly could you please elaborate.
Hi @JianKangEgon, thank you for your interest in this work and your feedback.
First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.
As in my experience I usually relaunch a whole system during a test.
For the difference between using the launch file and
ros2 run test_node test_node
there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.
If I didn't understand you correctly could you please elaborate.
OK, thanks for your quick reply.
Further update: in this case, is it possible to display the info log instead of the error log? Since the node has been registered, the error log may cause unexpected misunderstandings.
I’ve added uuid_node last_node_state and node_report fields to AutowareNodeReport
message.
I’ve added logic to manage relaunch autoware_node
without relaunching autoware_control_center
. If ANode is relaunched and sends register request ACC updated information (uuid) in register and increase counter field (num_registered
) by one. ANode receives success code from ACC.
ACC warn log message for such case changed to info message.
I’ve added call AutowareNodeError
service in ANode constructor to send NORMAL
state to ACC after node creation.
@ahmeddesokyebrahim could you also review this?
@lexavtanke Could you fix the CI check errors?
brkay54 Hi, which one? If you are talking about uncrustify. It works a bit weird here as all errors are introduced by style(pre-commit): autofix
@lexavtanke I mean uncrustify
and DCO
. I saw the problem for uncrustify
. Thank you.
brkay54 DCO
errors belong to @xmfcx commits.
Everything look very fine.
I have only 1 question. Is there a reason we are not using doxygen comments like that
// For methods/functions
/**
* @brief Brief description of the method.
*
* Detailed description (if needed) goes here.
* @param[in] param1 Description of parameter 1.
* @param[out] param2 Description of parameter 2 (if any).
* @return Description of return value (if any).
*/
// For variables
/**
* @brief Brief description of the variable.
*
* Detailed description (if needed) goes here.
*/
This style is already used in Autoware BTW, and here is one example.
Finally, thanks a lot for the amazing efforts and great work done in this PR :+1:
Please fix the failing CI with uncrustify and we can merge the PR.
@lexavtanke Why is there a startup timer in the ACC?
In the readme:
ACC has startup timer and waits for 10 sec for any node to be registered. If at least one node is registered ACC starts normal work. If not ACC will think that it was relaunched after crash and will start re-register procedure.
Does this mean it won't register new nodes after the initial startup timer?
Registration and deregistration should work regardless of this timer. I think we can remove this initialization timer.
Could you also fix the clang-tidy warnings? https://github.com/autowarefoundation/autoware.core/actions/runs/8692780358/job/23838263634?pr=73
@xmfcx Thank you for your review.
Why is there a startup timer in the ACC?
The purpose of the startup timer is to manage the case then ACC was some how re-launched without re-launch the rest of the system. So nodes don't know that there is a new ACC and they need to re-register themself. ACC don't know about it as well. And timer start re-registering in that case.
During the regular startup (the whole system) registration and de-registration work fine and the timer doesn't affect them.
@lexavtanke I understand. But we should avoid these kind of timing based alignments and instead solve it with an event-driven architecture.
In case of:
This way, regardless of timing, when a new ACC comes out, it makes sure everyone registers from scratch.
@xmfcx It seems like we have an issue with cpplint related to https://github.com/cpplint/cpplint/issues/184 after we suppress clang-tidy with // NOLINTNEXTLINE(performance-unnecessary-value-param)
It seems they have added clang-analyzer-
to the list of ignored prefixes for cpplint.
But ours is has a performance-
prefix.
Which is not in the error categories listed here:
I couldn't find a way to add our check to this list. Easiest solution is to just add it as // NOLINTNEXTLINE
without the name.
Another could be to disable cpplint altogether but I won't do that for now.
Edit:
Oh it seems they have added performance-
to the list too with https://github.com/cpplint/cpplint/commit/8eeaf2352ad474c011339342022df7d278ac43c5
will try to use this version.
We can keep it as // NOLINTNEXTLINE
for now and when updated to 1.7.0 or 2.0.0 we can bring it back.
@xmfcx DCO behaves a bit strange on two commits https://github.com/autowarefoundation/autoware.core/pull/73/commits/8ec287d58c1f286b922d5bfca5a92d947adbfdad and https://github.com/autowarefoundation/autoware.core/pull/73/commits/5c03cbb54f336411a5f1c44f13c65c30f4b9dd14
As both of them have Signed-off-by
@lexavtanke we can ignore the DCO checks, as long as the squashed commit has the correct signatures, it is alright. I set it to pass.
Is this ready for review again?
@xmfcx Yes, it is ready. I've changed the info level to debug and remove the startup timer from ACC.
Signed-off-by: M. Fatih Cırıt mfc@leodrive.ai
Description
Related Discussion: https://github.com/orgs/autowarefoundation/discussions/3194 Old PR: https://github.com/autowarefoundation/autoware.core/pull/57
This PR is for creating the base class for all the future Autoware Core nodes to inherit from.
This node will be inheriting from the rclcpp_lifecycle::LifecycleNode with (LifeCycle Readme).
I'll be taking ros2 nav2 stack as reference too while combining Lifecycle and Composable node concepts.
While developing this PR, I am going to explore ideas mentioned in previous conversations in the Autoware community.
Related links
Related links
- Autoware.Auto discussions: https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/821 - Discuss Implementing LifeCycleNode Support on Autoware.Auto Nodes - https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/282 - Autoware Diagnostics and Monitoring - https://gist.github.com/xmfcx/7eaae9750d6317a3a2aa23745ac99444Communication architecture
Communication architecture
### Registration Service **Service server:** `ACC` **Service client:** `AutowareNode` **Roles:** - Register the `AutowareNode` in `ACC` - Receive the UUID for the `AutowareNode` - If already registered, return error ```mermaid sequenceDiagram participant acc as Autoware Control Center participant node as node_a Note right of node: cli_reg node ->>+ acc: name: node_a Note left of acc: srv_reg acc ->> acc: generate a UUID for the node_a acc ->> acc: create a bond between acc and node_a Note left of acc: srv_reg acc -->>- node: UUID and registration success bool Note right of node: cli_reg ``` ### Heart beat bond ~~https://github.com/ros/bond_core/blob/ros2/ros2_migration_readme.md~~ Slow https://github.com/ros-safety/software_watchdogs https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html https://docs.ros.org/en/rolling/Concepts/About-Quality-of-Service-Settings.html#qos-events **Roles:** - As long as the `AutowareNode` is alive, it reports with this bond - If the `AutowareNode` dies, it triggers the `on_broken` callback ### Reporting service **Service server:** `ACC` **Service client:** `AutowareNode` **Roles:** - Reporting state of the `AutowareNode` to the `ACC` - Warning - Error ### Control service **Service server:** `AutowareNode` **Service client:** `ACC` Roles: - Shutting down the `AutowareNode` - Toggle `AutowareNode` between active-inactive?Tasks
Tests performed
PR Review Items
## Notes for reviewers ## Pre-review checklist for the PR author The PR author **must** check the checkboxes below when creating the PR. - [x] I've confirmed the [contribution guidelines]. - [x] The PR follows the [pull request guidelines]. ## In-review checklist for the PR reviewers The PR reviewers **must** check the checkboxes below before approval. - [ ] The PR follows the [pull request guidelines]. - [ ] The PR has been properly tested. - [ ] The PR has been reviewed by the code owners. ## Post-review checklist for the PR author The PR author **must** check the checkboxes below before merging. - [ ] There are no open discussions or they are tracked via tickets. - [ ] The PR is ready for merge. After all checkboxes are checked, anyone who has write access can merge the PR. [contribution guidelines]: https://autowarefoundation.github.io/autoware-documentation/main/contributing/ [pull request guidelines]: https://autowarefoundation.github.io/autoware-documentation/main/contributing/pull-request-guidelines/