Provide human-readable definitions/alarms for RCL(C) initialisation errors

ted-miller commented 1 year ago

@ted-miller: we might want to consider reporting some of these RCL(C) initialisation errors, or defining some subcodes for them.

Especially the NODE, PUBLISHER, SUBSCRIPTION, etc categories seem to be caused by things which could often be solved by users without needing to post on the tracker here.

Originally posted by @gavanderhoorn in https://github.com/Yaskawa-Global/motoros2/discussions/125#discussioncomment-6743301

gavanderhoorn commented 1 year ago

Looking into this a bit: this wouldn't be too much work to implement, but I'm trying to figure out a good place for these kinds of "RCL(C) initialisation errors/alarms" in ErrorHandling.h.

We don't really have a main alarm category for those I believe:

https://github.com/Yaskawa-Global/motoros2/blob/c9a9670d46298fceb4d4197e345c0515f66ca946/src/ErrorHandling.h#L71-L82

The issue title already refers to these kinds of problems as "initialisation errors", so it doesn't seem like they would fit under ALARM_CONFIGURATION_FAIL or ALARM_TASK_CREATE_FAIL. And we only use ALARM_ASSERTION_FAIL for actual assertions, which don't get much text on the PP.

do we still have 'space' for a category dedicated to RCLC_INIT errors/alarms?

We could then raise a regular alarm and provide some descriptive text as part of it.

Edit: we could also make it a 'micro-ROS' category.

gavanderhoorn commented 1 year ago

As to subcodes: initially I thought we could perhaps opt to just 'forward' the various RCL_RET_* defined codes.

However, there is no guarantee those values will never change and that would mean we'd have to keep track of that and update our documentation if that happens.

Two approaches I've come up with:

define a single subcode, reuse one of the alarm categories, and include the RCL(C) error code as part of the message text
define (something like) an ALARM_RCL_RCLC_FAIL category, then (re)define relevant (perhaps all?) RCL_RET_* as subcodes (in a new typedef enum). RCL(C) API failures could then be posted to the PP:
```
mpSetAlarm(ALARM_RCL_RCLC_FAIL, "Invalid node name",
   SUBCODE_RCL_RCLC_FAIL_NODE_INVALID_NAME);
```

the former would be trivial to implement (although we need to figure out which category to use).

The latter would be a little more work, but would make RCL(C) errors more 'explicit' and we could document each of them individually.

I first preferred the second approach, but thinking about it, the former is less work, gets us what we're looking for (a way to document at least some of the RCL(C) initialisation errors/return codes when they could potentially be solved by the user) and wouldn't require adding 30 new subcodes and a new alarm category.

We could still document individual values (ie: 202) in the troubleshooting guide. The entry point would just be a single alarm + subcode, and the text would clarify which RCL(C) error it really is.

ted-miller commented 1 year ago

I'm in favor of the second option. I think that the purpose of raising these new alarms is to make it easier for the user the debug. In that regard, providing an explicit definition of the error (as opposed to some number that must be looked up) is much more user-friendly.

gavanderhoorn commented 1 year ago

It would mean adding quite a few new subcodes, with a lot of documentation to write as well.

gavanderhoorn commented 11 months ago

I'm going to postpone this to after 0.2.0, which would focus on updating the underlying infrastructure (newer Humble packages, Iron compatibility).

gavanderhoorn commented 11 months ago

It would mean adding quite a few new subcodes, with a lot of documentation to write as well.

An additional consideration: IIUC, subcodes are limited to a UCHAR, which would mean a maximum of 255.

I realise there are only so many rcl(c) errors we might want to directly map, but they'd still be taking up valuable 'space'.

ted-miller commented 6 months ago

I just had another occurrence of this and wasted a good amount of time trying to remember how to track down the rcl error codes.

This shouldn't be difficult to implement, so I retargeted the next milestone.

gavanderhoorn commented 6 months ago

Should this be 0.2.1 instead?

0.2.0 was really intended to be just about updating libmicroros and adding Iron support.

ted-miller commented 6 months ago

Sure. I just want to get it on the todo list.

jimmy-mcelwain commented 4 days ago

I am working on this, but I wanted to point out that the alarm will no longer contain information about where the problem happened that it currently does. For example, right now we have

ret = rclc_service_init_default(&g_serviceQueueTrajPoint, &g_microRosNodeInfo.node, type_support, SERVICE_NAME_QUEUE_TRAJ_POINT);
motoRosAssert((ret == RCL_RET_OK), SUBCODE_FAIL_CREATE_SERVICE_QUEUE_POINT);

The subcode here shows the user where the error happened (initializing the /queue_traj_point service in this case) but it does not tell the user what the problem is.

The proposed solution would describe to the user what the problem is, but it would fail to describe where that problem happened. So if the user receives an alarm with the message "Invalid node name", the user may not know whether which node this applies to.

There is not enough space in the alarm message (32 chars) to fully describe what went wrong. The debug broadcast could be used to give more detail, but the error message won't contain all the information that the user needs to solve it. It doesn't match the principle that @ted-miller describes here

Also, should I include "RCL/RCLC" in every alarm to make it clear, or can we assume that the user will look up the error code and see that the main alarm code is specifically for RCL/RCLC? This isn't as important, but I have been including "RCL/RCLC" in each alarm message, and it would be nice to have an extra few characters to use to write the actual alarm message.

ted-miller commented 4 days ago

As we discussed, you can use multiple concurrent alarms to provide additional data.

I'd suggest ommiting the "RCL/RCLC", as that's half your message. We don't need users to know that the code came from that layer necessarily. They can look up the code/message in our troubleshooting guide. Then we can provide additional detail there.

gavanderhoorn commented 1 day ago

@jimmy-mcelwain: would you already have pushed your changes somewhere?

RCL(C) provides some ways of retrieving a humand readable error message, so I'm curious what your approach has been so far.

We would really want to avoid defining a set of (shadow) codes ourselves I believe.

As we discussed, you can use multiple concurrent alarms to provide additional data.

I was indeed also going to suggest something like that.

jimmy-mcelwain commented 6 hours ago

I just pushed a branch report_rcl_errors. It is not complete.

I did define a set of codes shadow codes myself. The issue is that the RCL_RET error codes defined in types.h range from 0 to 3001, but our alarm subcodes can only go up to 256.

Similarly, the alarm messages have a character limit of 32, so I have been manually creating them based on documentation rather than using any provided human-readable messages. That being said I can still print the provided string (I believe using rcutils_get_error_string()) to the debug log

And while I can raise multiple alarms, something will have to change. As of right now motoRosAssertand motoRosAssert_withMsg both print to the debug log forever if the assertion is false/the alarms are triggered. If we want to trigger multiple alarms, I am thinking that instead of doing this, each "assert" function should just check the condition and if it needs to alarm, puts it onto a queue of alarms and then there is a RaiseAlarms() function or something similar that gets called after the assert conditions are checked. This function will raise the alarms in the queue and print to the debug log forever. That way, all the alarms and the full group of debug messages for alarms can be printed.

jimmy-mcelwain commented 6 hours ago

There is also a bit of overlap with #268

Yaskawa-Global / motoros2

Provide human-readable definitions/alarms for RCL(C) initialisation errors #128