[feature-request] Define behaviour of error conditions, handling, and passing

jsturdy commented 5 years ago

Brief summary of issue

This issue will attempt to normalize and define how errors are handled inside the CTP7 modules and corresponding RPC error keys.

Types of issue

[X] Discussion
[X] Feature request (request for change which adds functionality)

Considerations

If an RPC message error key is set, then the thing making the RPC call should expect that there was a failure that couldn't be handled inside the method itself.
Most error handling should probably be done before the RPC callback, and either forwarded back to the RPC method (which would then set the error key), or handled based on the error in the function itself.
Outside, in the calling code, handling of RPC error key messages should be done; in cmsgemos my current implementation is throwing an exception if the error key exists, and populating the exception with the message.
- This has the benefit that it requires no special processing of the error message string (it simply assumes that things that could be handled were, and now it just needs to alert that things are not right)
- It has the drawback that if the error message encodes something that could be parsed in the calling code and trigger some action, one would then have to parse the exception (not really a problem)
- The question I would raise would be, why would we want this behaviour? What types of routines would benefit from this type of fine-grained handling of errors that the generic exception wouldn't?

mexanick commented 5 years ago

Considering that the trigger for this discussion is a very specific method and the error is set during the register access, I would not remove the setting of the error key in the register access methods. A better (and rather simple) solution to me would be to provide a method to remove/revoke key setting. This requires small changes to the RPCSvc code which has to be agreed (and propagated to) UW.

bdorney commented 5 years ago

Considering that the trigger for this discussion is a very specific method and the error is set during the register access, I would not remove the setting of the error key in the register access methods.

Well there was a discussion in the design of the GBT functions directly; and in some sense I think the design of those functions was limited because of how the error key is inserted and the lack of error handling in the repo. I'm referring to this discussion.

It seems like having to check if an error key should be removed (particularly when this amount of checks could grow inside nested loops) is not ideal and could introduce unwanted processing time on the Zynq.

Rather than relying on UW to change the RPCSvc code (which they may reject) I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag errorsWillBeRaised (or shorter name). This flag would default to True and then the job of the developer is specifically to disable it when not desired.

bdorney commented 5 years ago

Most error handling should probably be done before the RPC callback, and either forwarded back to the RPC method (which would then set the error key), or handled based on the error in the function itself.

What types of errors do we want to handle. I think this is the biggest question. The following come to mind:

Example 1: Issue in an atomic transaction (read/write). This raises a Bus Error:

read memsvc error: Bus error accessing [some address] failed 10 times

Example 2: Issue in loading an module (e.g. due to a mismatch of SW on the card vs. calling PC). This raises a Connection Error:

Assertion Failed on line 21: rpc.load_module("extras", "extras v1.0.1")
[Connection error] RPC connection failed

Example 3: Issue connecting to the card, this raises a Connection Error; but it is assigned an RPCErrorException (presumably by xhal code).

Caught RPCErrorException: Unable to connect: Connection refused
[Connection error] RPC connection failed

And Example 4: One additional exception that could arise, and be recovered automatically, in my view is a loss of sync of the front-end. This could be assigned if a bus error occurs and the corresponding sync_error_counter is nonzero (we could also subdivide this into the case where the bus error is generated but the sync_error_counter is 0x0). We might consider this being automatically handled by issuing a link reset (GEM_AMC.GEM_SYSTEM.CTRL.LINK_RESET) and then trying whatever transaction that caused the issue (read/write) again. The link reset does not cause a loss of the front-end configuration. Either way we choose to handle it, this is a type of error that could benefit from classification: e.g. SyncErrorException.

Example	Exception Type
1	`BusError`
2	`ModuleLoadError`
3	`ConnectionError`
4	`SyncError`

We also have the "Node Not Found" or "Address Not Found" cases. But I am not sure if those warrant handling as they are probably an indication of either 1) bad design by developer (e.g. trying to access a non-existent node) or 2) bad backend configuration (e.g. wrong address table, lmdb, fw version, etc...).

Outside, in the calling code, handling of RPC error key messages should be done; in cmsgemos my current implementation is throwing an exception if the error key exists, and populating the exception with the message.

This has the benefit that it requires no special processing of the error message string (it simply assumes that things that could be handled were, and now it just needs to alert that things are not right)

It has the drawback that if the error message encodes something that could be parsed in the calling code and trigger some action, one would then have to parse the exception (not really a problem)

I think if we provide 4 separate type of exceptions for the cases above we would not need to parse the string. Alternatively the error string could just be made to include a keyword that is the type of error message above (see table above for keyword suggestions).

The question I would raise would be, why would we want this behaviour? What types of routines would benefit from this type of fine-grained handling of errors that the generic exception wouldn't?

The following functions or error handling comes to mind:

Anything that loops over VFATs could benefit from attempting to handle a SyncError exception. Right now a call of vfatSyncCheckLocal() is usually performed and a function exits if the return call of this is not the bit flipped version of the vfatmask
Things that raise a BusError could benefit from an attempt to handle; in some cases a BusError is really a disguised SyncError. There is some ambiguity here between the two error types (I admit) but the problem is they are operationally hard to distinguish.
Handling of a ConnectionError could include attempting to see if the rpcsvc service is running on the card and if not starting it as gemuser then retrying the previous action. Although one could argue this is something that cannot be recovered from as it indicates the back-end is in a bad state...

Just some thoughts on the matter. Hopefully they are constructive.

mexanick commented 5 years ago

It seems like having to check if an error key should be removed (particularly when this amount of checks could grow inside nested loops) is not ideal and could introduce unwanted processing time on the Zynq.

I disagree here. It doesn't matter if you check whether you want to set the error key or whether you want to remove it.

Rather than relying on UW to change the RPCSvc code (which they may reject)

They should not. The proposed change adds functionality while not touching anything beyond it.

I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag errorsWillBeRaised (or shorter name). This flag would default to True and then the job of the developer is specifically to disable it when not desired.

While this is a possible solution, one should not forget that such change will introduce check on whether the error key should be set on every failed transaction which may slow down certain operations which are not related to GBT. On the other hand, a solution I proposed won't affect operations which do not require ignoring the register access errors. E.g. we have 10 retries on each failed register read to check against not stable connection or timeouts due to AXI bus being busy.

bdorney commented 5 years ago

I disagree here. It doesn't matter if you check whether you want to set the error key or whether you want to remove it.

I was thinking about the added time for one set of transactions (e.g. read N_OH's * N_VFAT's). But this is probably a much smaller added time than the case you outlined below (e.g. checking if insertErrorKey is true in all failed transactions).

...one should not forget that such change will introduce check on whether the error key should be set on every failed transaction which may slow down certain operations...

Yes indeed that would be bad.

The proposed change adds functionality while not touching anything beyond it.

I am not opposed to this idea (adding to the RPCSvc) and of course it could be a simple solution; just playing devil's advocate and attempting to explore additional ideas.

mexanick commented 5 years ago

One more consideration: actually an error key is only an additional key in the RPC message, it is not an exception. In most of the cases it indeed means something went wrong, but it is up to us (the caller code) to decide whether we want to throw an exception in such case or not. So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.

bdorney commented 5 years ago

One more consideration: actually an error key is only an additional key in the RPC message, it is not an exception. In most of the cases it indeed means something went wrong, but it is up to us (the caller code) to decide whether we want to throw an exception in such case or not. So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.

Yes if we went this route the caller code would need to always check and parse the error message; and we should adopt some standardisations for this error message.

jsturdy commented 5 years ago

(Started writing this but had to leave, so this may not be salient any longer)

I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag

This is a bad way to go, as it will break some common API things that we have already worked hard to establish (a common interface viz the uhal interface). I think the proper way is to simply raise exceptions and handle them (and yes, sometimes handling means putting an error key in the RPC message)

The point I think we should keep in mind here is that exceptions (loosely defined) can propagate on the card, and on the host PC, but not between. It is for this reason (and this reason alone, as far as I understand) that we have the error key: to indicate across this boundary that the functional contract was not upheld. This I why I prefer pushing the setting of any RPC message error key as close to the RPC callback itself. @mexanick's proposal of a remove_key function can work, but would ad some overhead, and what I would like to better understand is this: in current usage, is there anything that cares about the error key between the RPC callback and the low-level read/write?

So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.

Can you provide current (or potential) examples of this? Because I want to understand if the current handler now can be adapted to handle this generically, or if it would need a case-by-case treatment. I preferred the generic handler route because it avoids adding 15 lines of code for every RPC call...

The goal here should be robust operation coupled with clear delineation of error conditions.

lpetre-ulb commented 5 years ago

During the GBT module writing, it was not especially easy to write functions where all return codes were checked. So, I also began to think about exception handling. Here are some of the reflections I had about :

Do not use the exceptions in the normal execution flow, but only when something not locally recoverable occurs : incorrect function parameters, link down, unexpected bus error, ... These exceptions could exceptionally (it is not Python and while not throwing an exception is free, catching one is slow) be caught within the RPC module if ever it is recoverable.

The exceptions are propagated from the RPC module to the RPC client (via a special key, prefixed with __ for example) in order to be sure not to crash the RPC process, but report as much information as possible. This could be done through some kind of wrapper (which would also solve the concern raised here about the getLocalArgs function). In pseudo-code that gives :


auto exportRPCMethod(std::function<uint32_t(LocalArgs &la)> func){
auto return_function = [=](const RPCMsg *request, RPCMsg *response)
{
    CREATE_LOCAL_ARGS();

    try {
        function_return_value = func(la);
        // Fill a special RPC key with the return value
    }
    catch (const std::exception& e) {
        // Fill special RPC keys
    }
}

return return_function:
}

static auto writeGBTConfig_EXPORTED = exportRPCMethod(writeGBTConfig); // Could be a macro uin32_t writeGBTConfig(LocalArgs &la) { ... }

/ in module_init() / modmgr->register_method("gbt", "writeGBTConfig", &writeGBTConfig_EXPORTED);

* [@bdorney's exception classification](https://github.com/cms-gem-daq-project/ctp7_modules/issues/85#issuecomment-460250057) which eases the exception parsing could also be propagated from the RPC module to the RPC client (via another special key). In this case, the exception classes could reside in the shared part of `xHAL`. Moreover, some xHAL exceptions would directly be thrown from the xHAL client side (`ModuleLoadError`, `ConnectionError`, ...)
* Regarding the normal execution flow, some functions would need multiple (two ?) signatures depending if they return an error code or they throw an exception. For the `readReg` example :
``` C++
uint32_t readReg(localArgs * la, const std::string & regName);
bool readReg(localArgs * la, const std::string & regName, uint32_t &regValue) noexcept;

Such functions would solve the "GBT case" as well as all other cases where one would rely on errors for proper operation.

One last comment about the error key. I understand it does not necessarily means that an exception should be thrown, but only that something bad happened. So, should the exceptions be merged with the error key ? Or would it be preferable to fill an array of error's in order to get some kind of trace while the exception immediately aborts the call ?

jsturdy commented 5 years ago

A few comments:

The xhal exceptions are currently being caught and rethrown as necessary, what we're trying to do here is only establish the module side error handling.
I'm fairly strongly opposed to duplicating functions for the sole purpose of adding a return value and a return code implementation side-by side. It adds complexity without a big benefit: the API should do one or the other, but not both.
The proposed generic wrapper works only so long as no parsing of the RPC request message is necessary to pass along arguments, as such, I again feel that adding this additional wrapper layer adds complexity when it isn't necessary.

Moreover, what needs to drive this is the actual contract the functions are asked to enforce. Take for instance the specific case of this GBT phase tool that started this discussion. What is the contract that the caller expects when calling this function?

Scan all phases of all VFATs (or at least those specified) and report back whether each phase results in good/bad communication?
Set the phase of each VFAT to an appropriate value, based on the scan?
As mentioned in #84, this routine by definition will take the hardware into a regime where normal behaviour is no longer guaranteed, but the function should know (and expect) this, and should be able to enforce the contract of "return the status at each phase", but may not be able to guarantee that the correct phase was set. In this case, the error key can indicate this second part, but handling the read "exception" (not using this term as, e.g., std::exception, though I do have a preference for that mechanism in this discussion) in the first case does not need to result in an error being returned.

cms-gem-daq-project / ctp7_modules