cms-gem-daq-project / ctp7_modules

0 stars 13 forks source link

[feature-request] Define behaviour of error conditions, handling, and passing #85

Open jsturdy opened 5 years ago

jsturdy commented 5 years ago

Brief summary of issue

This issue will attempt to normalize and define how errors are handled inside the CTP7 modules and corresponding RPC error keys.

Types of issue

Considerations

mexanick commented 5 years ago

Considering that the trigger for this discussion is a very specific method and the error is set during the register access, I would not remove the setting of the error key in the register access methods. A better (and rather simple) solution to me would be to provide a method to remove/revoke key setting. This requires small changes to the RPCSvc code which has to be agreed (and propagated to) UW.

bdorney commented 5 years ago

Considering that the trigger for this discussion is a very specific method and the error is set during the register access, I would not remove the setting of the error key in the register access methods.

Well there was a discussion in the design of the GBT functions directly; and in some sense I think the design of those functions was limited because of how the error key is inserted and the lack of error handling in the repo. I'm referring to this discussion.

It seems like having to check if an error key should be removed (particularly when this amount of checks could grow inside nested loops) is not ideal and could introduce unwanted processing time on the Zynq.

Rather than relying on UW to change the RPCSvc code (which they may reject) I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag errorsWillBeRaised (or shorter name). This flag would default to True and then the job of the developer is specifically to disable it when not desired.

bdorney commented 5 years ago
  • Most error handling should probably be done before the RPC callback, and either forwarded back to the RPC method (which would then set the error key), or handled based on the error in the function itself.

What types of errors do we want to handle. I think this is the biggest question. The following come to mind:

read memsvc error: Bus error accessing [some address] failed 10 times
Assertion Failed on line 21: rpc.load_module("extras", "extras v1.0.1")
[Connection error] RPC connection failed
Caught RPCErrorException: Unable to connect: Connection refused
[Connection error] RPC connection failed

And Example 4: One additional exception that could arise, and be recovered automatically, in my view is a loss of sync of the front-end. This could be assigned if a bus error occurs and the corresponding sync_error_counter is nonzero (we could also subdivide this into the case where the bus error is generated but the sync_error_counter is 0x0). We might consider this being automatically handled by issuing a link reset (GEM_AMC.GEM_SYSTEM.CTRL.LINK_RESET) and then trying whatever transaction that caused the issue (read/write) again. The link reset does not cause a loss of the front-end configuration. Either way we choose to handle it, this is a type of error that could benefit from classification: e.g. SyncErrorException.

Example Exception Type
1 BusError
2 ModuleLoadError
3 ConnectionError
4 SyncError

We also have the "Node Not Found" or "Address Not Found" cases. But I am not sure if those warrant handling as they are probably an indication of either 1) bad design by developer (e.g. trying to access a non-existent node) or 2) bad backend configuration (e.g. wrong address table, lmdb, fw version, etc...).

  • Outside, in the calling code, handling of RPC error key messages should be done; in cmsgemos my current implementation is throwing an exception if the error key exists, and populating the exception with the message.

    • This has the benefit that it requires no special processing of the error message string (it simply assumes that things that could be handled were, and now it just needs to alert that things are not right)
    • It has the drawback that if the error message encodes something that could be parsed in the calling code and trigger some action, one would then have to parse the exception (not really a problem)

I think if we provide 4 separate type of exceptions for the cases above we would not need to parse the string. Alternatively the error string could just be made to include a keyword that is the type of error message above (see table above for keyword suggestions).

  • The question I would raise would be, why would we want this behaviour? What types of routines would benefit from this type of fine-grained handling of errors that the generic exception wouldn't?

The following functions or error handling comes to mind:

Just some thoughts on the matter. Hopefully they are constructive.

mexanick commented 5 years ago

It seems like having to check if an error key should be removed (particularly when this amount of checks could grow inside nested loops) is not ideal and could introduce unwanted processing time on the Zynq.

I disagree here. It doesn't matter if you check whether you want to set the error key or whether you want to remove it.

Rather than relying on UW to change the RPCSvc code (which they may reject)

They should not. The proposed change adds functionality while not touching anything beyond it.

I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag errorsWillBeRaised (or shorter name). This flag would default to True and then the job of the developer is specifically to disable it when not desired.

While this is a possible solution, one should not forget that such change will introduce check on whether the error key should be set on every failed transaction which may slow down certain operations which are not related to GBT. On the other hand, a solution I proposed won't affect operations which do not require ignoring the register access errors. E.g. we have 10 retries on each failed register read to check against not stable connection or timeouts due to AXI bus being busy.

bdorney commented 5 years ago

I disagree here. It doesn't matter if you check whether you want to set the error key or whether you want to remove it.

I was thinking about the added time for one set of transactions (e.g. read N_OH's * N_VFAT's). But this is probably a much smaller added time than the case you outlined below (e.g. checking if insertErrorKey is true in all failed transactions).

...one should not forget that such change will introduce check on whether the error key should be set on every failed transaction which may slow down certain operations...

Yes indeed that would be bad.

The proposed change adds functionality while not touching anything beyond it.

I am not opposed to this idea (adding to the RPCSvc) and of course it could be a simple solution; just playing devil's advocate and attempting to explore additional ideas.

mexanick commented 5 years ago

One more consideration: actually an error key is only an additional key in the RPC message, it is not an exception. In most of the cases it indeed means something went wrong, but it is up to us (the caller code) to decide whether we want to throw an exception in such case or not. So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.

bdorney commented 5 years ago

One more consideration: actually an error key is only an additional key in the RPC message, it is not an exception. In most of the cases it indeed means something went wrong, but it is up to us (the caller code) to decide whether we want to throw an exception in such case or not. So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.

Yes if we went this route the caller code would need to always check and parse the error message; and we should adopt some standardisations for this error message.

jsturdy commented 5 years ago

(Started writing this but had to leave, so this may not be salient any longer)

I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag

This is a bad way to go, as it will break some common API things that we have already worked hard to establish (a common interface viz the uhal interface). I think the proper way is to simply raise exceptions and handle them (and yes, sometimes handling means putting an error key in the RPC message)

The point I think we should keep in mind here is that exceptions (loosely defined) can propagate on the card, and on the host PC, but not between. It is for this reason (and this reason alone, as far as I understand) that we have the error key: to indicate across this boundary that the functional contract was not upheld. This I why I prefer pushing the setting of any RPC message error key as close to the RPC callback itself. @mexanick's proposal of a remove_key function can work, but would ad some overhead, and what I would like to better understand is this: in current usage, is there anything that cares about the error key between the RPC callback and the low-level read/write?

So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.

Can you provide current (or potential) examples of this? Because I want to understand if the current handler now can be adapted to handle this generically, or if it would need a case-by-case treatment. I preferred the generic handler route because it avoids adding 15 lines of code for every RPC call...

The goal here should be robust operation coupled with clear delineation of error conditions.

lpetre-ulb commented 5 years ago

During the GBT module writing, it was not especially easy to write functions where all return codes were checked. So, I also began to think about exception handling. Here are some of the reflections I had about :

static auto writeGBTConfig_EXPORTED = exportRPCMethod(writeGBTConfig); // Could be a macro uin32_t writeGBTConfig(LocalArgs &la) { ... }

/ in module_init() / modmgr->register_method("gbt", "writeGBTConfig", &writeGBTConfig_EXPORTED);

* [@bdorney's exception classification](https://github.com/cms-gem-daq-project/ctp7_modules/issues/85#issuecomment-460250057) which eases the exception parsing could also be propagated from the RPC module to the RPC client (via another special key). In this case, the exception classes could reside in the shared part of `xHAL`. Moreover, some xHAL exceptions would directly be thrown from the xHAL client side (`ModuleLoadError`, `ConnectionError`, ...)
* Regarding the normal execution flow, some functions would need multiple (two ?) signatures depending if they return an error code or they throw an exception. For the `readReg` example :
``` C++
uint32_t readReg(localArgs * la, const std::string & regName);
bool readReg(localArgs * la, const std::string & regName, uint32_t &regValue) noexcept;

Such functions would solve the "GBT case" as well as all other cases where one would rely on errors for proper operation.

jsturdy commented 5 years ago

A few comments:

Moreover, what needs to drive this is the actual contract the functions are asked to enforce. Take for instance the specific case of this GBT phase tool that started this discussion. What is the contract that the caller expects when calling this function?