Open jsturdy opened 5 years ago
Considering that the trigger for this discussion is a very specific method and the error is set during the register access, I would not remove the setting of the error key in the register access methods. A better (and rather simple) solution to me would be to provide a method to remove/revoke key setting. This requires small changes to the RPCSvc code which has to be agreed (and propagated to) UW.
Considering that the trigger for this discussion is a very specific method and the error is set during the register access, I would not remove the setting of the error key in the register access methods.
Well there was a discussion in the design of the GBT functions directly; and in some sense I think the design of those functions was limited because of how the error
key is inserted and the lack of error handling in the repo. I'm referring to this discussion.
It seems like having to check if an error key should be removed (particularly when this amount of checks could grow inside nested loops) is not ideal and could introduce unwanted processing time on the Zynq.
Rather than relying on UW to change the RPCSvc
code (which they may reject) I think we could just add to the readReg/readAddress
and writeReg/writeAddress
methods a flag errorsWillBeRaised
(or shorter name). This flag would default to True
and then the job of the developer is specifically to disable it when not desired.
- Most error handling should probably be done before the RPC callback, and either forwarded back to the RPC method (which would then set the
error
key), or handled based on the error in the function itself.
What types of errors do we want to handle. I think this is the biggest question. The following come to mind:
read memsvc error: Bus error accessing [some address] failed 10 times
Assertion Failed on line 21: rpc.load_module("extras", "extras v1.0.1")
[Connection error] RPC connection failed
RPCErrorException
(presumably by xhal
code).Caught RPCErrorException: Unable to connect: Connection refused
[Connection error] RPC connection failed
And Example 4: One additional exception that could arise, and be recovered automatically, in my view is a loss of sync of the front-end. This could be assigned if a bus error occurs and the corresponding sync_error_counter is nonzero (we could also subdivide this into the case where the bus error is generated but the sync_error_counter is 0x0). We might consider this being automatically handled by issuing a link reset (GEM_AMC.GEM_SYSTEM.CTRL.LINK_RESET
) and then trying whatever transaction that caused the issue (read/write) again. The link reset does not cause a loss of the front-end configuration. Either way we choose to handle it, this is a type of error that could benefit from classification: e.g. SyncErrorException
.
Example | Exception Type |
---|---|
1 | BusError |
2 | ModuleLoadError |
3 | ConnectionError |
4 | SyncError |
We also have the "Node Not Found" or "Address Not Found" cases. But I am not sure if those warrant handling as they are probably an indication of either 1) bad design by developer (e.g. trying to access a non-existent node) or 2) bad backend configuration (e.g. wrong address table, lmdb, fw version, etc...).
Outside, in the calling code, handling of RPC
error
key messages should be done; incmsgemos
my current implementation is throwing an exception if theerror
key exists, and populating the exception with the message.
- This has the benefit that it requires no special processing of the
error
message string (it simply assumes that things that could be handled were, and now it just needs to alert that things are not right)- It has the drawback that if the
error
message encodes something that could be parsed in the calling code and trigger some action, one would then have to parse the exception (not really a problem)
I think if we provide 4 separate type of exceptions for the cases above we would not need to parse the string. Alternatively the error string could just be made to include a keyword that is the type of error message above (see table above for keyword suggestions).
- The question I would raise would be, why would we want this behaviour? What types of routines would benefit from this type of fine-grained handling of errors that the generic exception wouldn't?
The following functions or error handling comes to mind:
SyncError
exception. Right now a call of vfatSyncCheckLocal()
is usually performed and a function exits if the return call of this is not the bit flipped version of the vfatmask
BusError
could benefit from an attempt to handle; in some cases a BusError
is really a disguised SyncError
. There is some ambiguity here between the two error types (I admit) but the problem is they are operationally hard to distinguish.ConnectionError
could include attempting to see if the rpcsvc
service is running on the card and if not starting it as gemuser
then retrying the previous action. Although one could argue this is something that cannot be recovered from as it indicates the back-end is in a bad state...Just some thoughts on the matter. Hopefully they are constructive.
It seems like having to check if an error key should be removed (particularly when this amount of checks could grow inside nested loops) is not ideal and could introduce unwanted processing time on the Zynq.
I disagree here. It doesn't matter if you check whether you want to set the error
key or whether you want to remove it.
Rather than relying on UW to change the RPCSvc code (which they may reject)
They should not. The proposed change adds functionality while not touching anything beyond it.
I think we could just add to the readReg/readAddress and writeReg/writeAddress methods a flag errorsWillBeRaised (or shorter name). This flag would default to True and then the job of the developer is specifically to disable it when not desired.
While this is a possible solution, one should not forget that such change will introduce check on whether the error
key should be set on every failed transaction which may slow down certain operations which are not related to GBT. On the other hand, a solution I proposed won't affect operations which do not require ignoring the register access errors. E.g. we have 10 retries on each failed register read to check against not stable connection or timeouts due to AXI bus being busy.
I disagree here. It doesn't matter if you check whether you want to set the error key or whether you want to remove it.
I was thinking about the added time for one set of transactions (e.g. read N_OH's * N_VFAT's). But this is probably a much smaller added time than the case you outlined below (e.g. checking if insertErrorKey
is true
in all failed transactions).
...one should not forget that such change will introduce check on whether the error key should be set on every failed transaction which may slow down certain operations...
Yes indeed that would be bad.
The proposed change adds functionality while not touching anything beyond it.
I am not opposed to this idea (adding to the RPCSvc
) and of course it could be a simple solution; just playing devil's advocate and attempting to explore additional ideas.
One more consideration: actually an error
key is only an additional key in the RPC message, it is not an exception. In most of the cases it indeed means something went wrong, but it is up to us (the caller code) to decide whether we want to throw an exception in such case or not. So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.
One more consideration: actually an
error
key is only an additional key in the RPC message, it is not an exception. In most of the cases it indeed means something went wrong, but it is up to us (the caller code) to decide whether we want to throw an exception in such case or not. So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.
Yes if we went this route the caller code would need to always check and parse the error message; and we should adopt some standardisations for this error message.
(Started writing this but had to leave, so this may not be salient any longer)
I think we could just add to the
readReg
/readAddress
andwriteReg
/writeAddress
methods a flag
This is a bad way to go, as it will break some common API things that we have already worked hard to establish (a common interface viz the uhal
interface).
I think the proper way is to simply raise exceptions and handle them (and yes, sometimes handling means putting an error
key in the RPC message)
The point I think we should keep in mind here is that exceptions (loosely defined) can propagate on the card, and on the host PC, but not between. It is for this reason (and this reason alone, as far as I understand) that we have the error
key: to indicate across this boundary that the functional contract was not upheld.
This I why I prefer pushing the setting of any RPC message error
key as close to the RPC callback itself.
@mexanick's proposal of a remove_key
function can work, but would ad some overhead, and what I would like to better understand is this: in current usage, is there anything that cares about the error key between the RPC callback and the low-level read
/write
?
So in certain cases it can be simply ignored or at least do not act as a showstopper... Instead, an exception should be thrown if such error is a real deal.
Can you provide current (or potential) examples of this? Because I want to understand if the current handler now can be adapted to handle this generically, or if it would need a case-by-case treatment. I preferred the generic handler route because it avoids adding 15 lines of code for every RPC call...
The goal here should be robust operation coupled with clear delineation of error conditions.
During the GBT module writing, it was not especially easy to write functions where all return codes were checked. So, I also began to think about exception handling. Here are some of the reflections I had about :
The exceptions are propagated from the RPC module to the RPC client (via a special key, prefixed with __
for example) in order to be sure not to crash the RPC process, but report as much information as possible. This could be done through some kind of wrapper (which would also solve the concern raised here about the getLocalArgs
function). In pseudo-code that gives :
auto exportRPCMethod(std::function<uint32_t(LocalArgs &la)> func){
auto return_function = [=](const RPCMsg *request, RPCMsg *response)
{
CREATE_LOCAL_ARGS();
try {
function_return_value = func(la);
// Fill a special RPC key with the return value
}
catch (const std::exception& e) {
// Fill special RPC keys
}
}
return return_function:
}
static auto writeGBTConfig_EXPORTED = exportRPCMethod(writeGBTConfig); // Could be a macro uin32_t writeGBTConfig(LocalArgs &la) { ... }
/ in module_init() / modmgr->register_method("gbt", "writeGBTConfig", &writeGBTConfig_EXPORTED);
* [@bdorney's exception classification](https://github.com/cms-gem-daq-project/ctp7_modules/issues/85#issuecomment-460250057) which eases the exception parsing could also be propagated from the RPC module to the RPC client (via another special key). In this case, the exception classes could reside in the shared part of `xHAL`. Moreover, some xHAL exceptions would directly be thrown from the xHAL client side (`ModuleLoadError`, `ConnectionError`, ...)
* Regarding the normal execution flow, some functions would need multiple (two ?) signatures depending if they return an error code or they throw an exception. For the `readReg` example :
``` C++
uint32_t readReg(localArgs * la, const std::string & regName);
bool readReg(localArgs * la, const std::string & regName, uint32_t ®Value) noexcept;
Such functions would solve the "GBT case" as well as all other cases where one would rely on errors for proper operation.
error
key. I understand it does not necessarily means that an exception should be thrown, but only that something bad happened. So, should the exceptions be merged with the error
key ? Or would it be preferable to fill an array of error
's in order to get some kind of trace while the exception immediately aborts the call ?A few comments:
xhal
exceptions are currently being caught and rethrown as necessary, what we're trying to do here is only establish the module side error handling.Moreover, what needs to drive this is the actual contract the functions are asked to enforce. Take for instance the specific case of this GBT phase tool that started this discussion. What is the contract that the caller expects when calling this function?
error
key can indicate this second part, but handling the read
"exception" (not using this term as, e.g., std::exception
, though I do have a preference for that mechanism in this discussion) in the first case does not need to result in an error being returned.
Brief summary of issue
This issue will attempt to normalize and define how errors are handled inside the CTP7 modules and corresponding RPC error keys.
Types of issue
Considerations
error
key is set, then the thing making the RPC call should expect that there was a failure that couldn't be handled inside the method itself.error
key), or handled based on the error in the function itself.error
key messages should be done; incmsgemos
my current implementation is throwing an exception if theerror
key exists, and populating the exception with the message.error
message string (it simply assumes that things that could be handled were, and now it just needs to alert that things are not right)error
message encodes something that could be parsed in the calling code and trigger some action, one would then have to parse the exception (not really a problem)