Non recoverable error conditions and reporting implementation status

irudkin commented 7 years ago

Moved from issue #8 (Bugzilla 16059) as discussed at the meeting on 30/01/2017, where it was deemed better as a new, separate issue. To be re-worded, removing list item 3 of comment 3. Re-edit to mention watchdog or cross-checking verification systems that allow status to be reported. This issue is related to issue #8.

Comment 3 illya@codeplay.com 2016-11-15 03:49:39 PST

Discuss: Ideally there should not be any non-recoverable error conditions, and implementers should treat that as a guideline. However, if there are non-recoverable states, then the client (developers using the SC API) should be made aware of the following:

all known non-recoverable states
an error code if any for each non-recoverable state
how the non-recoverable state can be entered
how to recover from a recoverable state and return to the normal expected behaviour state
if the non-recoverable state has been entered, whether the implementation switches to a safety mode of operation, i.e. continues to work but at reduced functionality. State the behaviour when in this mode.

This would allow the client to develop tests for those states and, where applicable, verify that their recovery processes work.
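As an illustration of the list above, here is a minimal sketch of how an implementer's documentation could be held in machine-readable form so that a client test suite can check it for completeness. Everything in it is hypothetical: the `ErrorCode` values, the `StateDoc` fields and the example entries are assumptions made for illustration and do not come from any SC API.

```python
from dataclasses import dataclass
from enum import IntEnum


class ErrorCode(IntEnum):
    """Hypothetical error codes; real values are implementation specific."""
    OK = 0
    DEVICE_LOST = 1      # treated as non-recoverable in this sketch
    OUT_OF_MEMORY = 2    # treated as recoverable in this sketch


@dataclass(frozen=True)
class StateDoc:
    """One documented error state, following the list items above."""
    code: ErrorCode
    recoverable: bool
    entered_by: str      # how the state can be entered
    safety_mode: str     # behaviour once entered, "" if there is none


# Hypothetical documentation table supplied by the implementer.
DOCUMENTED_STATES = {
    ErrorCode.DEVICE_LOST: StateDoc(
        ErrorCode.DEVICE_LOST, False,
        "hardware stops responding",
        "reduced functionality: new work is rejected"),
    ErrorCode.OUT_OF_MEMORY: StateDoc(
        ErrorCode.OUT_OF_MEMORY, True,
        "allocation exceeds the configured pool",
        ""),
}


def undocumented_codes():
    """Return every error code that lacks a documentation entry."""
    return [c for c in ErrorCode
            if c is not ErrorCode.OK and c not in DOCUMENTED_STATES]


def non_recoverable_codes():
    """Return the codes a client must test without expecting recovery."""
    return sorted(c for c, d in DOCUMENTED_STATES.items() if not d.recoverable)
```

A client could fail its build when `undocumented_codes()` is not empty, and drive one test per entry returned by `non_recoverable_codes()`.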

Comment 4 illya@codeplay.com 2016-11-22 00:36:27 PST

From the discussion, the intention of comment 3 was not clear. The implementer should provide as much documentation as possible on the reasons for the undefined state behaviour represented by a returned error code. This is very much implementation specific, so nothing more can be added to the guidelines apart from a strong recommendation. A user who has done their due diligence should be asking such questions anyway, especially if the implementation is a black box - what are the side effects?

irudkin commented 7 years ago

This issue is related to issue #8. New text:

For a system to improve during development, or to maintain its safety integrity during commission, it is advisable that critical systems (systems that risk injury or death) operate alongside a separate system that monitors their status. Normally known as a Watchdog (not a Watchdog timer, which resets the system), it is a system that can cross-check with other systems if necessary and report both its own status and the status of the system it is monitoring through a separate API. The safety critical API needs to accommodate this kind of reporting. Other considerations are mentioned below.
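The following is a minimal sketch of such a monitoring Watchdog, assuming a heartbeat plus a single cross-checked value. The `Watchdog` class, its method names, the `MonitoredStatus` values and the tolerance-based comparison are all hypothetical choices made for illustration. Note that the monitor only observes and reports; it never resets the monitored system.

```python
from enum import Enum


class MonitoredStatus(Enum):
    """Hypothetical status values reported for the monitored system."""
    HEALTHY = "healthy"
    UNRESPONSIVE = "unresponsive"   # no heartbeat within the timeout
    MISMATCH = "mismatch"           # cross-check against the reference failed


class Watchdog:
    """Observes a critical system and reports on it; it never resets it."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = None
        self.last_value = None
        self.sequence = 0   # increments per report, showing the watchdog is alive

    def heartbeat(self, now, value):
        """Called by the critical system with a timestamp and its current output."""
        self.last_heartbeat = now
        self.last_value = value

    def report(self, now, reference_value, tolerance):
        """Return (watchdog sequence number, status of the monitored system)."""
        self.sequence += 1
        if self.last_heartbeat is None or now - self.last_heartbeat > self.timeout_s:
            status = MonitoredStatus.UNRESPONSIVE
        elif abs(self.last_value - reference_value) > tolerance:
            status = MonitoredStatus.MISMATCH
        else:
            status = MonitoredStatus.HEALTHY
        return self.sequence, status
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test. The `reference_value` stands in for whatever second source the cross-check uses.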

A critical system ideally should not be able to enter a non-recoverable state. In such a state it would likely be unable to return a status condition for the user to act on. This guideline suggests that implementers document:

All known non-recoverable states, each with a matching error status code (if able)
A matching Watchdog status condition code where applicable
How to recover from a recoverable state and return to the normal expected behaviour state
If a non-recoverable state has been entered, whether the implementation enters one or more operational but normally reduced functionality modes (a "limp home mode"), and whether it is able to report that the safety mode(s) are operating as expected.

The client would then be able to develop tests for those states and, where applicable, verify that their recovery processes work.

irudkin commented 7 years ago

New text to be entered:

A system should communicate all of its operational states to its client or monitoring system efficiently and effectively, to aid its development, testing regime and deployment. A system could communicate state by:

Using the interface’s or API functions’ parameters
Driving signals on connected hardware
Writing diagnostics to shared memory
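As a sketch of the shared memory option, the code below packs a small status record with a CRC32 so that a reader can detect a torn or corrupted write. The record layout, the field names and the use of a `bytearray` as a stand-in for a real shared memory region are all assumptions made for illustration.

```python
import struct
import zlib

# Hypothetical record layout, little endian:
# sequence number (uint32), state code (uint16), error code (uint16),
# followed by a CRC32 (uint32) computed over those three fields.
_PAYLOAD = struct.Struct("<IHH")
_CRC = struct.Struct("<I")
RECORD_SIZE = _PAYLOAD.size + _CRC.size


def write_diagnostics(buffer, sequence, state_code, error_code):
    """Write one status record at the start of the shared buffer."""
    payload = _PAYLOAD.pack(sequence, state_code, error_code)
    buffer[:RECORD_SIZE] = payload + _CRC.pack(zlib.crc32(payload))


def read_diagnostics(buffer):
    """Return (sequence, state_code, error_code), or None if the CRC fails."""
    payload = bytes(buffer[:_PAYLOAD.size])
    (stored_crc,) = _CRC.unpack_from(buffer, _PAYLOAD.size)
    if zlib.crc32(payload) != stored_crc:
        return None
    return _PAYLOAD.unpack(payload)
```

For example, after `shared = bytearray(RECORD_SIZE)` and `write_diagnostics(shared, 7, 2, 41)`, a monitoring component calling `read_diagnostics(shared)` gets the three fields back, and gets `None` if any payload byte has been altered.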

The critical safety level of the system will likely dictate how the system is partitioned from other components, which in turn will influence how an SC system communicates its state.

It is important that all states and error conditions are described discretely and not grouped into a general state or error condition. The system should be expected to provide information about itself at any time, in a deterministic and timely manner, in whatever operational mode it is currently in. For example:

normal running behaviour
a safety mode (a reduced functionality ‘limp home’ mode)
an indeterminate unrecoverable state
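A minimal sketch of discrete state reporting follows, assuming three operational modes like those in the example and a handful of invented error conditions. The enum members, the mapping and the `System` class are hypothetical. The point is that each error condition has its own code and that the status query is a fixed lookup that works in every mode.

```python
from enum import IntEnum, unique


@unique
class OperationalState(IntEnum):
    """Hypothetical operational modes, one value per mode."""
    NORMAL = 0
    LIMP_HOME = 1        # safety mode with reduced functionality
    UNRECOVERABLE = 2


@unique
class ErrorCondition(IntEnum):
    """Hypothetical error conditions, reported individually, never grouped."""
    NONE = 0
    SENSOR_TIMEOUT = 1
    SENSOR_OUT_OF_RANGE = 2
    MEMORY_CORRUPTION = 3


# Every error condition maps to exactly one resulting operational state.
STATE_FOR_ERROR = {
    ErrorCondition.NONE: OperationalState.NORMAL,
    ErrorCondition.SENSOR_TIMEOUT: OperationalState.LIMP_HOME,
    ErrorCondition.SENSOR_OUT_OF_RANGE: OperationalState.LIMP_HOME,
    ErrorCondition.MEMORY_CORRUPTION: OperationalState.UNRECOVERABLE,
}


class System:
    """Toy system that records its latest error condition."""

    def __init__(self):
        self._error = ErrorCondition.NONE

    def report_error(self, error):
        """Record a specific error condition (fault injection in tests)."""
        self._error = error

    def query_status(self):
        """Single dictionary lookup, with no I/O, available in every mode."""
        return STATE_FOR_ERROR[self._error], self._error
```

Returning the specific `ErrorCondition` alongside the state lets a client distinguish `SENSOR_TIMEOUT` from `SENSOR_OUT_OF_RANGE` even though both lead to the same limp home mode.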

For highly safety critical systems, a separate Watchdog component is likely to be deployed to oversee those systems and monitor their behaviour for abnormal events. The Watchdog then signals higher level systems to take appropriate corrective action.

A safety critical system’s API should be documented including:

normal behavioural states
all error states
all safety mode limiting behavioural states, how they are entered and whether they are recoverable
all non-deterministic, non-recoverable states and how they are entered

Ideally a safety critical system should not have any non-deterministic, non-recoverable operational states.

irudkin commented 7 years ago

New text entered. Needs review.

KhronosGroup / KSCAF_DocGuidelines

Non recoverable error conditions and reporting implementation status #16