Fault Tolerance

A system is an entity with a well-defined behavior in terms of the output it produces, which is a function of the input it receives, the passage of time, and its internal logic. By “well-defined behavior” we mean that the output produced by the system is previously agreed upon and unambiguously distinguishable from output that does not qualify as well-defined behavior. The well-defined behavior of a system is called the system specification. A system interacts with its environment by receiving input from it and delivering output to it. It may be possible to decompose a system into constituent (sub)systems. In component-based software engineering (CBSE) terms, a system is a component that may consist of an assembly of smaller components. In OO terms, a system is a composition of objects, each of which may itself be a composition of smaller objects.

A failure is said to occur in a system when the system’s environment observes an output from the system that does not conform to its specification. An error is the part of the system, e.g. one of its constituent (sub)systems, which is liable to lead to a failure. A fault is the adjudged cause of an error and may itself be the result of a failure. Hence, a fault causes an error that produces a failure, which may subsequently result in a fault, and so on.

Consider the following example. A software bug in an application is a fault. It leads to an error when execution reaches the point affected by the bug, which in turn makes the application crash, which is a failure. By crashing, the application leaves the socket ports it used blocked, which is a fault. The computer on which the application crashed now has socket ports that are not used by any process yet are not accessible to running applications, which is an error. This in turn leads to a failure when another application requests these ports.
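The fault, error, failure chain can be illustrated with a minimal sketch. The functions `average` and `report` and the "avg=<number>" specification are hypothetical, invented only for this illustration: the bug (fault) stays dormant until an input activates it, at which point the erroneous computation (error) surfaces to the caller as output that violates the specification (failure).

```python
def average(values):
    # Fault: the author forgot that `values` may be empty.
    return sum(values) / len(values)


def report(values):
    # Hypothetical specification: always return the string "avg=<number>".
    return f"avg={average(values)}"


# The fault is present but dormant: the output conforms to the specification.
ok = report([2, 4])

# The empty list activates the fault. The division by zero is the error,
# and the exception that reaches the caller instead of "avg=..." is the failure.
try:
    report([])
    failed = False
except ZeroDivisionError:
    failed = True
```

Here the caller plays the role of the environment: it is the one that observes the deviation from the specification.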

Based on the above, a fault in a system may propagate to the system's environment. A system is called fault tolerant when it can deal with faults and their consequent errors in such a way that it does not violate its specification, i.e. the environment of a fault tolerant system does not perceive a failure of the system. Hence, a fault tolerant system does not propagate faults to its environment. Fault tolerance techniques are practical methods that describe how to detect an error and confine it within a system. The confinement can be based on restoring the subsystem on which the error was detected before that error infects other parts of the system, or on masking the error occurrence (e.g. by isolating the subsystem on which the error was detected and using some form of redundancy to deliver the expected output).
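Masking through redundancy can be sketched as follows. This is only an illustration under simplifying assumptions: the names `masked_call`, `faulty_square`, and `square` are hypothetical, and a replica is assumed to signal its error by raising an exception, which real subsystems may not do.

```python
from typing import Callable, Sequence


def masked_call(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Return the output of the first replica that answers without raising."""
    for replica in replicas:
        try:
            return replica(x)
        except Exception:
            # Error detected: skip (isolate) this replica and try the next one,
            # so the caller never observes the failure.
            continue
    raise RuntimeError("all replicas failed; the error can no longer be masked")


def faulty_square(x: int) -> int:
    raise ZeroDivisionError("simulated bug")  # this replica always fails


def square(x: int) -> int:
    return x * x


# The environment receives the expected output despite the faulty replica.
result = masked_call([faulty_square, square], 4)
```

The caller sees only the correct result, so from the environment's point of view no failure occurred.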

In general terms, fault tolerance provides techniques to confront faults and their consequences in a system. These techniques describe the detection of errors in a system, and the means that ensure the recovery of a system from errors or the masking of errors in a system.

The three constituents of fault tolerance are error detection, error recovery, and error masking.

Principles Of Fault Tolerant System

• Constituents of a fault tolerant system monitor other constituents for failure occurrences. By observing a failure, the monitoring subsystem can detect an error on the monitored subsystem. These monitoring activities are often called error detection.
• In order to enable the restoration of a subsystem after an error has been detected on it, appropriate information regarding the subsystem may be saved when certain conditions are met (e.g. at regular time intervals, right after the subsystem delivers some output according to its specification, when the subsystem decides on its own to save the appropriate information, etc.). This saving activity is often called checkpointing. The information saved in a checkpointing activity may vary from a complete snapshot of the internal subsystem representation (i.e. the state of the subsystem) to selected pieces of its internal representation that have changed since the last checkpoint.
• When a monitoring subsystem observes a failure on a monitored subsystem, it may activate a mechanism that uses the last checkpoint of the latter subsystem to eliminate the error that led to the observed failure and restore the subsystem to an error-free state. These restoration activities are often called error recovery.
• In some cases, when a monitoring subsystem observes a failure on a monitored subsystem, it does not let the erroneous behavior of the latter subsystem affect any other parts of the overall system, by using some form of redundancy (e.g. a duplicate of the failed subsystem) to cover up for the observed failure. These activities are often called error masking.
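The detection, checkpointing, and recovery activities described above can be sketched together. The classes `Counter` and `Monitor` are hypothetical, and the sketch assumes that the monitored subsystem reports its failure by raising an exception and that a full deep copy of its state is an acceptable checkpoint.

```python
import copy


class Counter:
    """Hypothetical monitored subsystem; its state is a small dict."""

    def __init__(self):
        self.state = {"total": 0}

    def add(self, value):
        if value < 0:
            # Simulated fault: the state is corrupted before the failure shows.
            self.state["total"] = None
            raise ValueError("negative input")
        self.state["total"] += value


class Monitor:
    """Hypothetical monitoring subsystem for a Counter."""

    def __init__(self, subsystem):
        self.subsystem = subsystem
        self.checkpoint = copy.deepcopy(subsystem.state)

    def call(self, value):
        try:
            self.subsystem.add(value)
        except Exception:
            # Error detection: a failure of the monitored subsystem was observed.
            # Error recovery: restore the last checkpointed, error-free state.
            self.subsystem.state = copy.deepcopy(self.checkpoint)
            return False
        # Checkpointing: save a full snapshot after each correct output.
        self.checkpoint = copy.deepcopy(self.subsystem.state)
        return True


counter = Counter()
monitor = Monitor(counter)
ok1 = monitor.call(5)    # succeeds, checkpoint holds total == 5
ok2 = monitor.call(-1)   # fails, state is rolled back to the checkpoint
ok3 = monitor.call(2)    # succeeds on the restored state
final_total = counter.state["total"]
```

Taking a full snapshot after every successful call is the simplest policy; saving only what changed since the last checkpoint, as the text mentions, trades simplicity for lower overhead.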

Once the failure type and the unit of failure have been settled, the designer has a clear indication of which fault tolerance mechanisms to choose and where to apply them in the system in order to make it fault tolerant.

Failure Type

• Fail-stop failures, where the failed system ceases execution without producing any output and the failure is detectable by its environment.
• Crash failures, where the failed subsystem ceases execution without producing any output but the failure might not be detectable by its environment.
• Omission failures, where a subsystem fails to deliver output to (send omission), or receive input from (receive omission), its environment.
• Byzantine failures, where the failed subsystem exhibits arbitrary behavior.
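The four failure types can be told apart in a simulation where the ground truth is known. The sketch below is only an illustrative heuristic: `classify` and its flags `halted` and `halt_detectable` are hypothetical, and a real environment often cannot observe them, which is why a crash can be hard to distinguish from an omission in practice. A deviating output is treated here as one instance of arbitrary (byzantine) behavior.

```python
from enum import Enum
from typing import Optional


class FailureType(Enum):
    FAIL_STOP = "fail-stop"
    CRASH = "crash"
    OMISSION = "omission"
    BYZANTINE = "byzantine"


def classify(output: Optional[str], expected: str,
             halted: bool, halt_detectable: bool) -> Optional[FailureType]:
    """Map one observation, plus ground-truth flags, to a failure type."""
    if output is not None:
        # Output was delivered: either it conforms, or the behavior is arbitrary.
        return None if output == expected else FailureType.BYZANTINE
    if not halted:
        # Still executing but nothing was delivered: a send omission.
        return FailureType.OMISSION
    # Execution ceased without output; detectability separates the two cases.
    return FailureType.FAIL_STOP if halt_detectable else FailureType.CRASH


no_failure = classify("42", "42", halted=False, halt_detectable=False)
wrong_output = classify("41", "42", halted=False, halt_detectable=False)
silent_running = classify(None, "42", halted=False, halt_detectable=False)
announced_halt = classify(None, "42", halted=True, halt_detectable=True)
silent_halt = classify(None, "42", halted=True, halt_detectable=False)
```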

Failure Unit

The unit of failure is the minimum part of the system (i.e. the minimum subsystem) where an error will be confined.

FAULT TOLERANCE PATTERNS

[x] Fail-Stop Processor
[ ] Acknowledgment
[ ] I Am Alive
[ ] Are You Alive
[ ] Roll Forward
[ ] Rollback
[ ] Passive Replication
[ ] Semi-Passive Replication
[ ] Semi-Active Replication
[ ] Active Replication

Resource:

https://hillside.net/europlop/HillsideEurope/Papers/EuroPLoP2002/2002_Saridakis_ASystemOfPatternsForFaultTolerance.pdf
[aviz:95] A. Avizienis, “Building dependable systems: How to keep up with complexity”, Twenty-Fifth Fault-Tolerant Computing Symposium Special Issue, pp. 4–14, June 1995. (Historical Interest).
[siew:95] D. P. Siewiorek, “Niche successes to ubiquitous invisibility: Fault-tolerant computing past, present and future”, Twenty-Fifth Fault-Tolerant Computing Symposium Special Issue, pp. 26–33, June 1995. (Historical Interest).
[cris:91] F. Cristian, “Understanding fault-tolerant distributed systems”, Communications of the ACM, vol. 34, no. 2, February 1991. (Historical Interest).
[aviz:04] A. Avizienis, J. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing”, IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, January 2004. (Taxonomy and Definitions).
[abra:86] J. A. Abraham and W. K. Fuchs, “Fault and error models for VLSI”, Proceedings of the IEEE, vol. 74, no. 5, pp. 639–654, May 1986. (Fault Models and Historical Interest).
[mull:93] V. Hadzilacos and S. Toueg, “Fault tolerant broadcasts and related problems”, In Distributed Systems, S. Mullender, editor, pp. 100–102, Addison Wesley, 2nd edition, 1993. (Definition).
[kala:13] R. Kalayappan and S. R. Sarangi, “A survey of checker architectures”, ACM Computing Surveys, vol. 45, no. 4, pp. 48:1–48:34, August 2013. (Fault Definitions and Architecture Level Techniques).
[goel:81] P. Goel, “An implicit enumeration algorithm to generate tests for combinational logic circuits”, IEEE Transactions on Computers, vol. C-30, no. 3, pp. 215–222, March 1981. (Test Generation).
[duarte:11] E. P. Duarte, R. P. Ziwich, and L. Albini, “A survey of comparison-based system-level diagnosis”, ACM Computing Surveys, vol. 43, no. 3, pp. 22:1–22:56, April 2011. (System Level Diagnosis).
[siew:92] C. L. Chen and M. Y. Hsiao, “Error-correcting codes for semiconductor memory applications: A state of the art review”, In Reliable Computer Systems - Design and Evaluation, D. P. Siewiorek and R. S. Swarz, editors, pp. 771–786, Digital Press, 2nd edition, 1992. (ECC).
[lu:13] S. Lu, H. Jheng, M. Hashizume, J. Huang, and P. Ning, “Fault scrambling techniques for yield enhancement of embedded memories”, Proceedings of IEEE Asian Test Symposium, pp. 215–220, November 2013. (Memory Reconfiguration).
[mahm:88] A. Mahmood and E. J. McCluskey, “Concurrent error detection using watchdog processors - A survey”, IEEE Transactions on Computers, vol. C-37, no. 2, pp. 160–174, February 1988. (Watchdog and Historical).
[rotenberg:99] E. Rotenberg, “AR-SMT: A microarchitecture approach to fault tolerance in microprocessors”, Proceedings of IEEE Fault-Tolerant Computing Symposium, pp. 84–91, June 1999. (Architecture Level).
[rashid:00] F. Rashid, K. K. Saluja, and P. Ramanathan, “Fault tolerance through re-execution in multiscalar architectures”, Proceedings of IEEE International Conference on Dependable Systems and Networks, also known as FTCS-30, pp. 482–491, June 2000. (Architecture Level).
[subra:10] P. Subramanyan, V. Singh, K. K. Saluja, and E. Larsson, “Energy efficient fault tolerance in chip multiprocessors using critical value forwarding”, Proceedings of IEEE International Conference on Dependable Systems and Networks, pp. 121–130, June 2010. (Architecture Level).
[aggarwal:07] N. Aggarwal, P. Ranganathan, N. Jouppi, and J. Smith, “Isolation in commodity multicore processors”, IEEE Computer, vol. 40, no. 6, pp. 49–59, June 2007. (Architecture Level).
[brooks:96] R. R. Brooks and S. S. Iyengar, “Robust distributed computing and sensing algorithm”, IEEE Computer, vol. 29, no. 6, pp. 53–60, June 1996. (Sensor Networks).
[clouqueur:04] T. Clouqueur, K. K. Saluja, and P. Ramanathan, “Fault tolerance in collaborative sensor networks for target detection”, IEEE Transactions on Computers, vol. 53, no. 3, pp. 320–333, March 2004. (Sensor Networks).
[lapr:90] J.-C. Laprie, J. Arlat, C. Beounes, and K. Kanoun, “Definition and analysis of hardware and software fault-tolerant architectures”, IEEE Computer, vol. 23, no. 7, pp. 39–51, July 1990. (Historical).
[gray:90] J. Gray, “A census of Tandem system availability, 1985-1990”, IEEE Transactions on Reliability, vol. 39, no. 4, pp. 409–418, October 1990. (Historical).

Tool/Library

Fault Tolerance Strategy via isolation from external dependencies so that one may not affect the other

Thread Based Isolation

Semaphore Based Isolation

Resource