anitsh / til

Today I Learn (til) - Github `Issues` used as daily learning management system for taking notes and storing resource links.
https://anitshrestha.com.np
MIT License

Fault Tolerance #498

Open anitsh opened 3 years ago

anitsh commented 3 years ago

Fault Tolerance

A system is an entity with well-defined behavior in terms of the output it produces, which is a function of the input it receives, the passage of time, and its internal logic. By “well-defined behavior” we mean that the output produced by the system is previously agreed upon and unambiguously distinguishable from output that does not qualify as well-defined behavior. The well-defined behavior of a system is called the system specification. A system interacts with its environment by receiving input from it and delivering output to it. It may be possible to decompose a system into constituent (sub)systems. In component-based software engineering (CBSE) terms, a system is a component that may consist of an assembly of smaller components. In OO terms, a system is a composition of objects, each of which may itself be a composition of smaller objects.

A failure is said to occur in a system when the system’s environment observes an output from the system that does not conform to its specification. An error is the part of the system, e.g. one of its constituent (sub)systems, which is liable to lead to a failure. A fault is the adjudged cause of an error and may itself be the result of a failure. Hence, a fault causes an error that produces a failure, which may subsequently result in another fault, and so on. Consider the following example. A software bug in an application is a fault. It leads to an error when execution reaches the point affected by the bug, which in turn makes the application crash, which is a failure. By crashing, the application leaves the socket ports it used blocked, which is a fault. The computer on which the application crashed now has socket ports that are not used by any process but are still inaccessible to running applications, which is an error. This in turn leads to a failure when another application requests these ports.

Based on the above, a fault in a system may propagate to the system's environment. A system is called fault tolerant when it can deal with faults and their consequent errors in such a way that it does not violate its specification, i.e. the environment of a fault tolerant system does not perceive a failure of the system. Hence, a fault tolerant system does not propagate faults to its environment. Fault tolerance techniques are practical methods that describe how to detect an error and confine it within a system. The confinement can be based on the restoration of the subsystem on which the error was detected before that error infects other parts of the system, or it can be based on the masking of the error occurrence (e.g. by isolating the subsystem on which the error was detected and using some form of redundancy to deliver the expected output).
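Masking through redundancy can be sketched in a few lines. This is a minimal illustration and not taken from the source: the names `call_with_masking`, `primary` and `backup` are hypothetical, and an exception raised by a replica stands in for a detected error.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_masking(replicas: Sequence[Callable[[], T]]) -> T:
    """Return the output of the first replica that succeeds.

    The caller (the environment) only observes a failure if every
    replica fails; otherwise the error is masked.
    """
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # Error detected: set this replica aside and try the next one.
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def primary() -> int:
    # Hypothetical subsystem that is currently faulty.
    raise ConnectionError("primary subsystem is down")


def backup() -> int:
    # Hypothetical duplicate that delivers the expected output.
    return 42


result = call_with_masking([primary, backup])
print(result)
```

The caller receives the expected output from `backup` and never sees the `ConnectionError` raised by `primary`.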

In general terms, fault tolerance provides techniques to confront faults and their consequences in a system. These techniques describe the detection of errors in a system, and the means that ensure the recovery of a system from errors or the masking of errors in a system.

Three constituents of fault tolerance are error detection, recovery and masking.

Principles Of Fault Tolerant System

• Constituents of a fault tolerant system monitor other constituents for failure occurrences. By observing a failure, the monitoring subsystem can detect an error on the monitored subsystem. These monitoring activities are often called error detection.
• In order to enable the restoration of a subsystem after an error has been detected on it, appropriate information regarding the subsystem may be saved when certain conditions are met (e.g. at regular time intervals, right after the subsystem delivers some output according to its specification, when the subsystem decides on its own to save the appropriate information, etc.). This saving activity is often called checkpointing. The information saved in a checkpointing activity may vary from a complete snapshot of the internal subsystem representation (i.e. the state of the subsystem) to selected pieces of its internal representation that have changed since the last checkpoint.
• When a monitoring subsystem observes a failure on a monitored subsystem, it may activate a mechanism that uses the last checkpoint of the latter subsystem to eliminate the error that led to the observed failure and restore the subsystem to an error-free state. These restoration activities are often called error recovery.
• In some cases, when a monitoring subsystem observes a failure on a monitored subsystem, it does not let the erroneous behavior of the latter subsystem affect any other parts of the overall system, using some form of redundancy (e.g. a duplicate of the failed subsystem) to cover up for the observed failure. These activities are often called error masking.
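The interplay of error detection, checkpointing and error recovery described above can be sketched as follows. Everything here is an assumption made for illustration: the `Counter` subsystem, the `Monitor` class, and the choice to take a full-state checkpoint after every successful operation.

```python
import copy


class Counter:
    """Hypothetical monitored subsystem; its state is a single dict."""

    def __init__(self) -> None:
        self.state = {"total": 0}

    def add(self, amount: int) -> None:
        if amount < 0:
            # Simulated fault: the state is corrupted before the call fails.
            self.state["total"] = None
            raise ValueError("negative amount")
        self.state["total"] += amount


class Monitor:
    """Hypothetical monitoring subsystem that checkpoints and recovers."""

    def __init__(self, subsystem: Counter) -> None:
        self.subsystem = subsystem
        self.checkpoint = copy.deepcopy(subsystem.state)

    def run(self, amount: int) -> bool:
        try:
            self.subsystem.add(amount)
        except Exception:
            # Error detection: a failure of the monitored subsystem was observed.
            # Error recovery: restore the last error-free state.
            self.subsystem.state = copy.deepcopy(self.checkpoint)
            return False
        # Checkpointing: save a full snapshot after a successful operation.
        self.checkpoint = copy.deepcopy(self.subsystem.state)
        return True


counter = Counter()
monitor = Monitor(counter)
ok_first = monitor.run(5)    # succeeds, checkpoint holds total 5
ok_second = monitor.run(-1)  # fails, state is restored from the checkpoint
ok_third = monitor.run(2)    # succeeds on the restored state
print(counter.state)
```

A full snapshot is the simplest checkpoint; saving only the pieces of state that changed since the last checkpoint trades simplicity for lower cost.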

Once the failure type and the unit of failure are settled, the designer has a clear indication of which fault tolerance mechanisms to choose and where to apply them in the system in order to make it fault tolerant.

Failure Type

• fail-stop failures, where the failed system ceases execution without producing any output and the failure is detectable by its environment,
• crash failures, where the failed subsystem ceases execution without producing any output but the failure might not be detectable by its environment,
• omission failures, where a subsystem fails to deliver output to (send omission), or receive input from (receive omission) its environment, and
• byzantine failures, where the failed subsystem exhibits arbitrary behavior.

Failure Unit

The unit of failure is the minimum part of the system (i.e. the minimum subsystem) where an error will be confined.

FAULT TOLERANCE PATTERNS

Resource:

Tool/Library

anitsh commented 3 years ago

Fault tolerance strategy: isolate the application from its external dependencies so that a failure in one does not affect the other.

Thread Based Isolation

In this isolation method, the external call is executed on a different thread (not the application thread), so that the application thread is not affected by anything that goes wrong with the external call.
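A rough sketch of thread-based isolation using Python's standard `concurrent.futures` module. The helper `call_isolated`, the pool size, the timeouts and the fallback value are assumptions chosen for illustration, not part of any particular library's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Dedicated pool: external calls never run on the application thread.
_pool = ThreadPoolExecutor(max_workers=4)


def call_isolated(fn, timeout_s, fallback):
    """Run fn on a pool thread; return fallback on timeout or error."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a running
        # task keeps its pool thread until it finishes.
        future.cancel()
        return fallback
    except Exception:
        # The external call raised; the application thread is unaffected.
        return fallback


def fast_dependency():
    return "real"


def slow_dependency():
    time.sleep(0.3)  # simulated slow external service
    return "real"


fast = call_isolated(fast_dependency, timeout_s=0.2, fallback="fallback")
slow = call_isolated(slow_dependency, timeout_s=0.05, fallback="fallback")
print(fast, slow)
```

The application thread waits at most `timeout_s` for a result, at the cost of extra threads and a hand-off between threads for every call.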

Semaphore Based Isolation

In this method, the external call runs on the application thread, and the number of concurrent calls is limited by the semaphore count defined in the configuration.
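A minimal sketch of semaphore-based isolation using `threading.BoundedSemaphore`. The class name `SemaphoreIsolation`, the `max_concurrent` parameter (standing in for the configured semaphore count) and the fallback value are hypothetical. Rejecting a call immediately when no permit is free is one possible policy, assumed here.

```python
import threading


class SemaphoreIsolation:
    """Limit concurrent external calls made on the caller's own thread."""

    def __init__(self, max_concurrent: int) -> None:
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        # Non-blocking acquire: reject at once when the limit is reached.
        if not self._permits.acquire(blocking=False):
            return fallback
        try:
            # fn runs on the calling thread; there is no timeout, and
            # exceptions raised by fn propagate to the caller.
            return fn()
        finally:
            self._permits.release()


guard = SemaphoreIsolation(max_concurrent=1)


def dependency():
    # While this call holds the only permit, a second call is rejected.
    nested = guard.call(lambda: "real", fallback="rejected")
    return f"outer ok, nested {nested}"


outcome = guard.call(dependency, fallback="rejected")
after = guard.call(lambda: "real", fallback="rejected")
print(outcome, after)
```

Compared with thread-based isolation this avoids the extra threads, but a slow external call still blocks the application thread that made it.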

Resource