anitsh / til

Today I Learned (TIL): GitHub `Issues` used as a daily learning management system for taking notes and storing resource links.
https://anitshrestha.com.np
MIT License

Observability, Monitoring, Analysis, Control #493

Open anitsh opened 3 years ago

anitsh commented 3 years ago

Scientific inquiry starts with observation. The more one can see, the more one can investigate - Martin Chalfie

If you are observable, I can understand you.

You can only monitor a system that’s observable.

Observability has long been a part of scientific process improvement. In fact, it is a natural process. For example, when we watch a football game, we understand why one team lost and the other won. A typical football fan just watches the game and does not store every detail, since it is not needed for a retrospective, but the manager and the coaching team keep records. The best source of data would be a video recording, along with other metrics for specific players. So observing is something we do naturally; let's not feel overwhelmed as if we had invented fire.


Although we observe everything in our daily lives, we do not need to keep a record of every piece of data (except perhaps our bank accounts). Businesses and systems, however, keep specific (or all) records as needed, and these records are watched constantly to improve the process.

Lean and Six Sigma are sets of techniques for process improvement. In fact, lean management and Six Sigma share similar methodologies and tools, including the fact that both were influenced by Japanese business culture. However, lean management primarily focuses on eliminating waste through tools that target organizational efficiencies while integrating a performance improvement system, while Six Sigma focuses on eliminating defects and reducing variation. Both systems are driven by data, though Six Sigma is much more dependent on accurate data. So I believe Six Sigma would be the more appropriate of the two for improving observability.

Six Sigma projects follow two project methodologies: DMAIC and DMADV.

I think DMAIC would be more useful for observability. The DMAIC project methodology has five phases: Define, Measure, Analyze, Improve, and Control.

Some organizations add a Recognize step at the beginning, which is to recognize the right problem to work on, thus yielding an RDMAIC methodology.

OBSERVE EVERYTHING

Monitoring of applications has changed significantly, with applications now split across many microservices and clusters. When an organization grows to hundreds or thousands of containers, it is no longer possible to monitor individual components using legacy systems. Through programmatic observability of microservice application logs, attackers scanning or trying to exploit applications can be identified more easily.
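As a minimal sketch of what programmatic inspection of application logs could look like, the Python snippet below counts requests per client that hit paths commonly probed by scanners. The log line format, the `SUSPICIOUS_PATHS` list, the threshold, and the function name `find_scanners` are all assumptions made for illustration and are not taken from any particular tool.

```python
import re
from collections import Counter

# Hypothetical access-log line format: "<client_ip> <method> <path> <status>"
LOG_LINE = re.compile(
    r"^(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)

# Illustrative list of paths that scanners often probe (an assumption, not exhaustive).
SUSPICIOUS_PATHS = ("/.env", "/wp-login.php", "/.git/config", "/phpmyadmin")


def find_scanners(lines, threshold=3):
    """Return clients with at least `threshold` requests to suspicious paths."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["path"].startswith(SUSPICIOUS_PATHS):
            hits[match["ip"]] += 1
    return {ip: count for ip, count in hits.items() if count >= threshold}


sample_logs = [
    "10.0.0.5 GET /index.html 200",
    "203.0.113.9 GET /.env 404",
    "203.0.113.9 GET /wp-login.php 404",
    "203.0.113.9 GET /.git/config 404",
    "10.0.0.7 POST /api/orders 201",
]
print(find_scanners(sample_logs))
```

In a real deployment the same idea would usually run inside a log pipeline rather than over an in-memory list, and the matching rules would need tuning to avoid false positives.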


Observability is the ability to infer internal states of a system based on the system’s external outputs. In control theory, observability is a mathematical dual (follows a direct conceptual mapping) to controllability, which is the ability to control internal states of a system by manipulating external inputs. In practice, however, controllability is difficult to evaluate mathematically; therefore, system observability is the method for evaluating outputs to reach meaningful conclusions about internal states of the system.
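For the linear time-invariant case, the definition above has a standard algebraic form (the Kalman rank condition). The symbols used below ($A$, $B$, $C$, $n$, $\mathcal{O}$) do not appear in the original note and are introduced only for illustration.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
(A, C) \text{ is observable} \iff \operatorname{rank}(\mathcal{O}) = n

(A, C) \text{ is observable} \iff (A^{\mathsf{T}}, C^{\mathsf{T}}) \text{ is controllable}
```

The last line is the duality mentioned above: checking observability of a system is the same computation as checking controllability of its transposed counterpart.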

Monitoring is defined as the actions involved in observability: observing the quality of system performance over a period of time. The monitoring action, supported by tools and processes, can describe the performance, health, and relevant characteristics of a system’s internal states. In enterprise IT, monitoring refers specifically to the process of translating infrastructure log and metrics data into meaningful and actionable insights.

Monitoring and observability are in a symbiotic relationship: observability is achieved when data is made available from within the system that you wish to monitor, and monitoring is the actual task of collecting and displaying this data. A system monitor is a hardware or software component used to monitor system resources and performance in a computer system. Management issues around the use of system monitoring tools include resource usage and privacy.

Observability and monitoring complement each other, with each one serving a different purpose. Monitoring tells you when something is wrong, while observability enables you to understand why. Monitoring is a subset of and key action for observability.

Monitoring tracks the overall health of an application. It aggregates data on how the system is performing in terms of access speeds, connectivity, downtime, and bottlenecks. Observability, on the other hand, drills down into the “what” and “why” of application operations, by providing granular and contextual insight into its specific failure modes.

While monitoring provides answers only for known problems or occurrences, software instrumented for observability allows developers to ask new questions in order to debug a problem or gain insight into the general state of what is typically a dynamic system with changing complexities and unknown permutations.
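The difference can be sketched in a few lines of Python. The event fields, the `ALERT_THRESHOLD` value, and the function names `error_rate` and `errors_by` below are assumptions chosen for illustration. The first function answers a question fixed in advance (monitoring); the second lets you slice failures by any field you think of while debugging (observability).

```python
from collections import Counter

# Hypothetical structured request events; field names are assumptions.
events = [
    {"route": "/checkout", "region": "eu", "version": "1.4.2", "status": 500},
    {"route": "/checkout", "region": "eu", "version": "1.4.2", "status": 500},
    {"route": "/checkout", "region": "us", "version": "1.4.1", "status": 200},
    {"route": "/search", "region": "eu", "version": "1.4.2", "status": 200},
    {"route": "/search", "region": "us", "version": "1.4.1", "status": 200},
]


def error_rate(events):
    """Monitoring: a predefined question checked against a known threshold."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)


def errors_by(events, field):
    """Observability: group failures by any field, chosen at debug time."""
    return Counter(e[field] for e in events if e["status"] >= 500)


ALERT_THRESHOLD = 0.25  # illustrative value
rate = error_rate(events)
if rate > ALERT_THRESHOLD:
    print(f"ALERT: error rate {rate:.0%}")  # tells you that something is wrong
    for field in ("route", "region", "version"):
        print(field, dict(errors_by(events, field)))  # helps explain why
```

The second function only works because each event carries rich context; that is what instrumenting for observability adds on top of a plain error counter.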

There are three main pillars of observability: logs, metrics, and traces.
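As a rough sketch of how logs, metrics, and traces (the three pillars as they are usually listed) differ in shape, the snippet below has one hypothetical request handler emit one of each into in-memory structures. The names `handle_request`, `logs`, `metrics`, and `spans` are invented for illustration; a real system would send this data to dedicated backends.

```python
import time
import uuid
from collections import Counter

logs = []            # discrete, timestamped event records
metrics = Counter()  # numeric aggregates, keyed by (metric name, route)
spans = []           # timed units of work, tied together by a trace id


def handle_request(route, trace_id=None):
    """Hypothetical handler that emits one log, one metric and one span."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    logs.append({
        "ts": time.time(),
        "trace_id": trace_id,
        "level": "INFO",
        "msg": f"handling {route}",
    })
    metrics[("requests_total", route)] += 1

    duration = time.perf_counter() - start
    spans.append({"trace_id": trace_id, "name": route, "duration_s": duration})
    return trace_id


tid = handle_request("/orders")
handle_request("/orders")
print(metrics)
print([s for s in spans if s["trace_id"] == tid])
```

The shared `trace_id` is what lets a log line and a span from the same request be correlated later.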

Resource

Reference

#408 #186 #393

Tools

#385 #246 #368 #18

Popular tools

Prometheus focuses primarily on gathering time-series data, Grafana on monitoring and reporting time-series metrics, Kibana on log search, and Jaeger on root-causing specific issues with a service mesh or with dependencies.
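Prometheus gathers its time-series data by scraping a plain-text exposition format from each target. The sketch below renders a counter in that format using only the standard library; it does not use the official Prometheus client, and the function name `render_counter`, the metric name, and the labels are assumptions for illustration.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are not escaped here, so this is only safe for simple strings.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("method", "GET"), ("code", "200")): 1027,
        (("method", "POST"), ("code", "500")): 3,
    },
)
print(text)
```

In practice a client library would handle escaping, registration, and serving this text over HTTP; the point here is only what the scraped data looks like.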

The two caveats are the level of expertise required to build a solution with these open-source products and the fact that it will be very much a DIY experience.

Resource

anitsh commented 3 years ago

Metrics

Factors that can affect what you choose to collect and act on are:

Host-Based Metrics

Application Metrics

Network and Connectivity Metrics

Server Pool Metrics

When dealing with horizontally scaled infrastructure, another layer of infrastructure you will need to add metrics for is pools of servers. While metrics about individual servers are useful, at scale a service is better represented as the ability of a collection of machines to perform work and respond adequately to requests. This type of metric is in many ways just a higher level extrapolation of application and server metrics, but the resources in this case are homogeneous servers instead of machine-level components. Some data you might want to track are:

Collecting data that summarizes the health of collections of servers is important for understanding the actual capabilities of your system to handle load and respond to changes.
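A pool-level summary of this kind can be sketched as a simple aggregation over per-server snapshots. The field names (`healthy`, `cpu_pct`, `rps`), the sample values, and the function name `summarize_pool` are assumptions for illustration, not a prescribed set of metrics.

```python
from statistics import mean

# Hypothetical per-server snapshots; field names and values are illustrative.
servers = [
    {"name": "web-1", "healthy": True, "cpu_pct": 62.0, "rps": 410},
    {"name": "web-2", "healthy": True, "cpu_pct": 71.0, "rps": 455},
    {"name": "web-3", "healthy": False, "cpu_pct": 0.0, "rps": 0},
    {"name": "web-4", "healthy": True, "cpu_pct": 80.0, "rps": 500},
]


def summarize_pool(servers):
    """Collapse per-server metrics into one view of the pool's capacity."""
    healthy = [s for s in servers if s["healthy"]]
    return {
        "size": len(servers),
        "healthy": len(healthy),
        "degraded": [s["name"] for s in servers if not s["healthy"]],
        # Average only over healthy servers so a dead node does not hide load.
        "avg_cpu_pct": mean(s["cpu_pct"] for s in healthy) if healthy else 0.0,
        "total_rps": sum(s["rps"] for s in healthy),
    }


summary = summarize_pool(servers)
print(summary)
```

Averaging CPU over healthy servers only is a design choice: including the failed node would make the pool look less loaded than the machines actually serving traffic are.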

External Dependency Metrics

Common Terminologies

The following are a few of the many potential metrics to draw from for an overall picture of system health:

Admins monitor server health through the following parameters: