Scientific inquiry starts with observation. The more one can see, the more one can investigate - Martin Chalfie

If you are observable, I can understand you.

You can only monitor a system that’s observable.

Observability has been part of scientific process improvement for ages. In fact, it is a natural process. For example, when we watch a football game, we understand why one team lost and the other won. A casual football fan just watches the game and does not store every detail, because it is not needed for a retrospective, but the manager and the coaching team keep records. The best source of data would be a video recording combined with other metrics for specific players. So observing is really something we do naturally. Let's not feel as if we invented fire.

Although we observe everything in our daily lives, we do not need to keep a record of every piece of data (except, probably, for our bank accounts). Businesses and systems, however, keep specific (or all) records as needed. These records are watched constantly to improve the process.

Lean and Six Sigma are sets of techniques for process improvement. In fact, lean management and Six Sigma share similar methodologies and tools, including the fact that both were influenced by Japanese business culture. However, lean management primarily focuses on eliminating waste through tools that target organizational efficiencies while integrating a performance improvement system, while Six Sigma focuses on eliminating defects and reducing variation. Both systems are driven by data, though Six Sigma is much more dependent on accurate data. So I believe Six Sigma would be the more appropriate of the two for improving observability.

Six Sigma projects follow two project methodologies:

DMAIC is used for projects aimed at improving an existing business process
DMADV is used for projects aimed at creating new product or process designs

I think DMAIC would be more useful for observability. The DMAIC project methodology has five phases:

Define the system, the voice of the customer and their requirements, and the project goals, specifically.
Measure key aspects of the current process and collect relevant data; calculate the "as-is" process capability
Analyze the data to investigate and verify cause and effect. Determine what the relationships are, and attempt to ensure that all factors have been considered. Seek out the root cause of the defect under investigation.
Improve or optimize the current process based upon data analysis using techniques such as design of experiments, poka yoke or mistake proofing, and standard work to create a new, future state process. Set up pilot runs to establish process capability.
Control the future state process to ensure that any deviations from the target are corrected before they result in defects. Implement control systems such as statistical process control, production boards, visual workplaces, and continuously monitor the process. This process is repeated until the desired quality level is obtained.
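
The Control phase mentions statistical process control. As a rough illustration, a minimal sketch of a control-limit check might look like the following. The three-sigma limits, the function names, and the sample response times are assumptions made for this example and do not come from a specific Six Sigma tool.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from baseline samples."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(samples, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = limits
    return [(i, v) for i, v in enumerate(samples) if v < lower or v > upper]

# Hypothetical response times in milliseconds from a stable period.
baseline_ms = [102, 98, 101, 99, 100, 103, 97, 100]
limits = control_limits(baseline_ms)

# New measurements are compared against the limits from the baseline.
violations = out_of_control([101, 99, 140, 100], limits)
print(limits, violations)
```

Points outside the limits are candidates for the Analyze step. Points inside them are treated as normal variation.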

Some organizations add a Recognize step at the beginning, which is to recognize the right problem to work on, thus yielding an RDMAIC methodology.

OBSERVE EVERYTHING

Monitoring applications has significantly changed, with applications now split across many microservices and clusters. When an organization grows to hundreds or thousands of containers, it is no longer possible to monitor individual components using legacy systems. Through programmatic observability of the microservice application logs, attackers scanning or trying to exploit applications can be identified easily.

Whitepaper "Adopting a DevSecOps Approach for Modern Apps" by VMware

Observability is the ability to infer internal states of a system based on the system’s external outputs. In control theory, observability is a mathematical dual (follows a direct conceptual mapping) to controllability, which is the ability to control internal states of a system by manipulating external inputs. In practice, however, controllability is difficult to evaluate mathematically; therefore, system observability is the method for evaluating outputs to reach meaningful conclusions about internal states of the system.
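
For a linear time-invariant system, the control-theory definition above has a standard textbook test. It is sketched here for illustration only. The symbols A, B, C, x, u, y and the state dimension n are the usual state-space notation and are not taken from the original text.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)

\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In words: if the outputs and their successive derivatives carry enough independent information, the full internal state x can be reconstructed from y alone.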

Monitoring is defined as the actions involved in observability: observing the quality of system performance over a time duration. The monitoring action, which tools and processes support, can describe the performance, health, and relevant characteristics of a system’s internal states. In enterprise IT, monitoring refers specifically to the process of translating infrastructure log metrics data into meaningful and actionable insights.

Monitoring and observability are in a symbiotic relationship and observability is achieved when data is made available from within the system that you wish to monitor. Monitoring is the actual task of collecting and displaying this data. A system monitor is a hardware or software component used to monitor system resources and performance in a computer system. Among the management issues regarding use of system monitoring tools are resource usage and privacy.

Observability and monitoring complement each other, with each one serving a different purpose. Monitoring tells you when something is wrong, while observability enables you to understand why. Monitoring is a subset of and key action for observability.

Monitoring tracks the overall health of an application. It aggregates data on how the system is performing in terms of access speeds, connectivity, downtime, and bottlenecks. Observability, on the other hand, drills down into the “what” and “why” of application operations, by providing granular and contextual insight into its specific failure modes.

While monitoring provides answers only for known problems or occurrences, software instrumented for observability allows developers to ask new questions in order to debug a problem or gain insight into the general state of what is typically a dynamic system with changing complexities and unknown permutations.
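
The difference between answering a known question and asking a new one can be sketched with a few structured events. This is only an illustration: the event fields, the 25% threshold, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical structured events emitted by an instrumented service.
events = [
    {"route": "/cart", "status": 200, "region": "eu", "version": "1.4", "ms": 35},
    {"route": "/cart", "status": 500, "region": "us", "version": "1.5", "ms": 900},
    {"route": "/pay", "status": 500, "region": "us", "version": "1.5", "ms": 870},
    {"route": "/pay", "status": 200, "region": "eu", "version": "1.4", "ms": 40},
    {"route": "/cart", "status": 200, "region": "us", "version": "1.4", "ms": 38},
]

def error_rate(evts):
    """Monitoring: a predefined question checked against a known threshold."""
    errors = sum(1 for e in evts if e["status"] >= 500)
    return errors / len(evts)

def errors_by(evts, field):
    """Observability: slice failures by any field, chosen after the fact."""
    return Counter(e[field] for e in evts if e["status"] >= 500)

alert = error_rate(events) > 0.25          # known problem: "too many errors"
by_version = errors_by(events, "version")  # new question: "which release?"
by_route = errors_by(events, "route")      # new question: "which endpoint?"
print(alert, dict(by_version), dict(by_route))
```

The alert only says that something is wrong. Because every event keeps its context, the same data can then be regrouped by version, region, or route without deploying new instrumentation.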

There are three main pillars to Observability:

Tracing #501
Metrics
Logging #520
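
To make the three pillars concrete, here is a minimal sketch that produces one of each signal for a single request using only the Python standard library. It is not how any particular tracing or metrics product works. The service name, the `span` helper, and the in-memory `metrics` and `spans` stores are assumptions made for this example.

```python
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")  # hypothetical service name

metrics = Counter()  # metrics: named counters
spans = []           # traces: finished spans, normally shipped to a backend

@contextmanager
def span(name, trace_id):
    """Record how long a named step took as part of one trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"trace_id": trace_id, "name": name,
                      "seconds": time.perf_counter() - start})

def handle_request(order_id):
    trace_id = uuid.uuid4().hex  # one id ties all signals for this request together
    with span("handle_request", trace_id):
        metrics["requests_total"] += 1
        with span("charge_card", trace_id):
            time.sleep(0.01)  # stand-in for real work
        log.info("order=%s trace_id=%s charged", order_id, trace_id)
    return trace_id

tid = handle_request("A-1")
```

Putting the trace id into the log line is the design choice that matters here: it lets a log entry be matched to the spans of the same request.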

Resource

Reference

408 #186 #393

Tools

385 #246 #368 #18

Popular tools

Prometheus focuses primarily on gathering time-series data, Grafana focuses on monitoring and reporting time-series metrics, Kibana focuses on log search, and Jaeger focuses on root-causing specific issues with a service mesh or with dependencies.

The main caveat is the level of expertise required when building a solution with these open-source products. It will be very much a DIY experience.

Resource

Metrics

Factors that can affect what you choose to collect and act on are:

Resources available for tracking: Depending on your human resources, infrastructure, and budget, you will have to limit the scope of what you keep track of to what you can afford to implement and reasonably manage.
The complexity and purpose of your application: The complexity of your application or systems can have a large impact on what you choose to track. Items that might be mission critical for some software might not be important at all in others.
The deployment environment: While robust monitoring is most important for production systems, staging and testing systems also benefit from monitoring, though there may be differences in severity, granularity, and the overall metrics measured.
The likelihood of the metric being useful: One of the most important factors affecting whether something is measured is its potential to help in the future. Each additional metric tracked increases the complexity of the system and takes up resources. The necessity of data can change over time as well, requiring reevaluation at regular intervals.
How essential stability is: Simply put, stability and uptime might not be priorities for certain types of personal or early stage projects.

Host-Based Metrics

CPU
Memory
Disk space
Processes
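
A few of these host-level values can be read with the Python standard library alone. The sketch below is a partial illustration: the `host_metrics` name and the chosen keys are hypothetical, and memory and per-process figures are left out because the standard library has no portable call for them (a third-party library such as psutil is commonly used for those).

```python
import os
import shutil

def host_metrics(path="/"):
    """Collect a small snapshot of host-level metrics."""
    disk = shutil.disk_usage(path)
    snapshot = {
        "cpu_count": os.cpu_count(),
        "disk_used_percent": 100 * disk.used / disk.total,
    }
    # os.getloadavg() exists only on Unix-like systems and can fail there too.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot

snapshot = host_metrics()
print(snapshot)
```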

Application Metrics

Error and success rates
Service failures and restarts
Performance and latency of responses
Resource usage
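
Error and success rates and response latency can be captured at the function level. The decorator below is one possible sketch. The `stats` dictionary, the `instrumented` name, and the `divide` handler are hypothetical, and a real service would usually export these values through a metrics library rather than keep them in memory.

```python
import time
from functools import wraps

stats = {"success": 0, "error": 0, "latencies": []}

def instrumented(func):
    """Count successes and errors and record latency for each call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            stats["error"] += 1
            raise  # the caller still sees the original failure
        else:
            stats["success"] += 1
            return result
        finally:
            stats["latencies"].append(time.perf_counter() - start)
    return wrapper

@instrumented
def divide(a, b):  # hypothetical request handler
    return a / b

divide(4, 2)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass

total = stats["success"] + stats["error"]
observed_error_rate = stats["error"] / total
print(observed_error_rate, len(stats["latencies"]))
```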

Network and Connectivity Metrics

Connectivity
Error rates and packet loss
Latency
Bandwidth utilization
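
Connectivity and latency can be probed from the outside by timing a TCP connection. The following is a sketch under assumptions: the function names are hypothetical, a failed connection attempt is only a coarse stand-in for error rate (it is not a packet-loss measurement), and bandwidth is not covered.

```python
import socket
import time

def tcp_connect_latency(host, port, timeout=2.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None

def probe(host, port, attempts=5):
    """Summarize connectivity, failure rate, and latency over several attempts."""
    results = [tcp_connect_latency(host, port) for _ in range(attempts)]
    ok = [r for r in results if r is not None]
    return {
        "reachable": bool(ok),
        "failure_rate": 1 - len(ok) / attempts,
        "avg_latency_s": sum(ok) / len(ok) if ok else None,
    }
```

Calling `probe` with the host and port of a service you operate returns a small dictionary that can be recorded as a time series.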

Server Pool Metrics

When dealing with horizontally scaled infrastructure, another layer of infrastructure you will need to add metrics for is pools of servers. While metrics about individual servers are useful, at scale a service is better represented as the ability of a collection of machines to perform work and respond adequately to requests. This type of metric is in many ways just a higher level extrapolation of application and server metrics, but the resources in this case are homogeneous servers instead of machine-level components. Some data you might want to track are:

Pooled resource usage
Scaling adjustment indicators
Degraded instances

Collecting data that summarizes the health of collections of servers is important for understanding the actual capabilities of your system to handle load and respond to changes.
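
The three pool-level items listed above can be derived from per-instance readings. This sketch assumes a hypothetical pool of four servers and arbitrary scaling thresholds of 75% and 30% CPU. Real autoscalers use their own rules.

```python
# Hypothetical per-instance readings for one pool of identical servers.
pool = [
    {"name": "web-1", "cpu": 0.62, "healthy": True},
    {"name": "web-2", "cpu": 0.71, "healthy": True},
    {"name": "web-3", "cpu": 0.95, "healthy": False},
    {"name": "web-4", "cpu": 0.68, "healthy": True},
]

def pool_summary(instances, scale_up_at=0.75, scale_down_at=0.30):
    """Summarize pooled usage, degraded instances, and a scaling hint."""
    healthy = [i for i in instances if i["healthy"]]
    degraded = [i["name"] for i in instances if not i["healthy"]]
    # Average only over instances that are actually serving traffic.
    avg_cpu = sum(i["cpu"] for i in healthy) / len(healthy) if healthy else None
    if avg_cpu is None or avg_cpu > scale_up_at:
        hint = "scale_up"
    elif avg_cpu < scale_down_at:
        hint = "scale_down"
    else:
        hint = "hold"
    return {"avg_cpu": avg_cpu, "degraded": degraded, "scaling_hint": hint}

summary = pool_summary(pool)
print(summary)
```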

External Dependency Metrics

Service status and availability
Success and error rates
Run rate and operational costs
Resource exhaustion

Common Terminologies

- Observability: Although not strictly defined, observability is a general term used to describe processes and techniques related to increasing awareness and visibility into systems. This can include monitoring, metrics, visualization, tracing, and log analysis.
- Resource: In the context of monitoring and software systems, a resource is any exhaustible or limited dependency. What is considered a resource can vary greatly based on part of the system being discussed.
- Latency: Latency is a measure of the time it takes to complete an action. Depending on the component, this can be a measure of processing, response, or travel time.
- Throughput: Throughput represents the maximum rate of processing or traversal that a system can handle. This can be dependent on software or hardware design. Often there is an important distinction between theoretical throughput and practical observed throughput.
- Performance: Performance is a general measure of how efficiently a system is completing work. Performance is an umbrella term that often encompasses work factors like throughput, latency, or resource consumption.
- Saturation: Saturation is a measure of the amount of capacity being used. Full saturation indicates that 100% of the capacity is currently in use.
- Visualization: Visualization is the process of presenting metrics data in a format that allows for quick, intuitive interpretation through graphs or charts.
- Log aggregation: Log aggregation is the act of compiling, organizing, and indexing log files to allow for easier management, searching, and analysis. While separate from monitoring, aggregated logs can be used in conjunction with the monitoring system to identify causes and investigate failures.
- Data point: A data point is a single measurement of a single metric.
- Data set: A data set is a collection of data points for a metric.
- Units: Units are the context for a measured value. A unit defines the magnitude, scope, or quantity of a measurement to understand extent and allow comparison.
- Percentage Units: Percentage units are measurements that are taken as a part of a finite whole. A percentage unit indicates how much a value is out of the total possible amount.
- Rate Units: Rate units indicate the magnitude of a metric over a constant period of time.
- Time series: Time series data is a series of data points that represent changes over time. Most metrics are best represented by a time series because single data points often represent a value at a specific time and the resulting series of points is used to show changes over time.
- Sampling rate: Sampling rate is a measurement of how often a representative data point is collected in lieu of continuous collection. A higher sampling rate more accurately represents the measured behavior, but requires more resources to handle the extra data points.
- Resolution: Resolution refers to the density of data points that make up a data set. Collections with higher resolutions over the same time frame indicate a higher sample rate and a more granular view of the same behavior.
- Instrumentation: Instrumentation is the ability to track the behavior and performance of software. This is accomplished by adding code and configuration to software to output data that can then be consumed by a monitoring system.
- The observer effect: The observer effect is the impact of the monitoring system itself on the phenomena being observed. Since monitoring takes up resources, the act of measuring behavior and performance will alter the values produced. Monitoring systems seek to avoid adding unnecessary overhead to minimize this impact.
- Over-monitoring: Over-monitoring occurs when the quantity of metrics and alerts configured is inversely related to their usefulness. Over-monitoring can cause stress on the infrastructure, make it difficult to find relevant data, and cause teams to lose trust in their monitoring and alerting systems.
- Alert fatigue: Alert fatigue is the human response of desensitivity that results from frequent, unreliable, or improperly prioritized alerts. Alert fatigue can cause operators to ignore severe problems and is usually an indication that alert conditions need to be reevaluated.
- Threshold: When alerting, a threshold is the boundary between acceptable and unacceptable values which triggers an alert if exceeded. Often alerts are configured to trigger when a value exceeds the threshold for a certain period of time, in order to avoid sending an alert for temporary spikes.
- Quantile: A quantile is a dividing point used to separate a dataset into distinct groups based on their values. Quantiles are used to put values into “buckets” that represent segments of a population of data. Often, this is used to separate common values from outliers to better understand what constitutes representative and extreme cases.
- Trend: A trend is the general direction that a set of values is indicating. Trends are more reliable than single values in determining the general state of the component being tracked.
- White-box monitoring: White-box monitoring is a term used to describe monitoring that relies on access to internal state of the components being measured. White-box monitoring can provide a detailed understanding of system state and is helpful for identifying causes of problems.
- Black-box monitoring: Black-box monitoring is monitoring that observes the external state of a system or component by looking only at its inputs, outputs, and behavior. This type of monitoring can closely align with a user’s experience of a system, but is less useful for finding the cause of problems.
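
Two of the terms above, threshold and quantile, are easier to see in code. The sketch below is illustrative only: the function names, the sample values, and the choice of the 95th percentile are assumptions.

```python
from statistics import quantiles

def sustained_breach(values, threshold, min_consecutive):
    """True if values stay above threshold for min_consecutive samples in a row."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run >= min_consecutive:
            return True
    return False

def p95(values):
    """95th percentile: the cut point below which about 95% of values fall."""
    return quantiles(values, n=100)[94]

cpu = [40, 95, 42, 91, 93, 96, 45]  # hypothetical CPU % samples

# A single spike does not trigger the alert, a sustained run does.
spike_only = sustained_breach(cpu[:3], threshold=90, min_consecutive=3)
sustained = sustained_breach(cpu, threshold=90, min_consecutive=3)

latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]  # one outlier
latency_p95 = p95(latencies_ms)
print(spike_only, sustained, latency_p95)
```

Requiring several consecutive samples above the threshold is one way to avoid alerting on temporary spikes, which also helps against alert fatigue.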

The following are a few of the many potential metrics to draw from for an overall picture of system health:

CPU usage: This helps us to measure the load on a server’s processor. If CPU utilization is high, we may need to replace the hardware, so it can better manage all the services running on it. We can also redistribute the load to help avoid over-utilization.
Disk performance: Storage performance can affect your applications, so it helps us to monitor queued input/output (I/O) and disk latency for a sense of how often the disk is busy. High disk queue length may be caused by a storage performance issue, indicating it’s time to change RAID type or add physical disks.
Physical and virtual memory: By tracking physical memory, we can see where there’s the potential for bottlenecks, which indicates it’s necessary to add more RAM. Meanwhile, as virtual memory consumption goes up, more data moves to and from RAM, raising the chances of a bottleneck or swap file fragmentation happening.

Admins monitor server health through the following parameters:

Server availability and uptime: Servers should be “up” most of the time—think 99% of the time. If you start to drop below that, it’s time to pay attention.
Security: Tracking server security means keeping an eye on modifications, unauthorized access, and other security events (typically by scanning event logs).
System performance: Do you have what you need in place to support server performance? This includes metrics like CPU utilization, sufficient RAM, hard drive space, and bandwidth.
Application performance: Applications and services run on your servers, so it’s critical to understand how these processes are affecting performance and server load.
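
The 99% availability figure above translates into a concrete downtime budget. The arithmetic is sketched below. The function names are hypothetical and a year is taken as 365 days.

```python
def allowed_downtime_hours(availability, period_hours):
    """Hours of downtime permitted by an availability target over a period."""
    return (1 - availability) * period_hours

def measured_availability(up_checks, total_checks):
    """Fraction of health checks that found the server up."""
    return up_checks / total_checks

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

per_year = allowed_downtime_hours(0.99, HOURS_PER_YEAR)  # about 87.6 hours
per_30_days = allowed_downtime_hours(0.99, 30 * 24)      # about 7.2 hours

# Hypothetical month of one-minute health checks with 500 failures.
checks = 30 * 24 * 60
below_target = measured_availability(checks - 500, checks) < 0.99
print(per_year, per_30_days, below_target)
```

So a 99% target still allows roughly 3.65 days of downtime per year, which is why stricter targets such as 99.9% are often quoted for production services.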
[ ] https://openmetrics.io

anitsh / til

Observability, Monitoring, Analysis, Control #493
