Open anitsh opened 3 years ago
Resources available for tracking: Depending on your human resources, infrastructure, and budget, you will have to limit the scope of what you keep track of to what you can afford to implement and reasonably manage.
The complexity and purpose of your application: The complexity of your application or systems can have a large impact on what you choose to track. Items that might be mission critical for some software might not be important at all in others.
The deployment environment: While robust monitoring is most important for production systems, staging and testing systems also benefit from monitoring, though there may be differences in severity, granularity, and the overall metrics measured.
The likelihood of the metric being useful: One of the most important factors affecting whether something is measured is its potential to help in the future. Each additional metric tracked increases the complexity of the system and takes up resources. The necessity of data can change over time as well, requiring reevaluation at regular intervals.
How essential stability is: Simply put, stability and uptime might not be priorities for certain types of personal or early stage projects.
When dealing with horizontally scaled infrastructure, another layer of infrastructure you will need to add metrics for is pools of servers. While metrics about individual servers are useful, at scale a service is better represented as the ability of a collection of machines to perform work and respond adequately to requests. This type of metric is in many ways just a higher level extrapolation of application and server metrics, but the resources in this case are homogeneous servers instead of machine-level components. Some data you might want to track are:
Collecting data that summarizes the health of collections of servers is important for understanding the actual capabilities of your system to handle load and respond to changes.
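As a rough illustration of rolling individual server readings up into pool-level health, here is a minimal sketch. The `ServerSample` fields and the `summarize_pool` helper are hypothetical names chosen for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ServerSample:
    """One server's most recent readings (hypothetical fields)."""
    name: str
    healthy: bool
    cpu_percent: float
    requests_per_second: float


def summarize_pool(samples):
    """Roll per-server samples up into pool-level figures."""
    healthy = [s for s in samples if s.healthy]
    return {
        "servers_total": len(samples),
        "servers_healthy": len(healthy),
        "healthy_fraction": len(healthy) / len(samples) if samples else 0.0,
        # Averages and totals only count servers that can take traffic.
        "mean_cpu_percent": mean(s.cpu_percent for s in healthy) if healthy else 0.0,
        "total_requests_per_second": sum(s.requests_per_second for s in healthy),
    }


pool = [
    ServerSample("web-1", True, 40.0, 120.0),
    ServerSample("web-2", True, 60.0, 180.0),
    ServerSample("web-3", False, 0.0, 0.0),
]
summary = summarize_pool(pool)
print(summary)
```

The point of the summary is that one degraded server shows up as a drop in `healthy_fraction` for the pool rather than as an isolated per-machine alarm.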
The following are a few of the many potential metrics to draw from for an overall picture of system health:
Admins monitor server health through the following parameters:
Server availability and uptime: Servers should be “up” most of the time—think 99% of the time. If you start to drop below that, it’s time to pay attention.
Security: Tracking server security means keeping an eye on modifications, unauthorized access, and other security events (typically by scanning event logs).
System performance: Do you have what you need in place to support server performance? This includes metrics like CPU utilization, sufficient RAM, hard drive space, and bandwidth.
Application performance: Applications and services run on your servers, so it’s critical to understand how these processes are affecting performance and server load.
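To make the availability figure concrete, the sketch below turns an uptime target into a downtime budget and checks an observed period against it. The function names and the 30-day window are assumptions for illustration only.

```python
def availability_percent(up_seconds, total_seconds):
    """Share of the period during which the server was up, in percent."""
    return 100.0 * up_seconds / total_seconds


def allowed_downtime_seconds(target_percent, period_seconds):
    """Downtime budget implied by an availability target over a period."""
    return period_seconds * (1.0 - target_percent / 100.0)


THIRTY_DAYS = 30 * 24 * 60 * 60  # assumed reporting window, in seconds

# A 99% target over 30 days leaves a budget of roughly 7.2 hours.
budget = allowed_downtime_seconds(99.0, THIRTY_DAYS)

# Example: the server was down for three hours in the window.
observed = availability_percent(THIRTY_DAYS - 3 * 3600, THIRTY_DAYS)
meets_target = observed >= 99.0

print(f"budget: {budget / 3600:.1f} h, observed: {observed:.2f}%, ok: {meets_target}")
```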
Scientific inquiry starts with observation. The more one can see, the more one can investigate - Martin Chalfie
You can only monitor a system that’s observable.
Observability has long been a part of scientific process improvement. In fact, it is a natural process. For example, when we watch a football game, we understand why one team lost and the other won. An ordinary fan just watches the game and does not store every detail, since it is not needed for a retrospective, but the manager and the coaching team keep records. The best source of data would be a video recording together with other metrics for specific players. So observing is really something we do naturally. Let's not feel overwhelmed, as if we had just invented fire.
Although we observe everything in our daily lives, we do not need to keep records of every piece of data (except, probably, our bank accounts), but businesses and systems keep specific (or all) records as needed. These records are reviewed constantly to improve the process.
Lean and Six Sigma are sets of techniques for process improvement. Lean management and Six Sigma share similar methodologies and tools, and both were influenced by Japanese business culture. However, lean management primarily focuses on eliminating waste through tools that target organizational efficiencies while integrating a performance improvement system, whereas Six Sigma focuses on eliminating defects and reducing variation. Both systems are driven by data, though Six Sigma is much more dependent on accurate data. So I believe Six Sigma would be more appropriate for improving observability.
Six Sigma projects follow two project methodologies:
DMAIC is used for projects aimed at improving an existing business process
DMADV is used for projects aimed at creating new product or process designs
I think DMAIC would be more useful for observability. The DMAIC project methodology has five phases: Define, Measure, Analyze, Improve, and Control.
Some organizations add a Recognize step at the beginning, which is to recognize the right problem to work on, thus yielding an RDMAIC methodology.
OBSERVE EVERYTHING
Application monitoring has changed significantly, with applications now split across many microservices and clusters. When an organization grows to hundreds or thousands of containers, it is no longer possible to monitor individual components using legacy systems. Through programmatic observability of microservice application logs, attackers scanning or trying to exploit applications can be identified easily.
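One way to picture spotting a scanner from application logs is sketched below. The log format (`<client_ip> <method> <path> <status>`), the sample lines, and the threshold of three distinct missing paths are all assumptions made for this example; real access logs and detection rules will differ.

```python
from collections import defaultdict

# Hypothetical, simplified access-log lines: "<client_ip> <method> <path> <status>"
LOG_LINES = [
    "10.0.0.5 GET /index.html 200",
    "203.0.113.9 GET /admin.php 404",
    "203.0.113.9 GET /wp-login.php 404",
    "203.0.113.9 GET /.env 404",
    "10.0.0.5 GET /missing.png 404",
]


def suspected_scanners(lines, min_distinct_misses=3):
    """Return client IPs that requested many distinct non-existent paths."""
    misses = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        ip, _method, path, status = parts
        if status == "404":
            misses[ip].add(path)
    return sorted(ip for ip, paths in misses.items() if len(paths) >= min_distinct_misses)


flagged = suspected_scanners(LOG_LINES)
print(flagged)  # ['203.0.113.9']
```

Counting distinct paths rather than raw 404s keeps a client that retries one broken link from being flagged.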
Observability is the ability to infer internal states of a system based on the system’s external outputs. In control theory, observability is a mathematical dual (follows a direct conceptual mapping) to controllability, which is the ability to control internal states of a system by manipulating external inputs. In practice, however, controllability is difficult to evaluate mathematically; therefore, system observability is the method for evaluating outputs to reach meaningful conclusions about internal states of the system.
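For the control-theory sense of the term, the standard rank test can be written out as follows. This assumes a linear time-invariant system, which the text above does not specify; the symbols $A$, $B$, $C$, $x$, $u$, $y$ and $n$ are introduced here for illustration.

```latex
\[
\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t), \qquad x(t) \in \mathbb{R}^{n}
\]
\[
\mathcal{O} =
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
(A, C)\ \text{observable} \iff \operatorname{rank}\mathcal{O} = n
\]
\[
(A, C)\ \text{observable} \iff (A^{\top}, C^{\top})\ \text{controllable}
\]
```

The last line is the duality mentioned above: the observability test for $(A, C)$ is the controllability test applied to the transposed pair.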
Monitoring is defined as the actions involved in observability: observing the quality of system performance over a period of time. The monitoring action, which tools and processes support, can describe the performance, health, and relevant characteristics of a system’s internal states. In enterprise IT, monitoring refers specifically to the process of translating infrastructure log and metric data into meaningful and actionable insights.
Monitoring and observability are in a symbiotic relationship. Observability is achieved when data is made available from within the system that you wish to monitor; monitoring is the actual task of collecting and displaying this data. A system monitor is a hardware or software component used to monitor system resources and performance in a computer system. Management issues around the use of system monitoring tools include resource usage and privacy.
Observability and monitoring complement each other, with each one serving a different purpose. Monitoring tells you when something is wrong, while observability enables you to understand why. Monitoring is a subset of and key action for observability.
Monitoring tracks the overall health of an application. It aggregates data on how the system is performing in terms of access speeds, connectivity, downtime, and bottlenecks. Observability, on the other hand, drills down into the “what” and “why” of application operations, by providing granular and contextual insight into its specific failure modes.
While monitoring provides answers only for known problems or occurrences, software instrumented for observability allows developers to ask new questions in order to debug a problem or gain insight into the general state of what is typically a dynamic system with changing complexities and unknown permutations.
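The difference can be sketched with a handful of structured events. The event fields, the sample data, and the 25% alert threshold below are hypothetical. The first check answers a question fixed in advance, while `errors_by` lets you slice the same events by whichever field turns out to matter.

```python
import json

# Hypothetical request events, one JSON object per line.
RAW_EVENTS = """
{"route": "/checkout", "status": 500, "region": "eu", "version": "1.4.2"}
{"route": "/checkout", "status": 200, "region": "us", "version": "1.4.1"}
{"route": "/cart", "status": 500, "region": "eu", "version": "1.4.2"}
{"route": "/cart", "status": 200, "region": "us", "version": "1.4.1"}
{"route": "/checkout", "status": 200, "region": "eu", "version": "1.4.1"}
""".strip().splitlines()

events = [json.loads(line) for line in RAW_EVENTS]


def error_rate(evts):
    """Fraction of events with a 5xx status."""
    return sum(e["status"] >= 500 for e in evts) / len(evts)


def errors_by(evts, field):
    """Count 5xx events grouped by an arbitrary field chosen at query time."""
    counts = {}
    for e in evts:
        if e["status"] >= 500:
            counts[e[field]] = counts.get(e[field], 0) + 1
    return counts


# Monitoring: a known question with a fixed threshold.
alert = error_rate(events) > 0.25

# Observability: new questions asked after the alert fires.
by_version = errors_by(events, "version")
by_route = errors_by(events, "route")
print(alert, by_version, by_route)
```

In this sample the alert only says that errors are high; grouping by `version` is what points at a single release.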
There are three main pillars of observability: logs, metrics, and traces.
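The pillars are commonly given as logs, metrics, and traces. The sketch below shows one request producing all three using only the Python standard library. The `span` helper, the `metrics` counter, and the `handle_request` function are hypothetical stand-ins for what a real tracing or metrics library would provide.

```python
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("demo")

metrics = Counter()  # metrics: numeric aggregates over many requests
spans = []           # traces: timed units of work sharing one trace id


@contextmanager
def span(name, trace_id):
    """Record how long the wrapped block took, tagged with the trace id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(path):
    trace_id = uuid.uuid4().hex
    with span("handle_request", trace_id):
        metrics["requests_total"] += 1
        with span("load_data", trace_id):
            time.sleep(0.01)  # stand-in for real work
        # logs: a discrete event carrying the context needed to find the trace
        log.info("handled path=%s trace_id=%s", path, trace_id)
    return trace_id


tid = handle_request("/home")
print(metrics["requests_total"], [s["name"] for s in spans])
```

Putting the same `trace_id` in the log line and in every span is what lets the three kinds of data be joined later.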
Resource
Reference
408 #186 #393
Tools
385 #246 #368 #18
Popular tools
Prometheus focuses primarily on gathering time-series data, Grafana on monitoring and reporting time-series metrics, Kibana on log search, and Jaeger on root-causing specific issues with a service mesh or with dependencies.
The caveat is the level of expertise required to build a solution with these open-source products: it will be very much a DIY experience.
Resource