OpenTracing, a CNCF project that reached incubating status, was a vendor-agnostic, standardised API that allowed engineers to instrument traces throughout their codebase. It allowed for the creation of instrumentation libraries that would essentially wrap around application code in order to record and report trace information.
OpenCensus, a Google project, is a set of libraries that let you collect application metrics and distributed traces in real time. Like OpenTracing, it required engineers to instrument their code with API calls, with the additional benefit of capturing metric data.
The problem engineers had with the two options above was deciding which one to use. Should they use OpenTracing for tracing and OpenCensus for metrics? Or should they use OpenCensus for both tracing and metrics? This is where OpenTelemetry came in.
OpenCensus is an open-source reincarnation of Google's internal Census libraries used for collecting tracing and metrics data. It took a different approach from OpenTracing (an open, vendor-neutral API to be incorporated into source code) by providing a concrete, opinionated implementation for capturing observability signals. This "batteries included" approach gave it an advantage over OpenTracing for software shipped as binaries, such as database engines and Kubernetes components, because the binaries could link to a known implementation rather than rely on the late binding to concrete tracers that the OpenTracing approach required. On the downside, since the OpenCensus APIs were tightly coupled to the implementations, it was difficult and often impossible to bind the instrumentation to different implementations even when users wanted to.
Both projects aimed to make observability easy for modern applications and to expedite wide adoption of distributed tracing by the software industry. It turned out that the approaches of the two projects were complementary rather than contradictory: there was no reason why we couldn't have both an abstract, vendor-neutral API and a well-supported default implementation. Enter OpenTelemetry!
OpenTelemetry (OTEL) was formed by merging OpenTracing and OpenCensus. Currently a CNCF sandbox project, and the second most active CNCF project in terms of contributions (Kubernetes being the first), OTEL has since its inception aimed to offer a single set of APIs and libraries that standardise how you collect and transfer telemetry data. OpenTelemetry is used not only for logging, but also for metric collection and tracing.
OTEL not only simplifies the choice; it also allows for cross-platform capability, with SDKs written in several different languages, and integrates with popular frameworks and libraries such as Spring, ASP.NET Core, and Express. Its architecture and SDKs allow companies to develop their own instrumentation libraries and analyse trace information with supported platforms.
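To make the API concrete, here is a minimal sketch of manual instrumentation with the OpenTelemetry Go API. The tracer name and span names are illustrative, and a real application would also configure an SDK tracer provider and exporter at startup:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// handleCheckout shows the shape of manual instrumentation: acquire a tracer,
// start a span, and end it when the work completes. Without an SDK configured,
// these calls are no-ops, which is what lets libraries instrument safely.
func handleCheckout(ctx context.Context) {
	tracer := otel.Tracer("example/checkout") // instrumentation scope name (illustrative)

	ctx, span := tracer.Start(ctx, "handleCheckout")
	defer span.End()

	chargeCard(ctx) // child work receives the span via ctx
}

func chargeCard(ctx context.Context) {
	tracer := otel.Tracer("example/checkout")
	_, span := tracer.Start(ctx, "chargeCard") // becomes a child span of handleCheckout
	defer span.End()
	// ... call the payment provider ...
}

func main() {
	handleCheckout(context.Background())
}
```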
The greatest promise of OpenTelemetry is not to solve some new problem that OpenTracing and OpenCensus did not solve. Instead, it is the promise of a single standard for observability instead of two competing ones. Towards that goal, the first GA versions of the OpenTelemetry libraries were intentionally narrowly scoped.
Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance/latency optimisation.
Distributed Tracing
A trace is a snapshot of all context and timing information as a request propagates through a service mesh or hits various services in a microservice architecture. Because trace generation is expensive, sampling is usually employed to capture, say, one of every N requests, or only requests matching specific criteria.
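As an illustration of head-based sampling, here is a sketch using the OpenTelemetry Go SDK's TraceIDRatioBased sampler; the 1% ratio is an arbitrary example, and exporter setup is omitted:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Keep roughly 1% of traces, decided deterministically from the trace ID,
	// so every service that sees the same trace makes the same keep/drop call.
	// ParentBased respects a sampling decision already made upstream.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))

	tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
	defer tp.Shutdown(context.Background())

	otel.SetTracerProvider(tp)
	// ... instrumented code as usual; spans in unsampled traces are non-recording ...
}
```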
Distributed tracing is one of the main pillars of observability (#493). It provides insights into the full lifecycle of requests in your system. If a user logs onto the website, depending on the complexity of your system architecture, they may hit a few services along the way, and each request they make will be traced through the system. A strong observability platform allows you to gain insights into the details of a trace and even uncover issues with your system that you were never able to see before. Where are the performance bottlenecks? Was there an error, and if so, where did it originate? Answering these questions allows you to improve your system's user experience.
One of the first things to go as we make our software systems distributed is our ability to observe and understand what the application as a whole is doing. We are usually well equipped with tools telling us what each individual component is doing, while being left in the dark about the big picture. Distributed or end-to-end tracing is one of the tools that can shine a light on the big picture and provide both the macro view of the system and the micro view of each individual request executed by the distributed application.
The most difficult form of observability is distributed tracing within and between application services. Creating this form of observability in an efficient and effective way requires strong experience and an understanding of the underlying principles of tracing requests that flow between services. If you can completely automate the process of creating distributed tracing observability (as Instana does) you will have found the Holy Grail of observability and monitoring.
Tracing makes your observable system more effective and allows you to identify the root cause of an issue in a distributed system. Tracing can be seen as the most important part of an observability implementation: understanding the causal relationships in your microservices architecture and being able to follow an issue from the effect to the cause, and vice versa.
There are different approaches used to implement tracing infrastructures. The most popular approach, used by almost all production-grade tracing systems today, is to pass certain metadata (aka trace context) along the path of the request execution, which can be used to correlate performance data collected from multiple components of the system and reassemble them into a coherent trace of the whole request. The underlying mechanism used to pass that trace context is called distributed context propagation.
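A minimal sketch of that mechanism, using the W3C Trace Context propagator from the OpenTelemetry Go API to carry the trace context over HTTP headers (the surrounding handler and client code are assumed):

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel/propagation"
)

var propagator = propagation.TraceContext{} // W3C `traceparent` header format

// Client side: inject the current trace context into outgoing request headers.
func send(ctx context.Context, req *http.Request) (*http.Response, error) {
	propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// Server side: extract the upstream trace context so that spans started from
// the returned context join the caller's trace instead of starting a new one.
func handler(w http.ResponseWriter, r *http.Request) {
	ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_ = ctx // start spans / do work with ctx here
}
```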
Distributed context propagation is a generic mechanism that can be used for purposes completely unrelated to end-to-end tracing. In the OpenTracing API it is called “baggage”, a term coined by Prof. Rodrigo Fonseca from Brown University, because it allows us to attach arbitrary data to a request and have this data automatically propagated by the framework to all downstream network calls made by the current microservice or component. The baggage is carried alongside the business requests and can be read at any point in the request execution.
Distributed context propagation is a very powerful technique, especially in new-age architectures built with many microservices. It allows us to associate arbitrary metadata with each request and make it available at every point in the distributed call graph, transparently to the services involved in processing that request.
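A small sketch of baggage in the OpenTelemetry Go API, where an upstream service attaches a key and a downstream service reads it back; the key and value here are illustrative:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/baggage"
)

func main() {
	ctx := context.Background()

	// Upstream: attach metadata to the request context. With a configured
	// propagator (propagation.Baggage{}), it travels on every downstream call.
	member, err := baggage.NewMember("tenant.id", "acme") // illustrative key/value
	if err != nil {
		panic(err)
	}
	bag, err := baggage.New(member)
	if err != nil {
		panic(err)
	}
	ctx = baggage.ContextWithBaggage(ctx, bag)

	// Downstream: any service on the call path can read the value back.
	fmt.Println(baggage.FromContext(ctx).Member("tenant.id").Value()) // "acme"
}
```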
Resource