HeinrichHartmann / svc


Observability #1

Open HeinrichHartmann opened 1 year ago

HeinrichHartmann commented 1 year ago

Home Lab Observability

In this post, we explore free observability tooling, using my home-lab setup as an example.

Design Goals

The OpenTelemetry Collector

The central hub of the setup is the OpenTelemetry (OTel) Collector.

OpenTelemetry is an industry standard that bridges a large number of vendors and open-source tools and covers all three telemetry pillars (logs, metrics, and traces). The OpenTelemetry Collector is a general data broker that accepts telemetry from a large variety of sources (the receivers) and can forward telemetry to a large variety of telemetry backend systems (the exporters).


```mermaid
flowchart LR
    subgraph "Telemetry Sources"
        A1[App1]
        A2[App2]
        A3[App3]
    end
    OtelCollector(("OpenTelemetry Collector"))
    subgraph "Telemetry Backends"
        B1[Backend1]
        B2[Backend2]
        B3[Backend3]
    end
    A1 -.-> |Traces, Metrics, Logs| OtelCollector
    A2 -.-> |Traces, Metrics, Logs| OtelCollector
    A3 -.-> |Traces, Metrics, Logs| OtelCollector
    OtelCollector -.-> |Traces, Metrics, Logs| B1
    OtelCollector -.-> |Traces, Metrics, Logs| B2
    OtelCollector -.-> |Traces, Metrics, Logs| B3
```

This means that:

  1. By sending all telemetry sources (logs, metrics, traces) to the OTel Collector, we make them generally available to all backend systems.
  2. By integrating a backend system against the OTel Collector, it gains access to all telemetry data at once.

There is something quite magical about the second property, as it makes experimenting with vendors extremely straightforward. Example: these four config lines are all that is needed to send all telemetry to a LightStep account:

```yaml
otlp/ls:
  endpoint: ingest.lightstep.com:443
  headers:
    "lightstep-access-token": ${env:LIGHTSTEP_TOKEN}
```

The source config file has more examples. While some of them required a bit more fine-tuning to make work, onboarding a new backend system is a matter of a few minutes, or hours in the worst case. Contrast this with the days before OTel, where switching vendors required deploying new agents across all hosts and possibly re-doing all instrumentation.
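To make the wiring concrete, here is a minimal sketch of how such an exporter plugs into the Collector's service pipelines (the `otlp` receiver name is an assumption; whatever receivers are configured work the same way):

```yaml
# Sketch: route everything the Collector receives to the otlp/ls exporter
# defined above (under the exporters: section). Receiver names are assumed.
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/ls]
    metrics:
      receivers: [otlp]
      exporters: [otlp/ls]
    logs:
      receivers: [otlp]
      exporters: [otlp/ls]
```

Onboarding another backend is then just a matter of defining one more exporter and appending it to these exporter lists.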

Synthetic Probing

The most important question any telemetry system has to answer is: "Is it up right now?".

The way I chose to check this in my setup is to probe each HTTP API every 15 seconds and see if I get back a 200 OK or 401 Unauthorized response. For other services (e.g. Samba), I check that a TCP connection can be established. This is how I present this information on the dashboard:

*(screenshot: up/down status dashboard)*

Implementation
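A hedged sketch of one way this probing could be implemented, assuming the Prometheus blackbox_exporter scraped by Prometheus every 15 seconds (module names and target hosts below are placeholders, not necessarily the actual setup used here):

```yaml
# blackbox.yml -- probe modules (assumed blackbox_exporter setup)
modules:
  http_2xx_or_401:
    prober: http
    timeout: 5s
    http:
      # Treat both 200 OK and 401 Unauthorized as "up".
      valid_status_codes: [200, 401]
  tcp_connect:
    prober: tcp
    timeout: 5s
```

```yaml
# prometheus.yml -- probe each HTTP API every 15 seconds (targets are placeholders)
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx_or_401]
    scrape_interval: 15s
    static_configs:
      - targets:
          - https://grafana.home.example
          - https://nextcloud.home.example
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The resulting `probe_success` metric can then be plotted as the up/down status panel shown above.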

Open Ends

Container Logs

When a service becomes unavailable the next step is usually to check the logs. This is the view that I built out for this purpose:

*(screenshot: container logs dashboard)*

The graph at the top shows the log volume we have for each container. The listing below the graph shows the logs themselves. The variable at the top allows isolating individual containers: if a container is selected, both panels are filtered to that specific container.

Implementation
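A rough sketch of how container logs can be fed into the OTel Collector, assuming Docker's default json-file logging driver and the contrib `filelog` receiver (the actual collection path used here may differ):

```yaml
receivers:
  filelog:
    # Docker's json-file driver writes one JSON object per log line to
    # /var/lib/docker/containers/<id>/<id>-json.log
    include:
      - /var/lib/docker/containers/*/*-json.log
    operators:
      # Parse the "log", "stream", and "time" fields into log attributes.
      - type: json_parser
```

From there, the logs can be shipped to the logging backend of choice (e.g. Loki) through a regular logs pipeline.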

Alternative

Open Ends

Access Logs

The next piece in our observability journey is to gain information about which requests have been made to HTTP APIs. Emitting access logs to a logging system is a problematic practice, since they are quite expensive to index, easily leak PII, and are hard to correlate across different services. Of course, these issues are not that relevant for a home-lab setup. However, with distributed tracing we have a much more powerful telemetry type at our disposal that covers access logs as a special case.

This is how it looks: *(screenshot)*

Implementation

Shortcomings

Somewhat surprisingly, the tracing backends I have tried do not seem to cater to the access-log use case particularly well. Ideally, I would like to have an "Access Log Tail" widget on a Grafana dashboard that gives me a live view of which requests are being served. With said backends, I did not get very far:

Open Ends

Host Telemetry
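Host-level metrics (CPU, memory, disk, filesystem, network) can be collected by the Collector itself via the contrib `hostmetrics` receiver; a minimal sketch (the scraper selection and interval are assumptions, not necessarily the setup used here):

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      load:
      disk:
      filesystem:
      network:
```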

Gaps

Active Operations. Tracing systems have information about which operations are currently ongoing. This is highly relevant information: just the count of open HTTP operations is already very valuable. Yet, I don't see any tooling that exposes this information.

Examples:

Access Logs. Which operations completed in the last few minutes is the next interesting question you may ask about a system: give me a list of requested URLs and their return codes. Again, tracing systems have this information, but they don't cater to this use case.

Condla commented 1 year ago

Hi, first of all: love the article :-). You've put a lot of effort into testing your way through this setup, and I appreciate the insights into what you like about Grafana Cloud and what your perceived challenges are.

Here are a few of the shortcomings you've perceived and a few tips and tricks on how to overcome them with Grafana Cloud:

> Ideally I would like to have an "Access Log Tail" widget on a Grafana dashboard that gives me a live view of which requests are being served.

You can pretty easily get there by using the auto-logging capability of the Grafana Agent. https://grafana.com/docs/tempo/latest/configuration/grafana-agent/automatic-logging/

> Jaeger and Tempo on Grafana Cloud currently have no possibility to show basic metadata about traces in the list. They allow filtering spans on basic metadata and present a list of span IDs with only minimal context information. This is not suitable for the access-log use case, which needs URL and HTTP return code information at the very least.

I think you can easily implement what you want to achieve using TraceQL + Metrics Generator + Autologging as linked above.

> There is also no OTel processor available that allows converting spans into logs.

Again: autologging in the Grafana Agent. To be clear, this is not an OTel processor, but an additional feature of the Grafana Agent. The Grafana Agent uses the OTel Collector and adds a bit of functionality such as auto-logging.

HeinrichHartmann commented 1 year ago

Thanks for the reply!

> You can pretty easily get there by using the auto-logging capability of the Grafana Agent.

Fundamentally I think that a tracing backend should be able to answer those queries. If we have to go down the route of doubling telemetry, I would prefer to do this in the OTel Collector. One important reason is that I don't run the Grafana Agent : )

Concerning RED metrics:

Condla commented 1 year ago

> Fundamentally I think that a tracing backend should be able to answer those queries.

Maybe I was thinking too much of a classical "log" rather than giving an answer that solves the question you actually have 😃 So please help me: what would you be doing with those live request logs? You're probably not watching potentially many different services and applications being called live? How would you generate value out of a "request log"? I suspect that TraceQL might be the answer here as well 😃

Condla commented 1 year ago

Plus, I actually just found out by poking around internally that we also moved this from the Agent to Tempo (as we've done previously with the metrics generator).

So Tempo can generate "request logs" based on the spans it receives, without the requirement of using a vendor-specific agent or feature. Even though this might satisfy your requirement and answer your question, I'm still interested in an answer to my questions above 😉