HeinrichHartmann / svc


Observability #1

Open HeinrichHartmann opened 1 year ago

HeinrichHartmann commented 1 year ago

Home Lab Observability

In this post, we explore free observability tooling, using my home-lab setup as an example.

Design Goals

The OpenTelemetry Collector

The central hub of the setup is the OpenTelemetry (OTel) Collector.

OpenTelemetry is an industry standard that bridges a large number of vendors and open-source tools and covers all three telemetry pillars (logs, metrics, and traces). The OpenTelemetry Collector is a general data broker that accepts telemetry from a large variety of sources (the receivers) and can forward telemetry to a large variety of telemetry backend systems (the exporters).


```mermaid
flowchart LR
    subgraph "Telemetry Sources"
        A1[App1]
        A2[App2]
        A3[App3]
    end
    OtelCollector(("OpenTelemetry Collector"))
    subgraph "Telemetry Backends"
        B1[Backend1]
        B2[Backend2]
        B3[Backend3]
    end
    A1 -.-> |Traces, Metrics, Logs| OtelCollector
    A2 -.-> |Traces, Metrics, Logs| OtelCollector
    A3 -.-> |Traces, Metrics, Logs| OtelCollector
    OtelCollector -.-> |Traces, Metrics, Logs| B1
    OtelCollector -.-> |Traces, Metrics, Logs| B2
    OtelCollector -.-> |Traces, Metrics, Logs| B3
```

This means that:

  1. By sending all telemetry sources (logs, metrics, traces) to the OTel Collector, we make them generally available to all backend systems.
  2. By integrating a backend system against the OTel Collector, it gains access to all telemetry data at once.

There is something quite magical about the second property, as it makes experimenting with vendors extremely straightforward. Example: these four config lines are all that is needed to send all telemetry to a LightStep account:

```yaml
otlp/ls:
  endpoint: ingest.lightstep.com:443
  headers:
    "lightstep-access-token": ${env:LIGHTSTEP_TOKEN}
```

The source config file has more examples. While some of them required a bit more fine-tuning to make work, onboarding a new backend system is a matter of a few minutes, or hours in the worst case. Contrast this with the days before OTel, where switching vendors required deploying new agents across all hosts and possibly re-doing all instrumentation.
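To make the wiring concrete, here is a minimal sketch of how such an exporter plugs into the Collector's service pipelines (the `otlp` receiver name is an assumption; whatever receivers are configured work the same way):

```yaml
# Sketch: route everything the Collector receives to the otlp/ls exporter
# defined above (under the exporters: section). Receiver names are assumed.
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/ls]
    metrics:
      receivers: [otlp]
      exporters: [otlp/ls]
    logs:
      receivers: [otlp]
      exporters: [otlp/ls]
```

Onboarding another backend is then just a matter of defining one more exporter and appending it to these exporter lists.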

Synthetic Probing

The most important question any telemetry system has to answer is: "Is it up right now?".

The way I chose to check this in my setup is to probe each HTTP API every 15 seconds and see if I get back a 200 OK or 401 Unauthorized response. For other services (e.g. Samba), I check that a TCP connection can be established. This is how I present this information on the dashboard:

*(screenshot: up/down status dashboard)*

Implementation
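A hedged sketch of one way this probing could be implemented, assuming the Prometheus blackbox_exporter scraped by Prometheus every 15 seconds (module names and target hosts below are placeholders, not necessarily the actual setup used here):

```yaml
# blackbox.yml -- probe modules (assumed blackbox_exporter setup)
modules:
  http_2xx_or_401:
    prober: http
    timeout: 5s
    http:
      # Treat both 200 OK and 401 Unauthorized as "up".
      valid_status_codes: [200, 401]
  tcp_connect:
    prober: tcp
    timeout: 5s
```

```yaml
# prometheus.yml -- probe each HTTP API every 15 seconds (targets are placeholders)
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx_or_401]
    scrape_interval: 15s
    static_configs:
      - targets:
          - https://grafana.home.example
          - https://nextcloud.home.example
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The resulting `probe_success` metric can then be plotted as the up/down status panel shown above.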

Open Ends

Container Logs

When a service becomes unavailable the next step is usually to check the logs. This is the view that I built out for this purpose:

*(screenshot: container logs dashboard)*

The graph at the top shows the log volume we have for each container. The listing below the graph shows the logs themselves. The variable at the top allows isolating individual containers: if a container is selected, both panels are filtered to that specific container.

Implementation
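A rough sketch of how container logs can be fed into the OTel Collector, assuming Docker's default json-file logging driver and the contrib `filelog` receiver (the actual collection path used here may differ):

```yaml
receivers:
  filelog:
    # Docker's json-file driver writes one JSON object per log line to
    # /var/lib/docker/containers/<id>/<id>-json.log
    include:
      - /var/lib/docker/containers/*/*-json.log
    operators:
      # Parse the "log", "stream", and "time" fields into log attributes.
      - type: json_parser
```

From there, the logs can be shipped to the logging backend of choice (e.g. Loki) through a regular logs pipeline.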

Alternative

Open Ends

Access Logs

The next piece in our observability journey is to gain information about which requests have been made to HTTP APIs. Emitting access logs to a logging system is a problematic practice, since they are quite expensive to index, easily leak PII, and are hard to correlate across different services. Of course, these issues are not that relevant for a home-lab setup. However, with distributed tracing we have a much more powerful telemetry type at our disposal that covers access logs as a special case.

This is how it looks: *(screenshot)*

Implementation

Shortcomings

Somewhat surprisingly, the tracing backends I have tried do not seem to cater to the access-log use case particularly well. Ideally, I would like to have an "Access Log Tail" widget on a Grafana dashboard that gives me a live view of which requests are being served. With said backends, I did not get very far:

Open Ends

Host Telemetry
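Host-level metrics (CPU, memory, disk, filesystem, network) can be collected by the Collector itself via the contrib `hostmetrics` receiver; a minimal sketch (the scraper selection and interval are assumptions, not necessarily the setup used here):

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      load:
      disk:
      filesystem:
      network:
```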

Gaps

Active Operations. Tracing systems have information about which operations are currently ongoing. This is highly relevant information: just the count of open HTTP operations is already very valuable. Yet, I don't see any tooling that exposes this information.

Examples:

Access Logs. Which operations completed in the last few minutes is the next interesting question you may ask about a system: give me a list of requested URLs and their return codes. Again, tracing systems have this information, but they don't cater to this use case.

Condla commented 1 year ago

Hi, first of all: love the article :-). You've put a lot of effort into testing your way through this setup, and I appreciate the insights into what you like about Grafana Cloud and what your perceived challenges are.

Here are a few of the shortcomings you've perceived and a few tips and tricks on how to overcome them with Grafana Cloud:

> Ideally I would like to have an "Access Log Tail" widget on a Grafana dashboard that gives me a live view of which requests are being served.

You can pretty easily get there by using the auto-logging capability of the Grafana Agent. https://grafana.com/docs/tempo/latest/configuration/grafana-agent/automatic-logging/

> Jaeger and Tempo on Grafana Cloud currently have no possibility to show basic metadata about traces in the list. They allow filtering spans on basic metadata and present a list of span IDs with only minimal context information. This is not suitable for the access-log use case, which needs URL and HTTP return code information at the very least.

I think you can easily implement what you want to achieve using TraceQL + Metrics Generator + Autologging as linked above.

> There is also no OTel processor available that allows converting spans into logs.

Again: autologging in the Grafana Agent. To be clear, this is not an OTel processor, but an additional feature of the Grafana Agent. The Grafana Agent uses the OTel Collector and adds a bit of functionality such as auto-logging.

HeinrichHartmann commented 1 year ago

Thanks for the reply!

> You can pretty easily get there by using the auto-logging capability of the Grafana Agent.

Fundamentally I think that a tracing backend should be able to answer those queries. If we have to go down the route of doubling telemetry, I would prefer to do this in the OTel Collector. One important reason is that I don't run the Grafana Agent : )

Concerning RED metrics:

Condla commented 1 year ago

> Fundamentally I think that a tracing backend should be able to answer those queries.

Maybe I was thinking too much of a classical "log" rather than giving an answer that solves the question you actually have 😃 So please help me: what would you be doing with those live request logs? You're probably not watching potentially many different services and applications being called live? How would you generate value out of a "request log"? I suspect that TraceQL might be the answer here as well 😃

Condla commented 1 year ago

Plus, I actually just found out by poking around internally that we also moved this from the Agent to Tempo (as we've done previously with the metrics generator).

So Tempo can generate "request logs" based on the spans it receives, without the requirement of using a vendor-specific agent or feature. Even though this might satisfy your requirement and answer your question, I'm still interested in an answer to my questions above 😉