Improve Dashboard Observability 🔎

What would you like to be added: At the time of writing, the Gardener dashboard does not emphasize observability. Currently, mainly basic pod-level metrics such as RAM, CPU, and network usage can be captured. This issue proposes adding advanced observability capabilities to gain better insight into the running system.

Why is this needed: Advanced observability features bring many advantages. The reasons currently identified as most important are additional context for troubleshooting, easier root cause analysis during outages, and operational insights for optimizations.

Issue Purpose

This issue will serve as a central point for collecting ideas and will continuously evolve. The goal of this issue is also to narrow down the scope of the change request as the options are better understood.

ToDos

[x] #1411
[x] Expose (internal) Dashboard metrics (like active websocket connections etc.)
[x] Create (Grafana) Dashboards
[ ] Add prometheus metrics endpoint for terminal bootstrapper
[ ] (Distributed) Traces

Domain

Observability is a property of a system. A highly observable system allows its internal state to be inferred from its external outputs. The term originates in control theory and has recently also become popular in software engineering, especially for distributed systems. In that context, the literature often divides observability into three pillars:

Metrics
Logs
Traces

The ideas below are categorized by those subtopics.

Ideas

This section mainly contains links to interesting tooling worth evaluating. Some tools cannot be integrated through code changes. For those, compatibility should be ensured if they are accepted, which might result in documentation.

Metrics

Metrics are measurements of properties continuously emitted by components. Usually they are quantifiable or countable. Examples are CPU utilization, RAM usage, and network throughput.

Node.js:

Tooling ideas from issue #864:

Preoom: Retrieves & observes Kubernetes Pod resource (CPU, memory) utilisation
Iapetus: Prometheus metrics server
Lightship: Abstracts readiness, liveness and startup checks and graceful shutdown of Node.js services running in Kubernetes


Collectors:

cAdvisor: Captures general container metrics (might already be captured)

General properties:

Queue length
Requests
Active users
Caches
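
To make the metrics idea more concrete, here is a minimal sketch of how an internal value such as the number of active websocket connections could be exposed in the Prometheus text exposition format. It uses only the Node.js standard library. The `Gauge` class, the metric name `dashboard_active_websocket_connections`, and the helpers `renderMetrics` and `createMetricsServer` are hypothetical names chosen for illustration, not part of the dashboard code base. A real implementation would more likely rely on an existing Prometheus client library.

```javascript
'use strict';
const http = require('node:http');

// Hypothetical in-process gauge. It only tracks a single unlabeled value.
class Gauge {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.value = 0;
  }

  inc(n = 1) { this.value += n; }
  dec(n = 1) { this.value -= n; }

  // Renders the gauge in the Prometheus text exposition format.
  render() {
    return `# HELP ${this.name} ${this.help}\n` +
      `# TYPE ${this.name} gauge\n` +
      `${this.name} ${this.value}\n`;
  }
}

// Assumed metric name, for illustration only.
const activeWebsocketConnections = new Gauge(
  'dashboard_active_websocket_connections',
  'Number of currently open websocket connections.'
);

function renderMetrics(metrics) {
  return metrics.map((metric) => metric.render()).join('');
}

// Returns an HTTP server that serves the given metrics on GET /metrics.
// The caller decides when and where to listen.
function createMetricsServer(metrics) {
  return http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/metrics') {
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' });
      res.end(renderMetrics(metrics));
      return;
    }
    res.writeHead(404);
    res.end();
  });
}
```

The websocket layer would call `activeWebsocketConnections.inc()` when a connection opens and `dec()` when it closes. Starting the endpoint could look like `createMetricsServer([activeWebsocketConnections]).listen(9090)`, where the port is an assumption.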

Logs

Logs are information deliberately exposed by software components, such as errors. In distributed systems, one of the challenges is to collect and store them in a central place in a usable format.

Fluentd: Log collector
Loki: Log aggregation system
Roarr: JSON logger for Node.js and browser
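
A "usable format" for central collection usually means one structured record per line. The following is a minimal sketch of such a JSON line logger, assuming that a collector reads the process output. The helpers `formatLogLine` and `createLogger` and the field names `time`, `level`, and `message` are hypothetical and do not reflect the API of any of the tools listed above.

```javascript
'use strict';

// Formats one log record as a single JSON line.
// Fixed fields are written last so that context cannot overwrite them.
function formatLogLine(level, message, context = {}, now = new Date()) {
  return JSON.stringify({
    ...context,
    time: now.toISOString(),
    level,
    message,
  });
}

// Creates a logger that writes one JSON record per line.
// The write function is injectable so the logger can be tested without stdout.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const logAt = (level) => (message, context) => write(formatLogLine(level, message, context));
  return {
    info: logAt('info'),
    warn: logAt('warn'),
    error: logAt('error'),
  };
}
```

A call such as `createLogger().error('request failed', { requestId: 'abc' })` would then emit a single line that a log collector can parse without custom rules. The `requestId` field is only an example of a correlation key.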

(Distributed) Traces

Distributed tracing enables profiling and easier troubleshooting within distributed systems. It provides insight into how requests propagate through a system and helps to correlate logs. Applications involved must be instrumented to emit spans.

Jaeger: End-to-end distributed tracing
OpenTelemetry: Instrumentation
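
To illustrate what "instrumented to emit spans" involves, here is a rough sketch of span creation and context propagation using the W3C Trace Context `traceparent` header format. It is hand-rolled for illustration and is not the OpenTelemetry API. The function names `parseTraceparent`, `startSpan`, and `toTraceparent` and the span object shape are assumptions. In practice an instrumentation library would handle this.

```javascript
'use strict';
const { randomBytes } = require('node:crypto');

// traceparent: version "00", 32 hex trace id, 16 hex parent span id, 2 hex flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed header, or null if it is missing or malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  // All-zero trace ids and span ids are invalid.
  if (!match || /^0+$/.test(match[1]) || /^0+$/.test(match[2])) {
    return null;
  }
  return { traceId: match[1], parentSpanId: match[2], flags: match[3] };
}

// Starts a span. It continues the incoming trace if a valid header is given,
// otherwise it starts a new trace.
function startSpan(name, incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  return {
    name,
    traceId: parent ? parent.traceId : randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'),
    parentSpanId: parent ? parent.parentSpanId : null,
    flags: parent ? parent.flags : '01', // '01' marks the trace as sampled
    startTime: Date.now(),
  };
}

// Builds the header value to send with outgoing requests.
function toTraceparent(span) {
  return `00-${span.traceId}-${span.spanId}-${span.flags}`;
}
```

A service would call `startSpan` with the `traceparent` header of an incoming request and attach `toTraceparent(span)` to every outgoing request. Because all spans of one request share the same `traceId`, that id can also be added to log records to correlate logs with traces.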

/cc @petersutter @grolu @holgerkoser
