gardener / dashboard

Web-based GUI for Gardener installations.
Apache License 2.0

Improve Dashboard Observability 🔎 #967

Closed andre-dossinger closed 1 year ago

andre-dossinger commented 3 years ago

What would you like to be added: At the time of writing, the Gardener dashboard places little emphasis on observability. Currently only basic pod-level metrics such as RAM, CPU, and network usage can be captured. This issue proposes adding advanced observability capabilities to gain better insight into the running system.

Why is this needed: Advanced observability features bring several advantages. The most important ones identified so far are additional context for troubleshooting, easier root cause analysis during outages, and operational insights that can guide optimizations.

Issue Purpose

This issue will serve as a central point for collecting ideas and will continuously evolve. The goal of this issue is also to narrow down the scope of the change request as the options are better understood.

ToDos

Domain

Observability is a property of a system. A well-observable system allows its internal state to be inferred from its external outputs. The term originates in control theory and has recently become popular in software engineering as well, especially for distributed systems. In that context the literature often divides observability into three pillars: metrics, logs, and (distributed) traces.

The ideas below are categorized by those subtopics.

Ideas

This section mainly contains links to interesting tooling worth evaluating. Some tools cannot be integrated through code changes. If such a tool is accepted, compatibility with it should still be ensured, which might result in documentation only.

Metrics

Metrics are measurements of properties continuously emitted by components. Usually they are quantifiable or countable. Examples are CPU utilization, RAM usage, and network throughput.
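To make the idea of a countable metric concrete, here is a minimal sketch of an in-memory counter rendered in a Prometheus-style text exposition format. The issue itself does not prescribe a format or a library, so the format choice and every name here (`Counter`, `http_requests_total`, the labels) are assumptions for illustration only.

```typescript
// Hypothetical sketch: a counter with labels, rendered as Prometheus-style text.
// Nothing here is taken from the dashboard code base.

type Labels = Record<string, string>;

// Escape backslash, double quote and newline inside a label value.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Turn { method: "GET" } into {method="GET"}; keys are sorted for stable output.
function serializeLabels(labels: Labels): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((key) => `${key}="${escapeLabelValue(labels[key])}"`);
  return pairs.length > 0 ? `{${pairs.join(",")}}` : "";
}

class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increase the counter for one label combination; counters never decrease.
  inc(labels: Labels = {}, amount: number = 1): void {
    if (amount < 0) throw new Error("counters can only increase");
    const key = serializeLabels(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }

  // Render HELP and TYPE headers followed by one sample line per label set.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const [key, value] of this.values) {
      lines.push(`${this.name}${key} ${value}`);
    }
    return lines.join("\n") + "\n";
  }
}

const requests = new Counter("http_requests_total", "Total HTTP requests handled.");
requests.inc({ method: "GET", status: "200" });
requests.inc({ method: "GET", status: "200" });
requests.inc({ method: "POST", status: "500" });
console.log(requests.render());
```

A collector would typically scrape such text from an HTTP endpoint at regular intervals. In practice an established metrics client library would likely replace this hand-rolled class.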

Node.js:

Tooling ideas from issue #864:

Issue #864

Collectors:

General properties:

Logs

Logs are information that software components deliberately expose, such as errors and other notable events. In distributed systems one of the challenges is to collect them and store them in a central place in a usable format.
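One common way to get a "usable format" for central collection is to emit one JSON object per log line with a fixed set of core fields. The sketch below illustrates that idea only. The field names (`time`, `level`, `component`, `message`) and the component name are assumptions, not something this issue or the dashboard defines.

```typescript
// Hypothetical sketch: structured logging as one JSON object per line.

type LogLevel = "debug" | "info" | "warn" | "error";

interface LogRecord {
  time: string; // ISO 8601 timestamp in UTC
  level: LogLevel;
  component: string; // which service or module emitted the record
  message: string;
  [field: string]: unknown; // optional extra context
}

// Build a single-line JSON record. Core fields are written last so that
// extra fields can never overwrite them.
function formatLog(
  level: LogLevel,
  component: string,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  const record: LogRecord = {
    ...fields,
    time: now.toISOString(),
    level,
    component,
    message,
  };
  return JSON.stringify(record);
}

console.log(
  formatLog("error", "example-backend", "upstream request failed", { statusCode: 503 })
);
```

Because every line parses as JSON with the same core keys, a central store can filter by `level` or `component` without per-service parsing rules.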

(Distributed) Traces

Distributed tracing enables profiling and easier troubleshooting within distributed systems. It provides insights into how requests propagate through a system and helps to correlate logs. Applications involved must be instrumented to emit spans.
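The propagation step can be illustrated with a small sketch: a service reads the trace context from an incoming request, starts a child span that keeps the same trace ID, and forwards the updated context. The header layout follows the W3C Trace Context `traceparent` format (version 00). That choice is an assumption, since the issue names no tracing standard, and all function names are hypothetical.

```typescript
// Hypothetical sketch: parsing and forwarding a W3C-style traceparent header.

interface SpanContext {
  traceId: string; // 32 lowercase hex chars, shared by all spans of one request
  spanId: string; // 16 lowercase hex chars, unique per span
  sampled: boolean; // lowest bit of the trace flags
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Return the parsed context, or null if the header is malformed.
function parseTraceparent(header: string): SpanContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid in the traceparent format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Non-cryptographic random hex string; good enough for a sketch only.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// A child span keeps the trace ID and sampling decision but gets its own span ID.
function startChildSpan(parent: SpanContext, spanId: string = randomHex(16)): SpanContext {
  return { traceId: parent.traceId, spanId, sampled: parent.sampled };
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parentCtx = parseTraceparent(incoming);
if (parentCtx !== null) {
  const child = startChildSpan(parentCtx);
  // Send this value as the traceparent header on outgoing requests.
  console.log(formatTraceparent(child));
}
```

Because every span carries the same trace ID, writing that ID into log records is what allows logs from different services to be correlated with one request.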

/cc @petersutter @grolu @holgerkoser

grolu commented 1 year ago

/close We added a metrics endpoint to the dashboard that exports basic metrics. Additional metrics can be added easily (not part of this issue). Adding distributed traces / logs is currently out of scope.