What would you like to be added: At the time of writing, the Gardener dashboard does not emphasize observability. At the moment, mainly basic pod-based metrics such as RAM, CPU, and network usage can be captured. This issue proposes adding advanced observability capabilities to gain better insight into the running system.
Why is this needed: Advanced observability features bring many advantages. The reasons currently identified as most important are additional context for troubleshooting, easier root cause analysis in case of outages, and operational insights for optimizations.
Issue Purpose
This issue will serve as a central point for collecting ideas and will continuously evolve. The goal of this issue is also to narrow down the scope of the change request as the options are better understood.
ToDos
[x] #1411
[x] Expose (internal) Dashboard metrics (like active websocket connections etc.)
[x] Create (Grafana) Dashboards
[ ] Add Prometheus metrics endpoint for terminal bootstrapper
[ ] (Distributed) Traces
Domain
Observability is a property of a system. A well-observable system allows its internal state to be inferred from its external outputs. The term originates in control theory and has recently also become popular in software engineering, especially in the field of distributed systems. In that context, the literature often divides observability into three pillars:
Metrics
Logs
Traces
The ideas below are categorized by those subtopics.
Ideas
This section mainly contains links to interesting tooling worth evaluating. Some tools cannot be integrated through code changes. For those, if accepted, compatibility should be ensured, which might only result in documentation.
Metrics
Metrics are measurements of properties continuously emitted by components. Usually they are quantifiable or countable. Examples are CPU utilization, RAM usage, or network throughput.
cAdvisor: Captures general container metrics (might already be captured)
Tooling ideas: see issue #864
General properties:
Queue length
Requests
Active users
Caches
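As an illustration of how general properties such as request counts or active connections could be captured, here is a minimal sketch of an in-memory registry that renders values in the Prometheus text exposition format. It is not the dashboard's actual implementation: the class `MetricsRegistry` and the metric names `dashboard_requests_total` and `dashboard_active_connections` are hypothetical, and a real service would more likely rely on an existing client library (for Node.js, for example, `prom-client`).

```typescript
// Hypothetical sketch: a tiny metrics registry without external dependencies.
type MetricKind = "counter" | "gauge";

interface Metric {
  name: string;
  help: string;
  kind: MetricKind;
  value: number;
}

class MetricsRegistry {
  private metrics: Metric[] = [];

  // Every metric starts at 0 and keeps its registration order.
  register(name: string, help: string, kind: MetricKind): void {
    this.metrics.push({ name: name, help: help, kind: kind, value: 0 });
  }

  // Counters may only grow (e.g. total requests); gauges accept any delta.
  inc(name: string, by: number = 1): void {
    const metric = this.lookup(name);
    if (metric.kind === "counter" && by < 0) {
      throw new Error("counter " + name + " cannot decrease");
    }
    metric.value += by;
  }

  // Gauges can be set to an arbitrary value (e.g. queue length, open connections).
  set(name: string, value: number): void {
    const metric = this.lookup(name);
    if (metric.kind !== "gauge") {
      throw new Error(name + " is not a gauge");
    }
    metric.value = value;
  }

  // Render all metrics as HELP/TYPE/sample lines in the text exposition format.
  render(): string {
    const lines: string[] = [];
    for (const metric of this.metrics) {
      lines.push("# HELP " + metric.name + " " + metric.help);
      lines.push("# TYPE " + metric.name + " " + metric.kind);
      lines.push(metric.name + " " + metric.value);
    }
    return lines.join("\n") + "\n";
  }

  private lookup(name: string): Metric {
    for (const metric of this.metrics) {
      if (metric.name === name) {
        return metric;
      }
    }
    throw new Error("unknown metric " + name);
  }
}

// Usage with assumed metric names.
const registry = new MetricsRegistry();
registry.register("dashboard_requests_total", "Total requests handled.", "counter");
registry.register("dashboard_active_connections", "Open websocket connections.", "gauge");
registry.inc("dashboard_requests_total");
registry.inc("dashboard_requests_total");
registry.set("dashboard_active_connections", 3);
const exposition = registry.render();
```

The string returned by `render()` is what a metrics endpoint would serve as its response body so that a scraper can collect it periodically.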
Logs
Logs are information deliberately exposed by software components, such as errors. In distributed systems, one of the challenges is collecting and storing them in a central place in a usable format.
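One common way to keep logs usable after central collection is to emit each entry as a single structured JSON line. The following is only a sketch under that assumption. The field names, the `createLogger` helper, the component name `dashboard-backend`, and the request ID `req-42` are all hypothetical and do not describe the dashboard's current logging.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogRecord {
  time: string;
  level: LogLevel;
  component: string;
  message: string;
  requestId?: string;
}

// Returns a log function that writes one JSON object per line to `sink`.
// `now` is injected so that timestamps are predictable in tests.
function createLogger(
  component: string,
  sink: (line: string) => void,
  now: () => Date
): (level: LogLevel, message: string, requestId?: string) => void {
  return function (level: LogLevel, message: string, requestId?: string): void {
    const record: LogRecord = {
      time: now().toISOString(),
      level: level,
      component: component,
      message: message,
    };
    // The request ID is optional; it lets a central store correlate entries
    // that belong to the same request across components.
    if (requestId !== undefined) {
      record.requestId = requestId;
    }
    sink(JSON.stringify(record));
  };
}

// Usage: collect lines in memory instead of writing to stdout.
const collectedLines: string[] = [];
const fixedClock = function (): Date {
  return new Date(Date.UTC(2024, 0, 1, 12, 0, 0));
};
const writeLog = createLogger(
  "dashboard-backend",
  function (line: string): void {
    collectedLines.push(line);
  },
  fixedClock
);
writeLog("error", "upstream request failed", "req-42");
writeLog("info", "server started");
```

Because every line is self-describing JSON, a log collector can parse and index the fields without component-specific parsing rules.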
(Distributed) Traces
Distributed tracing enables profiling and easier troubleshooting within distributed systems. It provides insight into how requests propagate through a system and helps to correlate logs. The applications involved must be instrumented to emit spans.
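To make the idea of request propagation more concrete, the sketch below shows how a service could continue a trace received via a W3C Trace Context `traceparent` header: it keeps the trace ID, creates a child span with a new span ID, and formats the header for the next hop. This assumes W3C Trace Context is the propagation format, which the issue does not prescribe. All function names are hypothetical, only version `00` is handled, and `randomHex` uses `Math.random`, which is good enough for illustration but not for production ID generation.

```typescript
interface SpanContext {
  traceId: string; // 32 hex characters, shared by all spans of one request
  spanId: string; // 16 hex characters, unique per span
  parentSpanId?: string;
  sampled: boolean;
}

// Parse a version-00 traceparent header: 00-<trace-id>-<parent-id>-<flags>.
function parseTraceparent(header: string): SpanContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (match === null) {
    return null;
  }
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(match[1]) || /^0+$/.test(match[2])) {
    return null;
  }
  return {
    traceId: match[1],
    spanId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1,
  };
}

// Illustrative ID generator; not cryptographically strong.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// A child span stays in the same trace and remembers which span caused it.
function startChildSpan(parentContext: SpanContext, newSpanId: string): SpanContext {
  return {
    traceId: parentContext.traceId,
    spanId: newSpanId,
    parentSpanId: parentContext.spanId,
    sampled: parentContext.sampled,
  };
}

// Format the header that is sent along with outgoing requests.
function formatTraceparent(context: SpanContext): string {
  return "00-" + context.traceId + "-" + context.spanId + "-" + (context.sampled ? "01" : "00");
}

// Usage with a fixed span ID so the result is reproducible;
// real code would pass randomHex(16) instead.
const incomingContext = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
const childContext =
  incomingContext === null ? null : startChildSpan(incomingContext, "b7ad6b7169203331");
const outgoingHeader = childContext === null ? "" : formatTraceparent(childContext);
```

Writing the `traceId` into log entries as well is one way traces can help to correlate logs from different components.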
/close
We added a metrics endpoint to the dashboard that exports basic metrics. Additional metrics can be added easily (not part of this issue).
Adding distributed traces / logs is currently out of scope.
/cc @petersutter @grolu @holgerkoser