FWIW, keeping at least one form of the metrics available as a http-pollable prometheus-exporter url would be pretty future-proof, even if the cAdvisor machinery were to go away.
Closing. I think we implemented the core part.
Summary
We propose a system monitoring mechanism that, at the Cluster and Pod level, does not require changes to existing Che code. However, application monitoring of Che agents requires some changes:
Description
Che epics [Complementary]: Tracing - https://github.com/eclipse/che/issues/10298, #10288; Logging - https://github.com/eclipse/che/issues/10290
Background
Monitoring Che Workspace (aka WS) agents is required to anticipate problems and discover bottlenecks in a production environment. K8S monitoring can be categorized as follows:
Cluster metrics (System Monitor):
Pod metrics (System Monitor):
Application metrics (Application Monitor):
https://logz.io/blog/kubernetes-monitoring
Prometheus solution
There are many possible combinations of node and cluster-level agents that could comprise a monitoring pipeline. The most popular in K8S is Prometheus, which is part of the CNCF. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. Prometheus comes with its own dashboard, which is suitable for ad-hoc queries or quick debugging, but for the best experience it is recommended to integrate it with visualization backends such as Grafana. https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus
Prometheus Architecture
Prometheus has a cluster-level agent and a node-level agent (node exporter). The node exporter is installed as a DaemonSet that gathers machine-level metrics in addition to the metrics exposed by cAdvisor for each container. The Prometheus server is installed per cluster. It scrapes and stores time series data from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally, runs rules over this data, and generates alerts. https://prometheus.io/docs/introduction/overview/#architecture
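For illustration (and in line with the comment above about keeping an http-pollable exporter URL), a minimal sketch of a scrape target using the Prometheus Java simpleclient; the port, metric name, and use of the JVM default exports are illustrative assumptions, not part of the proposal:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

public class AgentMetricsEndpoint {
    // Hypothetical agent metric; any agent-specific metric would be registered the same way.
    static final Counter REQUESTS = Counter.build()
            .name("che_agent_requests_total")
            .help("Total requests handled by the agent.")
            .register();

    public static void main(String[] args) throws Exception {
        DefaultExports.initialize();              // standard JVM metrics (memory, GC, threads)
        HTTPServer server = new HTTPServer(9100); // serves /metrics for Prometheus to scrape
        REQUESTS.inc();                           // incremented wherever the agent handles work
    }
}
```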
Pushgateway
The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. The Pushgateway is installed per cluster. In order to expose metrics of Che agents and running applications, the application need to send HTTP POST/PUT with the metric object to the Pushgateway URL. https://github.com/prometheus/pushgateway
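A minimal sketch, assuming the Prometheus Java simpleclient is on the classpath, of how a short-lived Che agent job could push a metric; the Pushgateway address, metric name, and job name are illustrative:

```java
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class AgentMetricPush {
    public static void main(String[] args) throws Exception {
        CollectorRegistry registry = new CollectorRegistry();
        Gauge lastSuccess = Gauge.build()
                .name("che_agent_job_last_success_unixtime")
                .help("Last time the agent job completed successfully, in unixtime.")
                .register(registry);
        lastSuccess.setToCurrentTime();

        // Address is illustrative; in-cluster this would be the Pushgateway service URL.
        PushGateway pushGateway = new PushGateway("pushgateway:9091");
        pushGateway.pushAdd(registry, "che_agent_job"); // sends an HTTP POST to the Pushgateway
    }
}
```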
Application Health Checking
Application health checking is required to detect agents that are non-functioning from an application perspective (e.g. deadlocked) even though the Pod and Node are considered healthy.
External Application Health Check & Recovery
K8S addresses this problem by supporting user-implemented application health checks that are performed by the Kubelet to ensure that the application is operating correctly. K8S application health check types:
The Kubelet can react to two kinds of probes:
This can be used as an external health check for critical services. That way, a system outside of the application itself is responsible for monitoring the application and taking action to fix it.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes https://kubernetes.io/docs/tutorials/k8s201/#application-health-checking https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
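For illustration, a minimal sketch of the application side of such a probe: the agent exposes an HTTP health endpoint that a Kubelet httpGet liveness probe could poll. The port, path, and liveness flag are illustrative assumptions:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class AgentHealthEndpoint {
    // Illustrative flag; a real agent would inspect its internal state (threads, queues, etc.).
    private static volatile boolean healthy = true;

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/healthz", exchange -> {
            byte[] body = (healthy ? "ok" : "unhealthy").getBytes();
            // 200 keeps the container running; 500 makes the liveness probe fail and triggers a restart.
            exchange.sendResponseHeaders(healthy ? 200 : 500, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```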
Application Health Check Monitoring
While the Kubelet uses the health check response to restart the container or remove its IP, it does not provide a monitoring tool for the different container health checks.
Performing agent health check monitoring with requests originating from outside the Pod is not scalable and can create network load; therefore the checks should originate within the Pod.
Each agent should provide a health check command for monitoring. To perform the health checks there should be a dedicated agent (health check manager agent) that triggers the health check commands at a given interval. Each agent needs to register with the health check manager agent and configure its health check policy.
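A minimal sketch of such a health check manager agent, using plain Java scheduling; the class and method names are hypothetical, and the report step is a placeholder for whichever exposure option below is chosen:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class HealthCheckManager {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    /** Agents register their health check command and polling interval (their health check policy). */
    public void register(String agentName, Supplier<Boolean> healthCheck, long intervalSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            boolean healthy;
            try {
                healthy = healthCheck.get();
            } catch (RuntimeException e) {
                healthy = false; // a throwing check is reported as unhealthy
            }
            report(agentName, healthy);
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    /** Placeholder: plug in logging, a Pushgateway push, or another exposure mechanism here. */
    private void report(String agentName, boolean healthy) {
        System.out.printf("healthcheck agent=%s healthy=%s%n", agentName, healthy);
    }
}
```

An agent would then register during startup, e.g. manager.register("exec-agent", execAgent::isResponsive, 30), where the agent name, check, and interval are hypothetical.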
The agent manager can expose the results by one of the following:
cAdvisor solution - Since K8S 1.2 a new feature (still in Alpha) allows cAdvisor to collect custom metrics from applications running in containers, if these metrics are exposed natively in the Prometheus format. https://github.com/google/cadvisor/blob/master/docs/application_metrics.md Exposing to cAdvisor is not recommended as it is still in alpha and would add additional dependencies on other components.
Pushing Prometheus metrics is less recommended as it adds complexity by requiring the Pushgateway component.
Using logging [see #10290] for application monitoring is preferred as it is more homogeneous: it uses the existing logging system and the results can be correlated with the additional information it supplies. In this case the Pushgateway is not required; a sketch of this option follows below.
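A minimal sketch of the logging option, assuming SLF4J as the logging facade; the marker and field names are illustrative and are only meant to let the logging pipeline (#10290) recognize these records as monitoring data:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

public class HealthCheckLogReporter {
    private static final Logger LOG = LoggerFactory.getLogger(HealthCheckLogReporter.class);
    // Marker indicating that this log record is used for monitoring.
    private static final Marker MONITORING = MarkerFactory.getMarker("MONITORING");

    public static void reportHealth(String agentName, boolean healthy, long latencyMillis) {
        // key=value layout keeps the record easy to parse and correlate downstream
        LOG.info(MONITORING, "healthcheck agent={} healthy={} latency_ms={}",
                agentName, healthy, latencyMillis);
    }
}
```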
Health check agent manager
The health check agent manager can be implemented as:
The proposed solution for monitoring application health checks should also be applied to single central components (e.g. the WS Master) for a homogeneous solution.
Implementation recommendation
indicate that this log is used for monitoring.
Implementation