eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Che Monitoring #10329

Closed yarivlifchuk closed 5 years ago

yarivlifchuk commented 6 years ago

Summary

We propose a system monitoring mechanism that requires no changes to existing Che code at the Cluster and Pod level. Application monitoring of the Che agents, however, requires the following changes:

  1. Add dedicated HTTP monitoring requests (telemetry), or reuse the logs and convert them into monitoring metrics by adding a special tag to each record.
  2. Add a health check command to each agent for monitoring, and register each agent with its health check configuration policy at the agent manager.
  3. Add a health check agent manager within the Pod for monitoring.
  4. Use custom environment params that are added to the records of the Che agents for customized purposes, e.g. the user’s tenant (customer) id.
  5. Add a critical external health check command to the relevant agents that will be used by the Kubelet livenessProbe to restart the Pod. In addition, add the agent health check configuration as a livenessProbe in the Pod configuration file.

Description

Che epics [complementary]: Tracing - https://github.com/eclipse/che/issues/10298, #10288; Logging - https://github.com/eclipse/che/issues/10290

Background

Monitoring Che Workspace (aka WS) agents is required to anticipate problems and discover bottlenecks in a production environment. K8S monitoring can be categorized as follows:

Cluster metrics (System Monitor):
  1. Node resource utilization (cpu, memory, disk, network traffic, ...).
  2. Number of available nodes.
  3. Running Pods.
Pod metrics (System Monitor):
  1. K8S metrics – number of Pod instances vs. expected, in-progress deployments, health checks.
  2. Container metrics – container cpu, network and memory usage, r/w iops.
Application metrics (Application Monitor):
  1. Health check and other customized metrics.

https://logz.io/blog/kubernetes-monitoring

Prometheus solution

There are many possible combinations of node- and cluster-level agents that could comprise a monitoring pipeline. The most popular in K8S is Prometheus, which is part of the CNCF. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. Prometheus comes with its own dashboard, which is available for running ad-hoc queries or quick debugging, but for the best experience it is recommended to integrate it with a visualization backend such as Grafana. https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus

Prometheus Architecture

Prometheus has a cluster-level agent and a node-level agent (the node exporter). The node exporter is installed as a DaemonSet that gathers machine-level metrics in addition to the metrics exposed by cAdvisor for each container. The Prometheus server is installed per cluster. It scrapes and stores time series data from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally, runs rules over this data and generates alerts. https://prometheus.io/docs/introduction/overview/#architecture

Pushgateway

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. The Pushgateway is installed per cluster. In order to expose metrics of Che agents and running applications, the application needs to send an HTTP POST/PUT with the metric object to the Pushgateway URL. https://github.com/prometheus/pushgateway
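
As an illustration, here is a minimal sketch of such a push using only the JDK's HttpURLConnection; the Pushgateway address, job name and metric name below are made-up assumptions for the example, and the Prometheus Java client's PushGateway helper could be used instead:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PushMetricExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical in-cluster Pushgateway address; the job name groups the pushed metrics.
        URL url = new URL("http://pushgateway.monitoring.svc:9091/metrics/job/che_ws_agent");

        // A single gauge in the Prometheus text exposition format.
        String body = "# TYPE che_agent_healthy gauge\n"
                + "che_agent_healthy{agent=\"exec-agent\"} 1\n";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");   // PUT replaces the whole group; POST only metrics with the same name
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; version=0.0.4");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Pushgateway responded: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```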

Application Health Checking

Application health checking is required to detect agents that are non-functional from the application perspective (e.g. deadlocked) even though the Pod and Node are considered healthy.

External Application Health Check & Recovery

K8S addresses this problem by supporting user-implemented application health checks that are performed by the Kubelet to ensure that the application is operating correctly. K8S application health check types:

  1. HTTP health check – call a web hook. An HTTP status between 200 and 399 is considered a success, anything else a failure.
  2. Container Exec – execute a command inside the container. Exit status 0 is considered a success, anything else a failure.
  3. TCP Socket – open a socket to the container. If the connection is established the container is considered healthy, otherwise it is a failure.

Kubelet can react to two kinds of probes:

  1. livenessProbe – if the Kubelet discovers a failure, the container is restarted.
  2. readinessProbe – if the Kubelet discovers a failure, the Pod IP is removed from the services for a period.

The container health checks are configured in the livenessProbe/readinessProbe section of the container config.

This can be used as an external health check for critical services. That way, a system outside of the application itself is responsible for monitoring the application and taking action to fix it.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes https://kubernetes.io/docs/tutorials/k8s201/#application-health-checking https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
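
For illustration, a sketch of how such an exec-style livenessProbe could be declared programmatically with the fabric8 kubernetes-client model classes; the health check script path, image name and timing values are assumptions for the example only:

```java
import io.fabric8.kubernetes.api.model.Container;
import io.fabric8.kubernetes.api.model.ContainerBuilder;
import io.fabric8.kubernetes.api.model.Probe;
import io.fabric8.kubernetes.api.model.ProbeBuilder;

public class LivenessProbeExample {
    public static void main(String[] args) {
        // Exec-style liveness probe: the Kubelet runs the agent's (hypothetical)
        // health check command inside the container and restarts it on failure.
        Probe liveness = new ProbeBuilder()
                .withNewExec()
                    .withCommand("/bin/sh", "-c", "/home/user/che/exec-agent/healthcheck.sh")
                .endExec()
                .withInitialDelaySeconds(30)   // give the agent time to start
                .withPeriodSeconds(10)         // check every 10 seconds
                .withFailureThreshold(3)       // restart after 3 consecutive failures
                .build();

        Container agentContainer = new ContainerBuilder()
                .withName("ws-agent")
                .withImage("eclipse/che-agent:latest")   // placeholder image name
                .withLivenessProbe(liveness)
                .build();

        System.out.println(agentContainer);
    }
}
```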

Application Health Check Monitoring

While the Kubelet uses the health check response for a restart action or for removing the Pod's IP, it does not provide a monitoring tool for the different container health checks.

Performing agent health check monitoring with requests originating from outside the Pod is not scalable and can create network load; therefore the checks should originate within the Pod.

Each agent should provide a health check command for monitoring. To perform the health checks there should be a dedicated agent (the health check manager agent) that triggers the health check commands at a given interval. Each agent needs to register with the health check agent manager and configure its health check policy.
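
A rough sketch of what such a registration contract could look like; every name here (HealthCheckManager, HealthCheckPolicy, and so on) is hypothetical and only illustrates the register-with-a-policy idea, not an existing Che API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical policy each agent registers: how often to run which check command. */
class HealthCheckPolicy {
    final long intervalSeconds;
    final Supplier<Boolean> check;

    HealthCheckPolicy(long intervalSeconds, Supplier<Boolean> check) {
        this.intervalSeconds = intervalSeconds;
        this.check = check;
    }
}

/** Hypothetical in-Pod health check manager that triggers every registered check on schedule. */
class HealthCheckManager {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private final Map<String, Boolean> lastResults = new ConcurrentHashMap<>();

    /** Agents call this once at startup to register their check command and policy. */
    void register(String agentName, HealthCheckPolicy policy) {
        scheduler.scheduleAtFixedRate(
                () -> lastResults.put(agentName, policy.check.get()),
                0, policy.intervalSeconds, TimeUnit.SECONDS);
    }

    /** Latest result per agent, to be exposed as tagged logs or pushed metrics (see below). */
    Map<String, Boolean> snapshot() {
        return new ConcurrentHashMap<>(lastResults);
    }

    public static void main(String[] args) {
        HealthCheckManager manager = new HealthCheckManager();
        // Hypothetical registration: run the agent's own check every 10 seconds.
        manager.register("exec-agent", new HealthCheckPolicy(10, () -> true /* real check here */));
    }
}
```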

The agent manager can expose the results by one of the following:

  1. Expose them to the cAdvisor endpoint (still in alpha, see below).
  2. Send Prometheus metrics to the Pushgateway Pod.
  3. Send dedicated logs that will be monitored – recommended.

cAdvisor solution - since K8S 1.2, a new feature (still in alpha) allows cAdvisor to collect custom metrics from applications running in containers, provided these metrics are exposed natively in the Prometheus format. https://github.com/google/cadvisor/blob/master/docs/application_metrics.md Exposing metrics to cAdvisor is not recommended, as the feature is still in alpha and would add additional dependencies on other components.

Sending Prometheus metrics is less recommended, as it adds complexity by requiring the Pushgateway component.

Using logging [see #10290] for application monitoring is preferred and more homogeneous, as it uses the existing logging system and the metrics can be correlated with the additional information it supplies. In this case the Pushgateway is not required.
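
A minimal sketch of such a tagged, correlatable log record, using SLF4J as an example logging API; the MONITORING tag, the MDC keys and the metric fields are assumed conventions for the example, not ones Che defines today:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class MonitoringLogExample {
    private static final Logger LOG = LoggerFactory.getLogger(MonitoringLogExample.class);

    public static void main(String[] args) {
        // Custom environment params attached to every record via MDC,
        // e.g. the user's tenant (customer) id. Key names are hypothetical.
        MDC.put("tenant_id", "customer-42");
        MDC.put("workspace_id", "ws-1234");
        try {
            // The "MONITORING" tag marks records that the log pipeline should
            // convert into metrics instead of treating as plain log lines.
            LOG.info("MONITORING agent=exec-agent healthcheck=ok duration_ms=12");
        } finally {
            MDC.clear();
        }
    }
}
```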

Health check agent manager

The health check agent manager can be implemented as:

  1. An independent agent within the container.
  2. A HEALTHCHECK instruction in the Dockerfile. Docker provides a HEALTHCHECK instruction that checks the container health by running a command inside the container at a given interval.

The proposed solution for monitoring the application health checks should also be applied to single central components (e.g. the WS Master) for a homogeneous solution.

Implementation recommendation

  1. System monitoring of the K8S Cluster and Nodes based on the Prometheus stack.
  2. Application monitoring of WS agents within the container should follow:
    • Sending metrics – send the metrics by adding logs to the WS agent with a specific tag that indicates the log record is used for monitoring.
    • Custom environment params – added to the records of Che agents for customized purposes, e.g. the user’s tenant (customer) id.
    • Internal health check – each agent provides a health check command for monitoring. In addition, each agent should register with the health check agent manager with its health check configuration policy.
    • Health check agent manager – an agent within the Pod that can be implemented either as an independent agent or as a HEALTHCHECK instruction in the Dockerfile (to be further investigated).
    • External health check – the relevant agents provide a critical health check command to be used by the Kubelet livenessProbe to restart the Pod. In addition, the agent should add its health check configuration policy to the livenessProbe part of the Pod configuration file.

Implementation

fche commented 6 years ago

FWIW, keeping at least one form of the metrics available as an http-pollable prometheus-exporter url would be pretty future-proof, even if the cAdvisor machinery were to go away.

skabashnyuk commented 5 years ago

Closing. I think we implemented the core part.