
Kubernetes integration: add CPU throttling metric from cAdvisor #9307

Open inge4pres opened 5 months ago

inge4pres commented 5 months ago

The container metrics we currently collect from cAdvisor through the kubelet HTTP endpoint are listed here: https://github.com/elastic/integrations/blob/b8da31eac0835513745b90cc225e06f0d105fb21/packages/kubernetes/docs/kubelet.md?plain=1#L235-L260

The full list of metrics that cAdvisor exposes through the kubelet node monitoring endpoint is detailed here.

One key metric that is often helpful for troubleshooting performance problems caused by CPU overcommitment is container_cpu_cfs_throttled_seconds_total, a counter indicating how much time a container's processes have been throttled, i.e. parked to make room for other tasks running on the same node. I am not sure we are already collecting and exposing it.

If we aren't, we should add this metric to our monitoring and encourage its usage. It is the first indicator of resource-limit or bin-packing misconfigurations in the cluster, and it is the number one cause of performance degradation.
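For context, this throttling comes from the CFS quota that the kubelet applies when a container declares a CPU limit. A minimal, hypothetical pod spec like the one below (names and values are illustrative) is enough to make container_cpu_cfs_throttled_seconds_total grow once the container tries to use more CPU than its limit:

```yaml
# Hypothetical example: the container is capped at 200m CPU.
# Once the busy loop tries to use more than that, the CFS quota throttles it
# and container_cpu_cfs_throttled_seconds_total for this container increases.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-throttle-demo
spec:
  containers:
    - name: worker
      image: busybox
      command: ["sh", "-c", "while true; do :; done"]  # busy loop exceeding the limit
      resources:
        requests:
          cpu: "100m"
        limits:
          cpu: "200m"  # the CFS quota derived from this limit causes the throttling
```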

Other relevant metrics from the same data set that may be helpful, and that we are not collecting yet, are:

bturquet commented 5 months ago

@MichaelKatsoulis can you add this one to the weekly review to get a better idea of the effort and check whether we could add it to an upcoming iteration? This is something that could be very useful for MIS.

simitt commented 4 months ago

@MichaelKatsoulis @bturquet is there an update on when this task could be scheduled? It'd be a big win for us to have this level of insights.

MichaelKatsoulis commented 4 months ago

Hi @simitt. It is part of our April iteration; someone will be assigned to work on it soon.

StephanErb commented 4 months ago

Would it be possible to add one more super useful metric, kube_pod_status_reason? It encodes pod termination reasons like Evicted|NodeAffinity|NodeLost|Shutdown|UnexpectedAdmissionError, which are otherwise only visible via events.

I know that the metric is marked as experimental, but it has been available for several years and should be safe to use.

I can also add another ticket if that is preferred.

inge4pres commented 4 months ago

@StephanErb great suggestion 👍🏼 I believe a dedicated card would be best, thanks 🙏🏼

inge4pres commented 3 months ago

@MichaelKatsoulis I see this has been postponed. Is there a blocker, or is it a matter of prioritization?

bturquet commented 3 months ago

Hi @inge4pres, @MichaelKatsoulis is on bank holiday and will give more details about prioritization when he is back. AFAIK there is no blocker, but we have collated other similar needs in this issue:

And we want to address all these needs at once.

MichaelKatsoulis commented 3 months ago

@inge4pres This is in the backlog for 8.15. We will start working on it soon!

gizas commented 2 months ago

@inge4pres hello hello my friend !!!

We started looking into the addition of CPU throttling metrics from cAdvisor. On the Kubernetes integration side this would need a new metricset, because we don't currently make use of cAdvisor. So it is a big change and would require some additional work from us.

BUT... because the cAdvisor endpoint responds in Prometheus format, we thought we could make use of the Prometheus integration. And we proved that this can work:

1. Testing in GKE, where the certificates provided are issued by a trusted CA

Prometheus Manifest:

```yaml
- id: prometheus/metrics-prometheus-4307f02c-da80-4ab2-912e-e43f3e8a5c26
  name: prometheus-1
  revision: 3
  type: prometheus/metrics
  use_output: default
  meta:
    package:
      name: prometheus
      version: 1.15.2
  data_stream:
    namespace: default
  package_policy_id: 4307f02c-da80-4ab2-912e-e43f3e8a5c26
  streams:
    - id: >-
        prometheus/metrics-prometheus.collector-4307f02c-da80-4ab2-912e-e43f3e8a5c26
      data_stream:
        dataset: prometheus.collector
      metricsets:
        - collector
      hosts:
        - 'https://${env.NODE_NAME}:10250'
      metrics_filters.exclude: null
      metrics_filters.include:
        - container_cpu_cfs_throttled_seconds_total
        - container_processes
        - container_threads
        - container_oom_events_total
      metrics_path: /metrics/cadvisor
      period: 10s
      rate_counters: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.certificate_authorities:
        - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      use_types: true
      username: ''
      password: null
```
Update ClusterRole:

```yaml
- apiGroups:
    - ""
  resources:
    - nodes/stats
    - nodes/metrics
```
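For completeness, here is a minimal sketch of how that rule could sit in a full ClusterRole bound to the agent's service account; the resource name and verbs below are illustrative, not the manifest shipped with the integration:

```yaml
# Illustrative only: read access to the kubelet stats/metrics endpoints
# so the agent can scrape https://<node>:10250/metrics/cadvisor.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent-cadvisor  # hypothetical name
rules:
  - apiGroups:
      - ""
    resources:
      - nodes/stats
      - nodes/metrics
    verbs:
      - get
```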

Everything works:

[Screenshot: 2024-06-06 at 10:33 AM]

2. Testing in a local kind k8s cluster, where the certificates provided are issued by a non-trusted CA

Because the CA is not trusted, our connection to cAdvisor gets a 401 Unauthorized error.

So I have made a draft PR in the Prometheus collector with the changes needed here.

With these changes, cAdvisor metrics collection through the Prometheus integration also works in local clusters.
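For reference, this is the kind of stream-level override that makes self-signed kubelet certificates usable in a local test cluster, assuming the collector stream honours the standard Beats ssl.* settings; the exact change in the draft PR may differ:

```yaml
# Local test clusters only; never disable certificate verification in production.
hosts:
  - 'https://${env.NODE_NAME}:10250'
metrics_path: /metrics/cadvisor
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
ssl.verification_mode: none  # accept the kind cluster's self-signed kubelet certificate
```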

Let us know if this is an acceptable approach for you, and we can merge the integration. FYI: Prometheus integration v1.15.3 and kibana.version: "^8.12.1".