elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Provide last_terminated_reason_timestamp for when last_terminated_reason event happens in K8s Integration #3802

Closed gizas closed 2 months ago

gizas commented 10 months ago

Describe the enhancement: Provide the time at which a container's last_terminated_reason occurred in the Kubernetes integration. A possible field that could support this (from the KSM pod metrics) is kube_pod_status_container_ready_time.

Describe a specific use case for the enhancement or feature: The "kubernetes.container.status.last_terminated_reason" field is useful, especially when metrics have been missed. On its own, however, it does not show when the termination happened, because the container or pod keeps reporting this reason from its last restart onwards. This request is specifically for a timestamp of when this last_terminated_reason occurred.

What is the definition of done?

sophiec20 commented 10 months ago

I imagine that the accuracy of this value will be hard to pinpoint, but even an approx time would be great ... from looking at the data, it's useful to know if an error (such as OOMKilled) happened 7 mins or 7 hours or 7 days ago.

afharo commented 10 months ago

> Possible fields that can support that (from KSM-Pod metrics) can be kube_pod_status_container_ready_time.

I just tried using that field (https://github.com/elastic/beats/pull/37192), and unfortunately, they seem to come in different events, leading to separate Metricbeat entries :(

We need to look at any other potential fields or have Metricbeat somehow generate it for us.

It looks like there's a request in the kubernetes repo to log the event whenever there's an OOMKilled: https://github.com/kubernetes/kubernetes/issues/69676

tetianakravchenko commented 9 months ago

> I just tried using that field (https://github.com/elastic/beats/pull/37192), and unfortunately, they seem to come in different events, leading to separate Metricbeat entries :(

kube_pod_status_container_ready_time identifies the time when the readiness probe succeeded and the container became ready to accept connections, so it does not describe a termination.

Checking the existent metrics:

=> At the moment there is no metric that provides the time at which a container's last_terminated_reason occurred.

What is needed:

As I understand it, what this issue is interested in is cs.LastTerminationState.Terminated.FinishedAt, shown as "Finished" under "Last State" below:

Containers:
  kube-scheduler:
    ...
    State:          Running
      Started:      Wed, 27 Dec 2023 17:05:34 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 27 Dec 2023 13:38:18 +0100
      Finished:     Wed, 27 Dec 2023 17:05:33 +0100
    Ready:          True
    Restart Count:  1
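
For illustration, the same value can be read from the pod status JSON (for example from `kubectl get pod -o json`), where it appears as status.containerStatuses[].lastState.terminated.finishedAt. The following is only a minimal sketch: the embedded JSON is a hand-written fragment mirroring the output above (times converted to UTC), and the helper name `last_terminations` and the fixed `now` value are assumptions made for the example, not part of any Elastic or Kubernetes tooling.

```python
import json
from datetime import datetime, timezone

# Hand-written fragment for illustration; field names follow the Kubernetes
# Pod API, values mirror the `kubectl describe` output above (in UTC).
POD_JSON = """
{
  "status": {
    "containerStatuses": [
      {
        "name": "kube-scheduler",
        "restartCount": 1,
        "lastState": {
          "terminated": {
            "reason": "Error",
            "exitCode": 1,
            "startedAt": "2023-12-27T12:38:18Z",
            "finishedAt": "2023-12-27T16:05:33Z"
          }
        }
      }
    ]
  }
}
"""


def last_terminations(pod):
    """Map container name -> (reason, finished_at) for containers that
    report a previous termination. Hypothetical helper for this sketch."""
    result = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = (cs.get("lastState") or {}).get("terminated")
        if not terminated or "finishedAt" not in terminated:
            continue  # container has never been terminated
        # Kubernetes emits RFC 3339 timestamps ending in "Z"; rewrite the
        # suffix so datetime.fromisoformat accepts it on older Pythons.
        finished_at = datetime.fromisoformat(
            terminated["finishedAt"].replace("Z", "+00:00")
        )
        result[cs["name"]] = (terminated.get("reason"), finished_at)
    return result


pod = json.loads(POD_JSON)
terminations = last_terminations(pod)

# Fixed "now" so the example is reproducible; real code would use
# datetime.now(timezone.utc).
now = datetime(2023, 12, 27, 16, 12, 33, tzinfo=timezone.utc)
for name, (reason, finished_at) in terminations.items():
    print(f"{name}: {reason} at {finished_at.isoformat()} ({now - finished_at} ago)")
```

With a timestamp like this next to the reason, it becomes possible to tell whether an error happened minutes, hours, or days ago.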

tetianakravchenko commented 9 months ago

Here is a PR to report kube_pod_container_status_last_terminated_timestamp: https://github.com/kubernetes/kube-state-metrics/pull/2291
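
As a rough sketch of how a consumer might use such a metric: the snippet below parses one line in the Prometheus text exposition format and converts the value to a UTC time. The sample line is hand-written for illustration; its label set, the pod name, and the assumption that the value is a Unix timestamp in seconds (as with other kube-state-metrics time metrics) are not taken from the PR. The `parse_sample` helper is hypothetical and only handles simple lines with labels and no escaped characters.

```python
import re
from datetime import datetime, timezone

# Hand-written sample, NOT captured from kube-state-metrics. The labels and
# the value (assumed to be Unix seconds) are assumptions for illustration.
SAMPLE = (
    'kube_pod_container_status_last_terminated_timestamp'
    '{namespace="kube-system",pod="kube-scheduler-0",container="kube-scheduler"}'
    ' 1.703693133e+09'
)

# Minimal patterns for a single exposition line: name{labels} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split a simple exposition line into (name, labels, value).
    Hypothetical helper; not a full Prometheus parser."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(SAMPLE)
terminated_at = datetime.fromtimestamp(value, tz=timezone.utc)
print(f'{labels["container"]} last terminated at {terminated_at.isoformat()}')
```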

tetianakravchenko commented 5 months ago

Progress: