aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows Server.
MIT License

Memory leak in CW agent prometheus 1.247348.0b251302 #264

Open ashevtsov-wawa opened 3 years ago

ashevtsov-wawa commented 3 years ago

After upgrading the CW agent Prometheus from 1.247347.5b250583 to 1.247348.0b251302, the pod started getting killed by Kubernetes (OOMKilled). The memory limit is set to 2000m. Tried increasing the limit up to 8000m to no avail. Downgrading to 1.247347.5b250583 fixes the issue (with the 2000m limit). We run the agent in EKS 1.19.

We are experiencing this in a couple of environments, each running over 120 pods (including those of daemonsets). Environments where this is not an issue have ~30-50 pods running. The last messages in the container logs of the killed pods aren't consistent. One instance:

...
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Wrote batch of 179 metrics in 1.201294006s
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Pusher published 8 log events to group: /aws/containerinsights/redacted/prometheus stream: redacted with size 4 KB in 110.502908ms.
2021-09-02T15:49:56Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics

another instance (same cluster):

...
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:producer-1 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb node_id:node--1 pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.producer.producer-1 version:0.1.0] kafka_producer_node_response_rate kafka_producer_node_response_rate kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:producer-2 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb node_id:node--1 pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.producer.producer-2 version:0.1.0] kafka_producer_node_response_rate kafka_producer_node_response_rate kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:consumer-redacted-group-1 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.consumer.consumer-redacted-group-1 version:0.1.0] kafka_consumer_last_poll_seconds_ago kafka_consumer_last_poll_seconds_ago kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:consumer-redacted-group-3 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.consumer.consumer-redacted-group-3 version:0.1.0] kafka_consumer_last_poll_seconds_ago kafka_consumer_last_poll_seconds_ago kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}

Let me know if you need any other information/logs that will help in troubleshooting.

github-actions[bot] commented 2 years ago

This issue was marked stale due to lack of activity.

ashevtsov-wawa commented 2 years ago

Still happening with 1.247349.0b251399

jhnlsn commented 2 years ago

Hey Andrey, we believe we have a fix for this issue in our latest release. It was related to unpaginated data returned from the k8s API server.

Please keep an eye out for the 50 release, which should be coming in mid-February.
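
For context on the pagination point above, here is a generic client-go sketch of paginated pod listing. This is not the agent's actual change; the page size, in-cluster config, and required RBAC permissions are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config; assumes the pod's service account can list pods.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Request at most 500 pods per page instead of one unbounded list response.
	opts := metav1.ListOptions{Limit: 500}
	for {
		pods, err := clientset.CoreV1().Pods("").List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		fmt.Printf("fetched %d pods\n", len(pods.Items))
		if pods.Continue == "" {
			break // no more pages
		}
		// Pass the continue token back to fetch the next page.
		opts.Continue = pods.Continue
	}
}
```

Processing each page as it arrives keeps the working set bounded by the page size rather than by the total number of pods in the cluster.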

CraigHead commented 2 years ago

This is happening on the non-EKS CWAgent as well, specifically the Windows agent.

jhnlsn commented 2 years ago

@CraigHead could you describe your issue specifically, including any related errors you are seeing? The issue in this ticket was related to containers being OOM-killed in EKS.

jhnlsn commented 2 years ago

This should be resolved with the latest version of the agent

ashevtsov-wawa commented 2 years ago

Still seeing the pod being OOMKilled with a 2500Mi memory limit when using the public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.247350.0b251780 image. @jhnlsn can you re-open this issue, or should I create a new one?

ashevtsov-wawa commented 2 years ago

I removed the memory limits to see how much memory it would use. I stopped this experiment after it consumed 20GB.

khanhntd commented 2 years ago

Hey @ashevtsov-wawa, after setting up the EKS Prometheus CloudWatch agent with different versions (e.g. 1.247348, 1.247352) by following this documentation, I was not able to reproduce the memory leak in EKS based on the memory and CPU usage:

(screenshot: CloudWatch agent memory and CPU usage)

Therefore, as a next course of action, would you help me by sharing the following information:

nmamn commented 11 months ago

Hi,

Not sure if it is the same issue, but we faced a memory leak when the endpoint was unreachable. It seems the CW agent accumulates connections / does not clean everything up, and eventually gets OOM-killed after some time.

Fixing the network issue resolved our problem, but I believe it could be dealt with in the code, so that an unreachable endpoint does not end in an OOM.

thanks,

Nicolas
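
As a general illustration of the failure mode Nicolas describes (not the agent's actual code): an exporter that keeps buffering batches while its endpoint is unreachable grows without bound unless the queue is capped. A minimal Go sketch, with the queue size and batch type purely hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

type batch []string

type exporter struct {
	queue    []batch
	maxQueue int // 0 means unbounded -- the failure mode described above
}

func (e *exporter) enqueue(b batch) {
	if e.maxQueue > 0 && len(e.queue) >= e.maxQueue {
		// Bounded: drop the oldest batch instead of accumulating forever.
		e.queue = e.queue[1:]
	}
	e.queue = append(e.queue, b)
}

func (e *exporter) flush(send func(batch) error) {
	for len(e.queue) > 0 {
		if err := send(e.queue[0]); err != nil {
			return // endpoint unreachable; keep the queue and retry later
		}
		e.queue = e.queue[1:]
	}
}

func main() {
	unreachable := func(batch) error { return errors.New("connection timed out") }

	e := &exporter{maxQueue: 100} // with maxQueue: 0, memory grows on every tick
	for i := 0; i < 1000; i++ {
		e.enqueue(batch{fmt.Sprintf("metric-%d", i)})
		e.flush(unreachable)
	}
	fmt.Println("queued batches:", len(e.queue)) // stays at 100 with the cap
}
```

With an unbounded queue the process eventually hits its memory limit and is OOM-killed, which matches the behavior reported when the endpoint could not be reached.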

jefchien commented 11 months ago

@nmamn Can you provide some additional context into the issue you're seeing? Which version of the agent were you seeing this in? Were there any logs indicating that the agent was failing to reach the endpoint? It would help us debug the issue.

nar-git commented 2 weeks ago

We are facing a similar issue and reported it here. Our agent is consuming more than 50Gi (the limit) and getting OOMKilled.