DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] After 7.48.1 Version Release, Agent Memory Leak #22012

Open moo-on opened 9 months ago

moo-on commented 9 months ago

Agent Environment

Inferred to be version 7.48.1 (released 2023-10-17). image

The ecs.fargate.mem.usage metric has shown steadily growing memory since the 7.48.1 release, which makes us suspect this version.

Describe what happened:

image

RSS memory appears normal, while usage memory keeps growing like a memory leak. The increase in usage memory matches the aws.ecs.service.memory_utilization value, and CloudWatch shows ECS containers being terminated with a kill signal once the container memory is completely exhausted.
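To narrow down where the gap between usage and RSS comes from, the per-container numbers can be split into RSS, page cache, and everything else. The following is a minimal sketch, not something the agent ships. It assumes the task runs on a Fargate platform version that exposes the ECS task metadata endpoint v4 (`ECS_CONTAINER_METADATA_URI_V4`), and that the stats use cgroup v1 field names (`rss`, `cache`); on cgroup v2 hosts the field names differ. The function names `memory_breakdown` and `fetch_task_stats` are made up for illustration.

```python
import json
import os
import urllib.request


def memory_breakdown(task_stats):
    """Split each container's memory usage into rss, cache, and the remainder.

    `task_stats` is the parsed JSON from the task stats endpoint: a mapping of
    container ID to a Docker-stats-style object. Field names assume cgroup v1.
    """
    result = {}
    for container_id, stats in task_stats.items():
        if not stats:  # containers without stats yet may be reported as null
            continue
        mem = stats.get("memory_stats", {})
        detail = mem.get("stats", {})
        usage = mem.get("usage", 0)
        rss = detail.get("rss", 0)
        cache = detail.get("cache", 0)
        name = stats.get("name", container_id)
        result[name] = {
            "usage": usage,
            "rss": rss,
            "cache": cache,
            "other": usage - rss - cache,  # e.g. kernel memory charged to the cgroup
        }
    return result


def fetch_task_stats():
    """Fetch stats for all containers in the task (only works inside an ECS task)."""
    base = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(base + "/task/stats", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__" and "ECS_CONTAINER_METADATA_URI_V4" in os.environ:
    for name, parts in memory_breakdown(fetch_task_stats()).items():
        print(name, parts)
```

If `cache` accounts for most of the growth while `rss` stays flat, the increase is likely file-backed page cache rather than heap held by the agent process, which would be consistent with the graphs above.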

Describe what you expected:

image

Usage memory should stay flat like RSS memory, as it did before October 18th, instead of gradually increasing.

Steps to reproduce the issue: Memory appears to grow gradually over the container's uptime rather than in response to specific actions. Our environment runs in Fargate with a sidecar pattern, so the growth seems to be isolated to the Datadog Agent container.

Additional environment details (Operating System, Cloud provider, etc): AWS infrastructure: ECS Fargate. Docker image in use on Fargate: the issue persists regardless of the specific image (we use various images, for example eclipse-temurin:11-jre-centos7), together with AWS FireLens.
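Since the growth is gradual rather than tied to an action, one way to quantify it is to fit a straight line through memory samples exported from the metric and estimate how long a container has before it hits its limit. This is a rough sketch under the assumption that growth is roughly linear; `growth_per_hour` and `hours_until_limit` are hypothetical helper names, and the samples would come from wherever you export ecs.fargate.mem.usage.

```python
def growth_per_hour(samples):
    """Least-squares slope of (timestamp_seconds, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 3600


def hours_until_limit(samples, limit_bytes):
    """Hours until the fitted growth reaches limit_bytes, or None if not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    latest = max(samples)[1]  # memory value at the most recent timestamp
    return (limit_bytes - latest) / slope
```

A slope close to zero for RSS alongside a clearly positive slope for usage over the same window would match the behavior described in this report.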

dgdelahera commented 6 months ago

We are experiencing the same problem in AWS ECS Fargate

image

BonusLord commented 5 months ago

We are also having stability issues with our Fargate service due to what looks like a memory leak in the datadog sidecar container, which seemed to suddenly appear after a recent redeploy pulled in a newer version of the DD agent container.

Here's what the per-container memory usage looks like after the service has been running for a week or so: image

jessgoldq4 commented 2 weeks ago

Does anyone know of a version of the datadog-agent that is confirmed to not have this issue? It's reproducible on the latest (v7.57.2) cc @moo-on

jmayday commented 2 weeks ago

I also observed this leak (after upgrading the DD agent from 7.47.1 to 7.57.2). I updated the agent image on 20th Sep, and the uptime is now more than 2 weeks. Since then, resource utilization has only been getting higher. Image

Exec'ing into the container and running the top command also shows suspiciously high resource usage: Image