Open moo-on opened 9 months ago
We are experiencing the same problem in AWS ECS Fargate.
We are also having stability issues with our Fargate service due to what looks like a memory leak in the datadog sidecar container, which seemed to suddenly appear after a recent redeploy pulled in a newer version of the DD agent container.
Here's what the per-container memory usage looks like after the service has been running for a week or so:
Does anyone know of a version of the datadog-agent that is confirmed to not have this issue? It's reproducible on the latest (v7.57.2) cc @moo-on
I also observed a memory leak after upgrading the DD agent from 7.47.1 to 7.57.2. I updated the agent image on 20th Sep, and the task has now been up for more than 2 weeks. Resource utilization has only climbed since then.
Exec'ing into the container and running the top command also shows suspiciously high resource usage.
Agent Environment
Inferred to be version 7.48.1 (released 2023-10-17).
The ecs.fargate.mem.usage metric has shown steadily growing memory since the 7.48.1 release, which is why this version is suspected.
Describe what happened:
RSS memory appears normal while usage memory keeps growing. The increase in usage memory matches the aws.ecs.service.memory_utilization value, and CloudWatch shows ECS containers being terminated with a kill signal once the container memory is completely exhausted.
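To make the RSS-versus-usage gap concrete, here is a minimal sketch that splits a memory stats record into RSS, cache, and whatever is left over. It assumes Docker-style field names as reported under cgroup v1 (usage, stats.rss, stats.cache); those names, the memory_breakdown helper, and the sample numbers are illustrative assumptions, not values taken from this report.

```python
def memory_breakdown(memory_stats: dict) -> dict:
    """Split a Docker-style memory_stats dict into usage, RSS, cache, and the rest.

    Assumes cgroup v1 style keys ("usage", "stats.rss", "stats.cache").
    Missing keys are treated as 0 so the function never raises on partial data.
    """
    usage = memory_stats.get("usage", 0)
    detail = memory_stats.get("stats", {})
    rss = detail.get("rss", 0)
    cache = detail.get("cache", 0)
    return {
        "usage": usage,
        "rss": rss,
        "cache": cache,
        # Memory counted in usage that is neither RSS nor page cache.
        "unaccounted": usage - rss - cache,
    }


# Hypothetical sample: usage far above RSS, mostly explained by cache.
sample = {"usage": 900_000_000, "stats": {"rss": 150_000_000, "cache": 700_000_000}}
print(memory_breakdown(sample))
```

If most of the gap turns out to be cache, the growth may be reclaimable page cache rather than a true leak in the agent process; if it is unaccounted, that points elsewhere. Either reading would need to be confirmed against real stats from the affected task.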
Describe what you expected:
The usage memory should stay flat like RSS memory, as it did before October 18th, rather than gradually increasing.
Steps to reproduce the issue:
The memory grows gradually over the time the task is running rather than in response to any specific action. We run in Fargate with a sidecar pattern, and the growth appears to be isolated to the Datadog Agent container.
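Because the growth is gradual rather than tied to an action, one way to quantify it is to fit a straight line to periodic usage samples and estimate how long until the container limit is reached. The sketch below is only an illustration: the hours_until_limit helper, the sample values, and the 512 MiB limit are hypothetical, and a linear trend is an assumption that may not hold for a real leak.

```python
def hours_until_limit(samples, limit):
    """Estimate hours until usage reaches `limit`, assuming linear growth.

    samples: list of (hours_since_start, usage) pairs, same unit as `limit`.
    Returns None when there is too little data or usage is not increasing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope: usage units gained per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    # Extrapolate from the most recent sample using the fitted slope.
    _, last_u = samples[-1]
    return (limit - last_u) / slope


# Hypothetical daily samples in MiB against a hypothetical 512 MiB limit.
samples = [(0, 100), (24, 148), (48, 196)]
print(hours_until_limit(samples, 512))
```

A flat series returns None, which matches the expected behaviour described above for RSS memory.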
Additional environment details (Operating System, Cloud provider, etc):
AWS infrastructure: ECS Fargate with AWS FireLens. Fargate Docker image in use: the issue persists regardless of the specific image used (various images, for example eclipse-temurin:11-jre-centos7).