DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] AWS ECS - Increased memory usage over time #29319

Closed: antempus closed this issue 1 month ago

antempus commented 1 month ago

Agent Environment

Working digest: sha256:ca9e192f56e3d67fe7b34702353af02f2fcb277474bd3bb9fd3ecd8eab4f15d3
Presumptive defective digest: sha256:8fe7762c67af41e2fc27bf2aefbcb157fdc1e792092cf98988c3472c035e88a1

Describe what happened:

Recent AWS-driven redeployments of ECS Fargate tasks that use a datadog/agent sidecar container with the latest Docker image (digests above) have been slowly increasing overall memory usage to the point where the entire task is forced to restart. It is most apparent on tasks with an overall memory size of 0.5 GB and 0.25 vCPU, but it is also observed on other services, though more slowly, up to our current upper bound of 2 GB. We do have occurrences of tasks with different datadog/agent container versions running in the same service, so we are at least able to narrow down which versions of the container are not causing the task as a whole to cycle.

Utilization visuals

The inflection point in memory usage for each is around the time of the last deployment with the latest tag pointing to the presumptive defective digest.

2 GB - ARM64: [screenshot]

0.5 GB - AMD64: [screenshot]

Describe what you expected: Tasks not recycling.

Steps to reproduce the issue: pending.

Additional environment details (Operating System, Cloud provider, etc.): AWS ECS - Fargate; architectures: linux/x86_64 & linux/arm64. The other components of the task have pinned versions and have not changed between re/deployments.
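For reference, a minimal sketch of how the sidecar can be pinned by digest in the ECS task definition, using the working digest quoted above (the container name, environment entries, and remaining fields here are illustrative placeholders, not our exact configuration):

```json
{
  "containerDefinitions": [
    {
      "name": "datadog-agent",
      "image": "datadog/agent@sha256:ca9e192f56e3d67fe7b34702353af02f2fcb277474bd3bb9fd3ecd8eab4f15d3",
      "essential": true,
      "environment": [
        { "name": "DD_API_KEY", "value": "<redacted>" },
        { "name": "ECS_FARGATE", "value": "true" }
      ]
    }
  ]
}
```

Pinning the image by digest rather than by the latest tag keeps AWS-driven redeployments from silently pulling a newer agent build.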

coop-coop commented 1 month ago

We have experienced the same thing. To work around it, we pinned our version to an older release.

[screenshot]

sgnn7 commented 1 month ago

Hi @antempus / @coop-coop / @pkat,

Thank you for reporting this issue! Since additional internal Agent debugging data would be required for us to investigate, do you each mind opening a support ticket for this? Support staff should be able to help investigate this further and provide us with a better way to receive that additional info.

PS: Feel free to link this issue in the request as well.

naomichi-y commented 1 month ago

We have confirmed that the same issue occurs across multiple services in our environment. It appears the problem started with the version released on September 9 or later.

The version where the memory increase has been observed is 7.57.0.

cat /opt/datadog-agent/run/version-history.json
{"entries":[{"version":"7.57.0","timestamp":"2024-09-12T07:02:40.291356833Z","install_method":{"tool":"docker","tool_version":"docker","installer_version":"docker"}}]}

The issue does not occur in 7.56.2.

cat /opt/datadog-agent/run/version-history.json
{"entries":[{"version":"7.56.2","timestamp":"2024-09-09T00:16:05.133238601Z","install_method":{"tool":"docker","tool_version":"docker","installer_version":"docker"}}]}
[screenshot]

I have already sent this issue to support.

schmalzs commented 1 month ago

We are also seeing this issue in ECS Fargate across multiple apps. Timing also coincides with the Sep 9 release of the latest docker image.

[screenshot]

ErasmusJW commented 1 month ago

Similar issue experienced here. [screenshot]

kareemshahin commented 1 month ago

We also experienced the same thing with ECS Fargate, running the agent as a "sidecar" container in the task. We saw consistent errors where some traces would fail to be forwarded to the Datadog agent:

failed to send traces to Datadog Agent at http://localhost:8126/v0.4/traces

I exec'd into the dd agent container to inspect the logs and saw a handful of timeouts in /var/log/datadog/agent.log as well, if it helps.

2024-09-12 21:18:20 UTC | CORE | ERROR | (pkg/config/remote/service/service.go:469 in pollOrgStatus) | [Remote Config] Could not refresh Remote Config: failed to issue org data request: Get "https://config.datadoghq.com/api/v0.1/status": net/http: TLS handshake timeout

2024-09-12 21:19:42 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/worker.go:191 in process) | Error while processing transaction: error while sending transaction, rescheduling it: Post "https://7-57-0-app.agent.datadoghq.com/api/v2/series": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

We also decided to pin to the previous version (v7.56.2). You can see the difference in resource usage around the time of deployment (17:30 PT):

[screenshot]

sgnn7 commented 1 month ago

Hi everyone,

We're aware of the increased volume of issues related to ECS Fargate deployments being reported here, and we're actively investigating. We reverted the latest tag for the Agent image to point to 7.56.2 to reduce the impact on further deployments using that tag. If your deployment depends on specific tags of the Agent image, the recommendation for now is to pin version 7.56.2. If you use the latest tag, please ensure that it points to the same hash as 7.56.2 and redeploy your service(s). Once we ship the fix release, we will update the latest tag again and update this issue with the fix version.
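As a rough illustration only (the container name and surrounding fields are placeholders to adapt to your own task definition), pinning the tag in the ECS container definition looks like:

```json
{
  "name": "datadog-agent",
  "image": "datadog/agent:7.56.2",
  "essential": true
}
```

If you pull both tags locally, `docker images --digests datadog/agent` should show latest and 7.56.2 resolving to the same sha256 digest once the revert has propagated.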

FlorentClarret commented 1 month ago

Hello everyone,

We just released Agent 7.57.1 with a fix for this issue. We also updated the latest tag to the fix version.

ajaydeopa commented 1 month ago

We are still getting the issue with agent v7.57.1. We are running it in an ECS Fargate task with 2 vCPU and 4 GB RAM, with 0.1 vCPU reserved and a hard limit of 256 MB RAM for the datadog-agent container. Tasks are getting killed because of OOM. The average CPU and memory consumption for the killed tasks is around 60% and 80%, respectively. We checked Datadog metrics; there was a spike in CPU and memory usage when the container started. [cpu and memory screenshots]
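For context, a sketch of roughly what our agent container definition looks like (the image tag and limits reflect the numbers above; the cpu value is approximate because ECS cpu units are 1/1024 of a vCPU, and other fields are omitted):

```json
{
  "name": "datadog-agent",
  "image": "datadog/agent:7.57.1",
  "cpu": 102,
  "memory": 256
}
```

In the task definition, the container-level memory field is a hard limit in MiB, and exceeding it is what triggers the OOM kill.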

We are not facing any issue when running tasks with an older version (v7.54.1).

sgnn7 commented 1 month ago

Hi @ajaydeopa,

The graphs you posted do not appear to correlate with the originally reported issue's symptoms, where CPU and RSS increase linearly over long periods of time, so the two are likely not related. I think it would be best to contact support for your specific problem unless there are other similar reports from this group.

sgnn7 commented 1 month ago

Since the fix for the reported problem has been released, we will close this issue, but feel free to comment in the thread and/or reopen if the problem resurfaces.

Thank you all for your feedback and the extremely valuable, detailed reports!

antempus commented 1 month ago

@sgnn7 thanks for the updates