Closed: antempus closed this issue 1 month ago.
We have experienced the same thing. To solve it, we pegged our version to an older release.
Hi @antempus / @coop-coop / @pkat, thank you for reporting this issue! Since additional internal Agent debugging data would be required for us to investigate, would you each mind opening a support ticket for this? Support staff should be able to help investigate this further and give us a better way to receive that additional info.
PS: Feel free to link this issue in the request as well.
We have confirmed that the same issue occurs across multiple services in our environment. The problem appears to have started with the version released on or after September 9.
The version where the memory increase has been observed is 7.57.0.
cat /opt/datadog-agent/run/version-history.json
{"entries":[{"version":"7.57.0","timestamp":"2024-09-12T07:02:40.291356833Z","install_method":{"tool":"docker","tool_version":"docker","installer_version":"docker"}}]}
The issue does not occur in 7.56.2.
cat /opt/datadog-agent/run/version-history.json
{"entries":[{"version":"7.56.2","timestamp":"2024-09-09T00:16:05.133238601Z","install_method":{"tool":"docker","tool_version":"docker","installer_version":"docker"}}]}
I have already sent this issue to support.
We are also seeing this issue in ECS Fargate across multiple apps. The timing also coincides with the Sep 9 release of the `latest` Docker image.
Similar issue experienced.
Also experienced the same thing w/ ECS Fargate, running the agent as a "sidecar" container on the task. We were experiencing consistent errors where some traces would fail to be forwarded to the Datadog agent:
failed to send traces to Datadog Agent at http://localhost:8126/v0.4/traces
Exec'd into the dd agent container to inspect the logs and saw a handful of timeouts as well in /var/log/datadog/agent.log, if it helps:
2024-09-12 21:18:20 UTC | CORE | ERROR | (pkg/config/remote/service/service.go:469 in pollOrgStatus) | [Remote Config] Could not refresh Remote Config: failed to issue org data request: Get "https://config.datadoghq.com/api/v0.1/status": net/http: TLS handshake timeout
2024-09-12 21:19:42 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/worker.go:191 in process) | Error while processing transaction: error while sending transaction, rescheduling it: Post "https://7-57-0-app.agent.datadoghq.com/api/v2/series": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
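(For anyone wanting to do the same inspection: a minimal sketch of opening a shell in the agent sidecar on Fargate via ECS Exec. Cluster, task, and container names below are placeholders, and ECS Exec must be enabled on the task definition.)

```sh
# Minimal sketch (placeholder names): open an interactive shell in the agent sidecar via ECS Exec.
# Requires ECS Exec to be enabled on the task; the shell path may differ per image.
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container datadog-agent \
  --interactive \
  --command "/bin/bash"

# Inside the container, inspect the agent log:
tail -n 200 /var/log/datadog/agent.log
```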
We also decided to pin the previous version (v7.56.2). You can see the difference in resource usage around the time of deployment (17:30 PT):
Hi everyone,
We're aware of the increased volume of issues related to ECS Fargate deployments you are reporting here and we're actively investigating it. We reverted the `latest` tag for the Agent image to point to 7.56.2 to reduce the impact on further deployments using that tag. If your deployment depends on specific tags of the Agent image, the recommendation for now is to pin version 7.56.2. If you use the `latest` tag, please ensure that it points to the same hash as 7.56.2 and redeploy your service(s). Once we deploy the fix release, we will update the `latest` tag again and update this issue with the fix version.
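As an illustration, one way to check locally which digest each tag resolves to (a rough sketch; assumes the Docker Hub datadog/agent repository and a local Docker install, so adjust for mirrors or private registries):

```sh
# Rough sketch: pull both tags and compare the digests they resolve to.
# Assumes the Docker Hub repository datadog/agent.
docker pull datadog/agent:7.56.2
docker pull datadog/agent:latest
docker inspect --format '{{index .RepoDigests 0}}' datadog/agent:7.56.2
docker inspect --format '{{index .RepoDigests 0}}' datadog/agent:latest
```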
Hello everyone,
We just released Agent 7.57.1 with a fix for this issue. We also updated the `latest` tag to the fix version.
We are still getting the issue with agent v7.57.1. We are running it in an ECS Fargate task with 2 vCPU and 4 GB RAM. We have reserved 0.1 vCPU and set a hard limit of 256 MB of RAM for the datadog-agent container. Tasks are getting killed because of OOM; the average CPU and memory consumption for the killed tasks is around 60% and 80%. We checked Datadog metrics, and there was a spike in CPU and memory usage when the container started.
We are not facing any issues when running tasks with an older version (v7.54.1).
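(For context, a minimal sketch of the kind of sidecar container definition described above. All values are illustrative placeholders rather than a recommendation; in ECS, 102 CPU units is roughly 0.1 vCPU and "memory" is the hard limit in MiB.)

```sh
# Illustrative fragment only (placeholder values), roughly matching the sizing described above.
# 102 CPU units ~= 0.1 vCPU; "memory" is the 256 MiB hard limit; <api-key> is a placeholder.
cat <<'EOF' > datadog-agent-container.json
{
  "name": "datadog-agent",
  "image": "datadog/agent:7.57.1",
  "cpu": 102,
  "memory": 256,
  "environment": [
    { "name": "ECS_FARGATE", "value": "true" },
    { "name": "DD_API_KEY", "value": "<api-key>" }
  ]
}
EOF
```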
Hi @ajaydeopa, the graphs you posted do not appear to correlate with the originally reported issue's symptoms, where CPU and RSS increase linearly over long periods of time, so the two are likely not related. I think it would be best to contact support for your specific problem unless there are other similar reports from this group.
Since the reported problem's fix has been released, we will close this issue but feel free to comment in the thread and/or reopen if the problem resurfaces.
Thank you all for your feedback and extremely valuable, detailed reports!
@sgnn7 thanks for the updates
Agent Environment
working digest - sha256:ca9e192f56e3d67fe7b34702353af02f2fcb277474bd3bb9fd3ecd8eab4f15d3
presumptive defective digest - sha256:8fe7762c67af41e2fc27bf2aefbcb157fdc1e792092cf98988c3472c035e88a1
Describe what happened:
Recent AWS-driven redeployments of ECS Fargate tasks that use a `datadog/agent` sidecar container with the `latest` Docker image (digest above) have been slowly increasing overall memory usage to the point that the entire task is forced to restart. It's most apparent on tasks sized at 0.5 GB memory and .25 vCPU, but it is observed on other services as well, though more slowly, up to our current upper bound of 2 GB. We do have occurrences of tasks with different `datadog/agent` containers running in the same service, so we are at least able to narrow down versions of the container that are not causing the task as a whole to cycle.
Utilization visuals
The inflection point on memory for each is about the time the last deployment occurred with the `latest` tag for the presumptive digest.
2GB - ARM64 (graph)
.5GB - AMD64 (graph)
Describe what you expected: Tasks not recycling.
Steps to reproduce the issue: pending; `latest` as a sidecar with a low memory configuration.
Additional environment details (Operating System, Cloud provider, etc): AWS ECS - Fargate; architectures `linux/X86_64` & `linux/arm64`.
Our components of the task have pinned versions and have not changed between re/deployments.
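(For reference, a minimal sketch of pinning the sidecar to the known-good digest listed above instead of a mutable tag. The datadog/agent repository name is assumed; adjust for your registry.)

```sh
# Minimal sketch: reference the agent image by the working digest above instead of a mutable tag.
# Repository name datadog/agent is assumed; adjust for your registry.
docker pull datadog/agent@sha256:ca9e192f56e3d67fe7b34702353af02f2fcb277474bd3bb9fd3ecd8eab4f15d3

# The same reference can be used as the "image" value in the ECS container definition, e.g.:
#   "image": "datadog/agent@sha256:ca9e192f56e3d67fe7b34702353af02f2fcb277474bd3bb9fd3ecd8eab4f15d3"
```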