strowk opened this issue 3 years ago
I'm getting this in production: the GET constantly times out, but it works on another cluster.
Bumping this; hitting the same issue.
Since it's not mentioned here: the value can be adjusted by setting DD_ECS_METADATA_TIMEOUT (in milliseconds).
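For example, in an ECS task definition the variable can be set on the agent container. This is a sketch, not a recommendation; the value 1000 (1 second) is purely illustrative:

```json
{
  "name": "datadog-agent",
  "environment": [
    { "name": "DD_ECS_METADATA_TIMEOUT", "value": "1000" }
  ]
}
```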
And another log entry to make this easier to find:
error pulling from collector "ecs_fargate": Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: i/o timeout (Client.Timeout exceeded while awaiting headers)
Adding to the thread...
We ran into this problem in us-east-1, while identical deployments in 4 other regions are unaffected. When it happens it blocks metrics delivery, indicating the metadata API slows down at times.
I am not sure why this blocks any metrics from being sent, as host tags should already be known to the agent, and the problems start outside of any deployment activity.
We're going to tweak the timeout, but increasing the default to 1 s or 2 s would make sense, considering this blocks all metrics.
Any idea what the fix might be for this? Running into exact same thing.
yup, same here...
Now the same thing is happening in our environment. Is there a better way to handle this other than extending the health-check timeout of the datadog-agent?
Describe what happened:
We keep getting messages such as the `i/o timeout` error quoted earlier in this thread.
Describe what you expected: The timeout should be set to a realistic value that is sufficient most of the time (p99?).
Steps to reproduce the issue: Run the datadog-agent on AWS ECS (Fargate) for several months; eventually you will see these errors in the logs.
Additional environment details (Operating System, Cloud provider, etc): AWS ECS Fargate
A request to increase the timeout to 5 seconds instead of the default 0.5 seconds was already made in https://github.com/DataDog/datadog-agent/issues/6758
I am aware that the timeout is configurable, but since we are not doing anything unusual in our deployment that would make the metadata endpoint behave differently, I believe this problem is likely to affect any ECS (Fargate) deployment. The errors clutter our logs and produce useless alerts, in addition to causing unnecessary load on the metadata endpoint through retries.
I am questioning the current default: was it chosen based on any knowledge of AWS internals, or at least on tests? My guess is that the number was picked arbitrarily (correct me if I am wrong). Our experience, as well as that of the author of https://github.com/DataDog/datadog-agent/issues/6758, shows that 0.5 seconds is not enough, so perhaps it would be better for everyone to increase it?