DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Increase default timeout for ECS metadata request #9137

Open strowk opened 3 years ago

strowk commented 3 years ago

Describe what happened:

We keep getting messages such as

Cannot list containers via ecs_fargate: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

and

failed to get task metadata, not refreshing services - Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

and

Unable to collect configurations from provider ecs: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Describe what you expected: The timeout should be set to a realistic value, one that is sufficient most of the time (e.g. the p99 latency).

Steps to reproduce the issue: Run the datadog-agent on AWS ECS (Fargate) for several months. Eventually you will see these errors in the logs.

Additional environment details (Operating System, Cloud provider, etc): AWS ECS Fargate

A request to increase the timeout to 5 seconds instead of the default 0.5 seconds was already made in https://github.com/DataDog/datadog-agent/issues/6758

I am aware of the possibility to configure that timeout, but since we are not doing anything in our deployment that would make the metadata endpoint behave any differently, I believe this problem is likely to affect any ECS (Fargate) deployment. The errors clutter our logs and produce useless alerts, in addition to causing unnecessary load on the metadata endpoint through retries.

I am questioning the current default and whether it was chosen based on any knowledge of AWS internals, or at least on tests; my guess is that the number was picked arbitrarily (correct me if I am wrong). Our experience, as well as that of the author of https://github.com/DataDog/datadog-agent/issues/6758, shows that 0.5 seconds is not enough, so maybe it would be better for everyone to increase it?
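To illustrate where these errors come from, here is a minimal Go sketch (not the agent's actual code, just an illustration) of a request to the task metadata endpoint with the 0.5-second client timeout discussed here; a slow response surfaces as exactly the error quoted above:

```go
// Minimal sketch (not the agent's code): query the Fargate task metadata
// endpoint with a 0.5 s client timeout, the default discussed in this issue.
// When the endpoint responds slowly, the printed error matches the log lines above.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 500 * time.Millisecond}

	resp, err := client.Get("http://169.254.170.2/v2/metadata")
	if err != nil {
		// e.g. Get "http://169.254.170.2/v2/metadata": context deadline exceeded
		//      (Client.Timeout exceeded while awaiting headers)
		fmt.Println("cannot fetch task metadata:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```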

Igosuki commented 10 months ago

I'm getting this in production: the GET constantly times out, but it works on another cluster.

viraptor commented 9 months ago

Bumping this. Hitting the same issue.

viraptor commented 9 months ago

Since it's not mentioned here: the value can be adjusted by setting DD_ECS_METADATA_TIMEOUT (in milliseconds).
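As a rough sketch of what that setting controls (assuming the value is parsed as milliseconds for the metadata HTTP client, per the point above; this is not the agent's actual implementation):

```go
// Hedged sketch of the mechanism, not the agent's implementation:
// DD_ECS_METADATA_TIMEOUT (milliseconds) overrides the assumed 500 ms default
// used for requests to the ECS task metadata endpoint.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"time"
)

func metadataClient() *http.Client {
	timeoutMs := 500 // assumed default of 0.5 s discussed in this issue
	if v := os.Getenv("DD_ECS_METADATA_TIMEOUT"); v != "" {
		if ms, err := strconv.Atoi(v); err == nil && ms > 0 {
			timeoutMs = ms
		}
	}
	return &http.Client{Timeout: time.Duration(timeoutMs) * time.Millisecond}
}

func main() {
	fmt.Println("ECS metadata request timeout:", metadataClient().Timeout)
}
```

On Fargate, the variable would typically be set in the agent container's environment in the ECS task definition.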

And another log entry to make this easier to find:

error pulling from collector "ecs_fargate": Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: i/o timeout (Client.Timeout exceeded while awaiting headers)

Unibozu commented 9 months ago

Adding to the thread...

We ran into this problem in us-east-1, while identical deployments in 4 other regions were unaffected. It blocks metrics delivery when it happens, which suggests the ECS task metadata endpoint slows down at times.

I am not sure why this blocks any metrics from being sent, since the host tags should already be known to the agent, and the problems start outside of any deployment activity.


We're going to tweak the timeout, but increasing the default value to 1s or 2s would make sense considering this blocks all metrics.

Akamad007 commented 5 months ago

Any idea what the fix might be for this? Running into the exact same thing.

koen-venly commented 4 months ago

yup, same here...

sho-he commented 2 weeks ago

The same thing is now happening in our environment. Is there a better way to handle this other than extending the datadog-agent health check timeout?