aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

ecs-agent - Unable to discover poll endpoint #2169

Closed floriangrundig closed 5 years ago

floriangrundig commented 5 years ago

Hi all,

Summary

The ECS agent logs show the error 'unable to discover poll endpoint', and, probably as a consequence, the container instance doesn't start or update ECS tasks:

[ERROR] tcs: unable to discover poll endpoint: RequestError: send request failed
    caused by: Post https://ecs.eu-west-1.amazonaws.com/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
[INFO] Error from tcs; backing off: RequestError: send request failed
    caused by: Post https://ecs.eu-west-1.amazonaws.com/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Description

Seems to be similar to #676.

Agent version 1.29.1, Docker version 18.06.1-ce

When running

aws --region eu-west-1 ecs discover-poll-endpoint --cluster <your-cluster> --container-instance <container-instance-arn>

we get the following output

{
    "endpoint": "https://ecs-a-4.eu-west-1.amazonaws.com/",
    "telemetryEndpoint": "https://ecs-t-4.eu-west-1.amazonaws.com/"
}

The cluster consists of three container instances. A freshly registered container instance (we terminated an old one and autoscaling created a replacement) doesn't have this issue ...
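Since the CLI call succeeds from elsewhere while the agent times out, it may help to check on the affected instance itself whether the failure is in name resolution or in opening the TCP connection. The following is only a minimal sketch, not part of the agent: the `probe` helper and its `resolve`/`connect` parameters are hypothetical names introduced here, and the hostnames are the ones from the log lines and CLI output above.

```python
import socket

# Hostnames copied from the agent log and the discover-poll-endpoint output.
ENDPOINTS = [
    "ecs.eu-west-1.amazonaws.com",
    "ecs-a-4.eu-west-1.amazonaws.com",
    "ecs-t-4.eu-west-1.amazonaws.com",
]


def probe(host, port=443, timeout=5.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return 'dns-failure', 'connect-failure', or 'ok' for one host.

    `resolve` and `connect` are injectable only to make the helper testable;
    by default the system resolver and a plain TCP connection are used.
    """
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "dns-failure"
    # Take the first resolved address; keep only (ip, port) so that
    # IPv6 4-tuples are also accepted by create_connection.
    address = infos[0][4][:2]
    try:
        conn = connect(address, timeout=timeout)
    except OSError:  # covers timeouts and refused connections
        return "connect-failure"
    conn.close()
    return "ok"
```

Calling `probe(host)` for each entry in `ENDPOINTS` on a failing instance and on the freshly registered one would show whether the two differ at the DNS step or at the connection step. Note that `timeout` only bounds the TCP connection attempt; how long a DNS lookup may take is governed by the system resolver configuration.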

yumex93 commented 5 years ago

@floriangrundig In the agent log, do the error messages show connection problems for both ACS and TCS, or only for TCS?

pritam620 commented 5 years ago

@yumex93 👇

2019-08-19T07:24:31Z [INFO] Done waiting; reconnecting to ACS
2019-08-19T07:24:51Z [ERROR] acs: unable to discover poll endpoint, err: RequestError: send request failed
caused by: Post https://ecs.eu-west-1.amazonaws.com/: dial tcp: lookup ecs.eu-west-1.amazonaws.com on 10.25.6.35:53: read udp 10.102.22.188:56563->10.25.6.35:53: i/o timeout
2019-08-19T07:24:51Z [INFO] Reconnecting to ACS in: 2m15.692563724s

floriangrundig commented 5 years ago

As my colleague's log excerpt shows, both ACS and TCS are affected.
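To answer the ACS-versus-TCS question over a whole agent log rather than a few pasted lines, the errors can be tallied per component. This is a rough sketch that assumes the log lines look like the excerpts in this thread; the `affected_components` function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   [ERROR] tcs: unable to discover poll endpoint: RequestError: ...
#   2019-08-19T07:24:51Z [ERROR] acs: unable to discover poll endpoint, err: ...
PATTERN = re.compile(r"\[ERROR\] (acs|tcs): unable to discover poll endpoint")


def affected_components(lines):
    """Count 'unable to discover poll endpoint' errors per component."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to `affected_components` returns a counter keyed by `acs` and `tcs`; a component that never logged the error is simply absent from the result.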

ubhattacharjya commented 5 years ago

Hi @floriangrundig,

Can you enable debug logging for the Amazon ECS agent and send the debug logs to utsa@amazon.com?

adnxn commented 5 years ago

Closing this issue since we haven't received debug-level logs. If you're able to reproduce the problem with debug logging enabled, please package the logs using the log collector and send them to ecs-agent-external at amazon dot com, and feel free to reopen the issue then.