aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] ECS Agent Fails To Register Container Instance - AMZ-Linux-2 - IPv6-only #1477

Open tbordovsky opened 3 years ago

tbordovsky commented 3 years ago

Summary

The ecs-agent on my container instance can't register with ECS because it can't connect over IPv6. I believe this is because the ECS endpoint doesn't support IPv6.

Description

I'm running a dual-stack setup in my private subnet, with private IPv4 addresses and public IPv6 addresses behind an egress-only internet gateway. When the ecs-agent starts, it attempts to register with the ECS service, but it can't connect, so it eventually fails and my containers never start.

On the other hand, if I put the instance in a public subnet it works fine. I assume this is because it can communicate with the ECS endpoint over a public IPv4 address.

Expected Behavior

The ecs-agent should be able to register with an ECS service over an IPv6 connection.

Observed Behavior

The ecs-agent cannot register with an ECS service over an IPv6 connection.

Environment Details

Some things I can confirm from the box.

I'm running Amazon-Linux-2 (ECS-Optimized).

$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

The ECS agent installed and started correctly.

$ sudo systemctl status ecs
● ecs.service - Amazon Elastic Container Service - container agent
   Loaded: loaded (/usr/lib/systemd/system/ecs.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2021-08-14 22:29:06 UTC; 3min 56s ago
     Docs: https://aws.amazon.com/documentation/ecs/
  Process: 4093 ExecStartPre=/usr/libexec/amazon-ecs-init pre-start (code=exited, status=0/SUCCESS)
 Main PID: 4207 (amazon-ecs-init)
    Tasks: 7
   Memory: 73.0M
   CGroup: /system.slice/ecs.service
           └─4207 /usr/libexec/amazon-ecs-init start

Supporting Log Snippets

Basically this just keeps happening over and over again.

$ less /var/log/ecs/ecs-init.log
...
2021-08-14T22:30:14Z [INFO] Starting Amazon Elastic Container Service Agent
2021-08-14T22:30:35Z [INFO] Agent exited with code 1
2021-08-14T22:30:35Z [WARN] ECS Agent failed to start, retrying in 4.235010051s
2021-08-14T22:30:39Z [INFO] Container name: /ecs-agent
2021-08-14T22:30:39Z [INFO] Removing existing agent container ID: daf1a3cf3a283c761fe038e08dbedcf7fbac7870e89a41c45120693b525ee13f

$ less /var/log/ecs/ecs-agent.log
...
level=error time=2021-08-14T22:33:21Z msg="Unable to register as a container instance with ECS: RequestError: send request failed\ncaused by: Post \"https://ecs.us-east-2.amazonaws.com/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" module=client.go
level=error time=2021-08-14T22:33:21Z msg="Error registering: RequestError: send request failed\ncaused by: Post \"https://ecs.us-east-2.amazonaws.com/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" module=agent.go

There is a similar problem with CloudFormation's cfn-signal.

$ less /var/log/cfn-init.log
2021-08-14 22:25:56,687 [DEBUG] Sleeping for 0.895379 seconds before retrying
2021-08-14 22:25:57,583 [DEBUG] Signaling resource ECSAutoScalingGroup in stack myStack with unique ID i-0d2b8edf184330cde and status SUCCESS
2021-08-14 22:26:57,584 [WARNING] Timeout of 60 seconds breached

I can ping Google's IPv6 test host from the instance.

$ ping6 ipv6.google.com
PING ipv6.google.com(ord37s18-in-x0e.1e100.net (2607:f8b0:4009:805::200e)) 56 data bytes
64 bytes from ord37s18-in-x0e.1e100.net (2607:f8b0:4009:805::200e): icmp_seq=1 ttl=96 time=18.3 ms
64 bytes from ord37s18-in-x0e.1e100.net (2607:f8b0:4009:805::200e): icmp_seq=2 ttl=96 time=18.4 ms
^C
--- ipv6.google.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms

But I cannot reach ECS.

$ ping6 https://ecs.us-east-2.amazonaws.com
ping: https://ecs.us-east-2.amazonaws.com: Name or service not known

Because they don't have a AAAA record.

$ dig AAAA https://ecs.us-east-2.amazonaws.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.5.2 <<>> AAAA https://ecs.us-east-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 59003
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Nor does CloudFormation.

$ dig AAAA https://cloudformation.us-east-2.amazonaws.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.5.2 <<>> AAAA https://cloudformation.us-east-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 55794
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

😿
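
One caveat on the checks above: ping6 and dig were given the full URL, including the `https://` scheme, so the resolver looked up a name that cannot exist and would return "Name or service not known" / NXDOMAIN whether or not the endpoint has a AAAA record. A more conclusive check queries the bare hostname. Below is a minimal Python sketch of that check, using only the standard library. The helper names `hostname_of` and `ipv6_addresses` are made up for illustration and are not part of any AWS tooling, and the result depends on the resolver the instance uses.

```python
import socket
from urllib.parse import urlsplit


def hostname_of(endpoint):
    """Return the bare hostname from a URL or from a plain hostname."""
    # urlsplit only populates .hostname when a "//" authority marker is
    # present, so add one for inputs that have no scheme.
    parts = urlsplit(endpoint if "://" in endpoint else "//" + endpoint)
    if parts.hostname is None:
        raise ValueError("no hostname in %r" % (endpoint,))
    return parts.hostname


def ipv6_addresses(endpoint, port=443):
    """Return the endpoint's IPv6 addresses, or an empty list if none resolve."""
    try:
        infos = socket.getaddrinfo(
            hostname_of(endpoint), port,
            family=socket.AF_INET6, type=socket.SOCK_STREAM,
        )
    except socket.gaierror:
        # Resolution failed: no AAAA record, or the name does not exist.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for url in (
        "https://ecs.us-east-2.amazonaws.com",
        "https://cloudformation.us-east-2.amazonaws.com",
    ):
        found = ipv6_addresses(url)
        print(hostname_of(url), found if found else "no IPv6 address found")
```

The equivalent manual check is to pass only the hostname to dig, for example `dig AAAA ecs.us-east-2.amazonaws.com`. An empty answer section with status NOERROR would then mean the name exists but has no AAAA record.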

fenxiong commented 3 years ago

Hi, the ECS service endpoint currently does not support IPv6 traffic, as your dig command shows. Since supporting this requires a service-side change and is therefore outside the scope of the ECS agent, I'm transferring the issue to the containers roadmap for tracking.

tbordovsky commented 3 years ago

Thanks FX.

ianneub commented 1 year ago

I'm running into this issue now as well. With the upcoming IPv4 pricing changes this would be a good feature to have.

timzuiddam commented 9 months ago

Related to: https://github.com/aws/containers-roadmap/issues/1340