aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS][request] Agent health check reports healthy when ECS shows agent disconnected #1071

Open jhmartin opened 4 years ago

jhmartin commented 4 years ago

Summary

The ECS agent healthcheck can report healthy even if the agent is no longer connected to the ECS control plane.

Description

The ECS agent showed as Connected: false in the ECS console. It continued to run and respond on network ports. Container credentials provided by the agent expired. The ECS-agent logs showed that the ACS connection recycle had stopped about 4 hours earlier. No errors were logged by the agent. The agent successfully logged calls to /v2/credentials.

Expected Behavior

An agent that has for whatever reason lost connection with the control plane will show as unhealthy and be recycled by systemd.

Observed Behavior

The agent reported as healthy.

Environment Details

Docker info:

Client:
 Debug Mode: false

Server:
 Containers: 24
  Running: 22
  Paused: 0
  Stopped: 2
 Images: 16
 Server Version: 19.03.6-ce
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.186-146.268.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.41GiB
 Name: ip-10-153-72-219
 ID: W23X:TVXI:IAFK:HGGR:52AK:4433:6ABW:ST4G:HZX3:SQ2R:5RC6:OU5I
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

metadata:

{
  "Cluster": "REDACTED",
  "ContainerInstanceArn": "arn:aws:ecs:us-west-2:REDACTED:container-instance/REDACTED",
  "Version": "Amazon ECS Agent - v1.43.0 (1ebf0604)"
}

OS:

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G  4.0K   16G   1% /dev/shm
tmpfs            16G  1.9M   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1   89G  7.3G   81G   9% /
tmpfs           512M     0  512M   0% /tmp
/dev/nvme1n1    200G  8.0G  192G   4% /var/lib/docker
SNIP containers

Node time was in sync.

shubham2892 commented 4 years ago

Hi @jhmartin, thanks for reaching out.

The Agent health check is currently meant to verify only that the Agent is running healthy as a Docker container; it does not check the connection to the backend. For the health check, the Agent makes a HEAD request to the endpoint http://localhost:51678/v1/metadata (more details here) and checks that no error is returned for the request.
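To make the scope of that check concrete, here is a minimal sketch of an equivalent probe in Python. It is not the agent's own code; the function name `agent_responds`, the timeout, and the proxy bypass are assumptions made for illustration. It issues a HEAD request to the introspection endpoint and only reports whether the request completed without error.

```python
import urllib.error
import urllib.request

# Introspection endpoint quoted in the comment above.
METADATA_URL = "http://localhost:51678/v1/metadata"


def agent_responds(url: str = METADATA_URL, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to url completes without error.

    This only shows that the local HTTP server answers. It says nothing
    about whether the agent is still connected to the ECS backend.
    """
    request = urllib.request.Request(url, method="HEAD")
    # Assumption: the probe is local, so ignore any proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("agent responds:", agent_responds())
```

A probe like this keeps returning True for as long as the local HTTP server answers, which matches the behavior reported in this issue.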

jhmartin commented 4 years ago

Ah, in that case can this be an FR such that the metadata endpoint returns the current value of the heartbeat timer, and returns an error if that value exceeds some threshold? https://github.com/aws/amazon-ecs-agent/blob/478c4ec00a43b4713a0acc811cfd1e2d70307f63/agent/acs/handler/acs_handler.go#L484-L489 seems to be where this value is handled.
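A rough sketch of the requested behavior, in Python for brevity. Everything here is hypothetical: the agent exposes no such function, and the names `heartbeat_http_status` and `MAX_HEARTBEAT_AGE` and the five-minute threshold are assumptions chosen only to illustrate the idea of turning a stale heartbeat into an error response.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the request above only asks for "some value".
MAX_HEARTBEAT_AGE = timedelta(minutes=5)


def heartbeat_http_status(last_heartbeat, now, max_age=MAX_HEARTBEAT_AGE):
    """Map the age of the last backend heartbeat to an HTTP status code.

    Returns 200 while the heartbeat is fresh and 503 once it is older
    than max_age, so that an HTTP-based health check would start failing.
    """
    age = now - last_heartbeat
    return 200 if age <= max_age else 503


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # The description reports the connection going quiet about 4 hours earlier.
    print(heartbeat_http_status(now - timedelta(hours=4), now))
```

With a mapping like this, the existing health check would begin to fail once the heartbeat went stale, without changing how the check itself is invoked.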

fierlion commented 4 years ago

"An agent that has for whatever reason lost connection with the control plane will show as unhealthy and be recycled by systemd."

A note on Expected Behavior:

Systemd is babysitting the ecs-init process itself, i.e. starting the process and restarting it when the process fails. ecs-init is babysitting the ECS Agent container, and the ECS Agent container healthcheck (noted above) is focused solely on the health of the process and not the connection status.

It would be useful to understand better the use cases for having access to connection status from the ECS Agent directly. If anyone reading has further examples/suggested use cases or in general just wants to see this implemented, please comment or +1.

fierlion commented 4 years ago

Transferring this to the containers-roadmap to give it more exposure.

dushyant8858 commented 1 year ago

I am also noticing similar behavior, i.e. the DescribeContainerInstances API shows agent healthStatus=OK even though ECS shows the agent as disconnected...

1) Logged into the ECS Container instance using SSM and manually stopped the ECS Agent

sh-4.2$ sudo docker ps
CONTAINER ID   IMAGE                            COMMAND    CREATED        STATUS                  PORTS     NAMES
a6f7639f59ee   amazon/amazon-ecs-agent:latest   "/agent"   15 hours ago   Up 15 hours (healthy)             ecs-agent

sh-4.2$ sudo docker stop a6f7639f59ee
a6f7639f59ee

sh-4.2$ sudo docker ps -a
CONTAINER ID   IMAGE                            COMMAND    CREATED        STATUS                      PORTS     NAMES
a6f7639f59ee   amazon/amazon-ecs-agent:latest   "/agent"   16 hours ago   Exited (0) 37 minutes ago             ecs-agent

2) Container instance metadata is not returned because the agent is stopped

sh-4.2$ curl -s http://localhost:51678/v1/metadata | python -mjson.tool
No JSON object could be decoded

sh-4.2$ curl -s http://localhost:51678/v1/metadata

sh-4.2$ curl -v http://localhost:51678/v1/metadata
*   Trying 127.0.0.1:51678...
* connect to 127.0.0.1 port 51678 failed: Connection refused
*   Trying ::1:51678...
* connect to ::1 port 51678 failed: Connection refused
* Failed to connect to localhost port 51678 after 0 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 51678 after 0 ms: Connection refused

3) Describing the container instance to review healthStatus and agentConnected

>$ aws ecs describe-container-instances \
       --cluster private-linux \
       --container-instances ef841c1dc2axxxxxxxx1d3686fe9f1dea \
       --region us-east-2 \
       --include CONTAINER_INSTANCE_HEALTH \
       --query 'containerInstances[].{containerInstanceArn: containerInstanceArn, healthStatus: healthStatus, agentConnected: agentConnected}' \
       --output json

[
    {
        "containerInstanceArn": "arn:aws:ecs:us-east-2:8xxxxxxxxxx0:container-instance/private-linux/ef841c1dc2axxxxxxxx1d3686fe9f1dea",
        "healthStatus": {
            "overallStatus": "OK",
            "details": [
                {
                    "type": "CONTAINER_RUNTIME",
                    "status": "OK",
                    "lastUpdated": "2022-11-08T09:19:47-05:00",
                    "lastStatusChange": "2022-11-07T17:54:57-05:00"
                }
            ]
        },
        "agentConnected": false
    }
]
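
Until the health status reflects connectivity, the mismatch can be detected from the outside. The sketch below is one possible approach, assuming the JSON array shape produced by the --query expression above; the function name `healthy_but_disconnected` is made up for illustration.

```python
import json


def healthy_but_disconnected(describe_output: str) -> list:
    """Return ARNs whose overallStatus is OK while agentConnected is false.

    Expects the JSON array produced by the describe-container-instances
    command shown above (fields: containerInstanceArn, healthStatus,
    agentConnected).
    """
    flagged = []
    for instance in json.loads(describe_output):
        status = (instance.get("healthStatus") or {}).get("overallStatus")
        if status == "OK" and instance.get("agentConnected") is False:
            flagged.append(instance["containerInstanceArn"])
    return flagged


if __name__ == "__main__":
    import sys

    # Usage: pipe the command output above into this script.
    for arn in healthy_but_disconnected(sys.stdin.read()):
        print("healthy but disconnected:", arn)
```

Run against the output above, this would flag the single container instance, since it reports overallStatus OK together with agentConnected false.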