Open jhmartin opened 4 years ago
Hi @jhmartin, thanks for reaching out.
The Agent health check is currently meant to verify that the Agent, running as a Docker container, is healthy; it does not check the connection to the backend. For the health check, the Agent makes a HEAD request to the endpoint http://localhost:51678/v1/metadata and checks that no error is returned for the request.
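To make the scope of that check concrete, here is a minimal sketch of an equivalent probe. It is not the Agent's actual implementation: the function name and the timeout are assumptions, and only the endpoint URL comes from the comment above.

```python
import http.client
import urllib.error
import urllib.request

# Endpoint quoted in the comment above; the default timeout is an assumption.
METADATA_URL = "http://localhost:51678/v1/metadata"


def agent_process_healthy(url: str = METADATA_URL, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to the endpoint completes without error.

    This only shows that the agent process answers locally. It says
    nothing about the agent's connection to the ECS backend.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed reply, or an HTTP error status.
        return False
```

A probe like this keeps returning True for as long as the local HTTP listener is up, which would explain why an agent that has lost its backend connection can still be reported as healthy.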
Ah, in that case can this be a feature request: have the metadata endpoint return the current value of the heartbeat timer, and return an error if that value exceeds some threshold? https://github.com/aws/amazon-ecs-agent/blob/478c4ec00a43b4713a0acc811cfd1e2d70307f63/agent/acs/handler/acs_handler.go#L484-L489 seems to be where this value is handled.
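A rough sketch of the requested behavior, to make the proposal easier to discuss. Everything here is hypothetical: the agent does not expose this today, and the names, the 120-second threshold, and the choice of 503 as the error status are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatStatus:
    seconds_since_heartbeat: float
    healthy: bool
    http_status: int


def evaluate_heartbeat(last_heartbeat: float, now: float,
                       max_silence: float = 120.0) -> HeartbeatStatus:
    """Map the age of the last control-plane heartbeat to a health response.

    Timestamps are seconds on any monotonic or epoch clock. The threshold
    and the 200/503 mapping are hypothetical, not agent behavior.
    """
    age = max(0.0, now - last_heartbeat)
    healthy = age <= max_silence
    return HeartbeatStatus(age, healthy, 200 if healthy else 503)
```

With such a response, an agent whose last heartbeat was several hours ago would answer the health probe with an error instead of a plain success.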
"An agent that has for whatever reason lost connection with the control plane will show as unhealthy and be recycled by systemd."
A note on Expected Behavior:
systemd is babysitting the ecs-init process itself, i.e. starting the process and restarting it when the process fails. ecs-init is babysitting the ECS Agent container, and the ECS Agent container healthcheck (noted above) is focused solely on the health of the process and not the connection status.
It would be useful to understand better the use cases for having access to connection status from the ECS Agent directly. If anyone reading has further examples/suggested use cases or in general just wants to see this implemented, please comment or +1.
Transferring this to the containers-roadmap to give it more exposure.
I am also noticing similar behavior, i.e. the DescribeContainerInstances API shows agent healthStatus=OK even though ECS shows the agent is disconnected.
1) Logged into the ECS Container instance using SSM and manually stopped the ECS Agent
sh-4.2$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a6f7639f59ee amazon/amazon-ecs-agent:latest "/agent" 15 hours ago Up 15 hours (healthy) ecs-agent
sh-4.2$ sudo docker stop a6f7639f59ee
a6f7639f59ee
sh-4.2$ sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a6f7639f59ee amazon/amazon-ecs-agent:latest "/agent" 16 hours ago Exited (0) 37 minutes ago ecs-agent
2) Container instance metadata is not available because the agent is stopped
sh-4.2$ curl -s http://localhost:51678/v1/metadata | python -mjson.tool
No JSON object could be decoded
sh-4.2$ curl -s http://localhost:51678/v1/metadata
sh-4.2$ curl -v http://localhost:51678/v1/metadata
* Trying 127.0.0.1:51678...
* connect to 127.0.0.1 port 51678 failed: Connection refused
* Trying ::1:51678...
* connect to ::1 port 51678 failed: Connection refused
* Failed to connect to localhost port 51678 after 0 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 51678 after 0 ms: Connection refused
3) Describing the Container Instance to review healthStatus and agentConnected
>$ aws ecs describe-container-instances \
--cluster private-linux \
--container-instances ef841c1dc2axxxxxxxx1d3686fe9f1dea \
--region us-east-2 \
--include CONTAINER_INSTANCE_HEALTH \
--query 'containerInstances[].{containerInstanceArn: containerInstanceArn, healthStatus: healthStatus, agentConnected: agentConnected}' \
--output json
[
{
"containerInstanceArn": "arn:aws:ecs:us-east-2:8xxxxxxxxxx0:container-instance/private-linux/ef841c1dc2axxxxxxxx1d3686fe9f1dea",
"healthStatus": {
"overallStatus": "OK",
"details": [
{
"type": "CONTAINER_RUNTIME",
"status": "OK",
"lastUpdated": "2022-11-08T09:19:47-05:00",
"lastStatusChange": "2022-11-07T17:54:57-05:00"
}
]
},
"agentConnected": false
}
]
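The mismatch in the output above can be detected mechanically. The following is a small sketch, assuming the JSON shape produced by the --query expression used in this comment; the function name is made up for illustration.

```python
import json


def find_disconnected_but_ok(instances: list) -> list:
    """Return ARNs whose overall health is OK while agentConnected is false."""
    flagged = []
    for inst in instances:
        overall = inst.get("healthStatus", {}).get("overallStatus")
        if overall == "OK" and inst.get("agentConnected") is False:
            flagged.append(inst.get("containerInstanceArn"))
    return flagged


# Trimmed copy of the describe-container-instances output shown above.
sample = json.loads("""
[
  {
    "containerInstanceArn": "arn:aws:ecs:us-east-2:8xxxxxxxxxx0:container-instance/private-linux/ef841c1dc2axxxxxxxx1d3686fe9f1dea",
    "healthStatus": {"overallStatus": "OK"},
    "agentConnected": false
  }
]
""")

for arn in find_disconnected_but_ok(sample):
    print("healthStatus is OK but the agent is disconnected:", arn)
```

Until the health status reflects connectivity, a check like this on agentConnected is one way to catch the condition from outside the instance.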
Summary
The ECS agent healthcheck can report healthy even if the agent is no longer connected to the ECS control plane.
Description
The ECS agent showed as Connected: false in the ECS console. It continued to run and respond on network ports. Container credentials provided by the agent expired. The ECS-agent logs showed that the ACS connection recycle had stopped about 4 hours earlier. No errors were logged by the agent. The agent successfully logged calls to /v2/credentials.
Expected Behavior
An agent that has for whatever reason lost connection with the control plane will show as unhealthy and be recycled by systemd.
Observed Behavior
The agent reported as healthy.
Environment Details
Docker info:
metadata:
OS:
df -h
Node time was in sync.