aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

ECS agent stops and starts tasks #1872

Closed dineshkm closed 4 years ago

dineshkm commented 5 years ago

Summary

We are experiencing an issue with one of the ECS clusters deployed using CloudFormation in our production network. We have a service that runs two tasks. One of the tasks, running on a container instance, is stopped by the ECS agent and started again after some time.

Description

In the log below, you can see the container being stopped:

2019-06-20T18:09:36Z [INFO] Error from tcs; backing off: websocket client: unable to dial ecs-t-6.us-east-1.amazonaws.com response: : write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:38Z [INFO] Connected to ACS endpoint
2019-06-20T18:09:40Z [INFO] Saving state! module="statemanager"
2019-06-20T18:09:40Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: stopping container [DES-desg]

Can you please explain this behavior of ECS?

There is no change in the container image or definition, there are no errors in our Docker container, and the container was running fine, but this issue happens suddenly and recurs consistently over time.

Expected Behavior

The ECS agent does not stop the container.

Observed Behavior

The ECS agent stops the container.

Environment Details

Docker Server Version: 18.06.1-ce
ECS Agent Version: 1.20.3

Supporting Log Snippets

2019-06-20T18:02:51Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2019-06-20T18:02:51Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2019-06-20T18:02:51Z [WARN] Error publishing metrics: write tcp 10.66.9.203:47724->10.51.177.169:8080: i/o timeout
2019-06-20T18:02:51Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:02:51Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2019-06-20T18:02:52Z [INFO] Establishing a Websocket connection to https://ecs-t-6.us-east-1.amazonaws.com/ws?cluster=DES-E1-USEZ&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a
2019-06-20T18:02:52Z [INFO] Connected to TCS endpoint
2019-06-20T18:02:52Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:03:52Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2019-06-20T18:04:22Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2019-06-20T18:04:58Z [WARN] Error publishing metrics: write tcp 10.66.9.203:36502->10.51.176.207:8080: i/o timeout
2019-06-20T18:05:26Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: task at steady state: RUNNING
2019-06-20T18:05:44Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2019-06-20T18:05:59Z [INFO] Disconnected from ACS
2019-06-20T18:06:13Z [INFO] Establishing a Websocket connection to https://ecs-t-6.us-east-1.amazonaws.com/ws?cluster=DES-E1-USEZ&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a
2019-06-20T18:06:35Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.66.9.203:44018: use of closed network connection for https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:06:49Z [ERROR] Stopping redundant reads on closed network connection: https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:06:56Z [WARN] Unable to extend read deadline for ACS connection: set tcp 10.66.9.203:44018: use of closed network connection
2019-06-20T18:07:00Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: redundant container state change. DES-desg to NONE, but already RUNNING
2019-06-20T18:07:00Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.66.9.203:44018: use of closed network connection for https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:07:01Z [WARN] DockerGoClient: inactivity time exceeded timeout while retrieving stats for container 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:07:02Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: task at steady state: RUNNING
2019-06-20T18:07:02Z [ERROR] Stopping redundant reads on closed network connection: https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:07:02Z [INFO] Reconnecting to ACS in: 253.312033ms
2019-06-20T18:07:19Z [INFO] Establishing a Websocket connection to https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:07:56Z [WARN] Error creating a websocket client: dial tcp: i/o timeout
2019-06-20T18:08:02Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-6.us-east-1.amazonaws.com response: : dial tcp: i/o timeout
2019-06-20T18:08:10Z [INFO] Reconnecting to ACS in: 428.558079ms
2019-06-20T18:08:57Z [INFO] Establishing a Websocket connection to https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:09:36Z [WARN] Error creating a websocket client: write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:36Z [ERROR] Error connecting to TCS: websocket client: unable to dial ecs-t-6.us-east-1.amazonaws.com response: : write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:36Z [INFO] Error from tcs; backing off: websocket client: unable to dial ecs-t-6.us-east-1.amazonaws.com response: : write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:38Z [INFO] Connected to ACS endpoint
2019-06-20T18:09:40Z [INFO] Saving state! module="statemanager"
2019-06-20T18:09:40Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: stopping container [DES-desg]
2019-06-20T18:09:40Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/c838234c-e247-418d-9ffe-f3536231c187]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2019-06-20T18:09:40Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/c838234c-e247-418d-9ffe-f3536231c187]: waiting for any previous stops to complete. Sequence number: 5
2019-06-20T18:09:41Z [INFO] Establishing a Websocket connection to https://ecs-t-6.us-east-1.amazonaws.com/ws?cluster=DES-E1-USEZ&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a
2019-06-20T18:09:43Z [INFO] Connected to TCS endpoint
2019-06-20T18:09:49Z [WARN] DockerGoClient: Unable to decode stats for container 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670: context canceled
2019-06-20T18:09:53Z [INFO] Saving state! module="statemanager"
2019-06-20T18:10:33Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:10:39Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2019-06-20T18:10:50Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: error transitioning container [DES-desg] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2019-06-20T18:10:58Z [INFO] TCS Connection hasn't had any activity for too long; disconnecting
2019-06-20T18:11:06Z [INFO] DockerGoClient: error stopping container 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2019-06-20T18:11:13Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:11:20Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.66.9.203:46394: use of closed network connection for https://ecs-t-6.us-east-1.amazonaws.com/
shubham2892 commented 5 years ago

Hey,

Sorry that you are having trouble with your service. We will need more agent and docker logs from the instance to debug this problem. You can use our logs collector to collect them and email them to me (shugy at amazon dot com). Thanks.

dineshkm commented 5 years ago

Hi, this issue occurred again. In the ECS service events, I got:

service DES-G1-USE4-DESServiceDesignerGreen-1CMB7HJH3IAXI (instance i-0fd8707f689ff5c65) (port 8888) is unhealthy in target-group DES-G1-USE4-tg-desg due to (reason Request timed out)

The task is stopped with the reason that the ELB health check failed, but I am not sure why this happens suddenly. The ELB is configured correctly; the same setup is deployed in multiple regions, yet only in this region and on this particular instance are we seeing this issue. We can also see a pattern: it happens every day around 5:47 AM UTC.

I can see from the Docker container logs that the /health API is responding properly.

Please help me figure out this issue. I have emailed the collected logs.
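
For reference, the reason the target group gives for marking the instance unhealthy can be read straight from the ELB API. A minimal boto3 sketch; the target group ARN is a placeholder, while the instance ID and port come from the event above:

import boto3

# Minimal sketch: ask the target group why the instance is marked unhealthy.
# The TargetGroupArn is a placeholder; the instance ID and port 8888 are the
# ones from the service event above.
elbv2 = boto3.client("elbv2", region_name="us-east-1")

resp = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/DES-G1-USE4-tg-desg/0123456789abcdef",
    Targets=[{"Id": "i-0fd8707f689ff5c65", "Port": 8888}],
)

for desc in resp["TargetHealthDescriptions"]:
    health = desc["TargetHealth"]
    # State is e.g. "unhealthy"; Reason/Description explain why (e.g. Target.Timeout).
    print(desc["Target"]["Id"], health.get("State"), health.get("Reason"), health.get("Description"))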

shubham2892 commented 5 years ago

Hi @dineshkm

One of the task running in a container instance is stopped by ECS agent and started after some time

As seen from the logs, the task that stops is 6aec105b-0272-4e45-b4dc-2567fb32dbaa and the task that starts is 1651700b-8dbe-4690-a34d-59cecc751ad3. Which of the two tasks do you think is restarting?

Also, the container c7f913031c232596a7d16f7851ea1c726ad1744471ee44c7d50c1ae6eeaae76b exits with exit code 137, which means the agent failed to stop the container gracefully and it exited on its own; probably something is wrong with the container itself.

dineshkm commented 5 years ago

Hi @shubham2892, the task is not restarting. As you mentioned, 6aec105b-0272-4e45-b4dc-2567fb32dbaa is stopped and a new task 1651700b-8dbe-4690-a34d-59cecc751ad3 is started after some time. The question is why task 6aec105b-0272-4e45-b4dc-2567fb32dbaa moved from 'RUNNING' to 'STOPPED'.

And about your second point: why does the agent try to stop the container c7f913031c232596a7d16f7851ea1c726ad1744471ee44c7d50c1ae6eeaae76b?

Could you give some possible root causes for this behaviour? What do I need to check?

shubham2892 commented 5 years ago

@dineshkm you can start a task with the same container definition as c7f913031c232596a7d16f7851ea1c726ad1744471ee44c7d50c1ae6eeaae76b, kill the task from the console and check the logs. You should then be able to debug the root cause.

Carlos-E commented 5 years ago

"Request timed out" sounds like you have to check your ECS instance/task security group inbound rules, verify that you are allowing all ports for the load balancer and also having the target group health check traffic port option selected, the load balancer seems to not be able to do the health check properly.

dineshkm commented 5 years ago

Hi, thanks for the response. Everything is configured properly and normally works as expected, but we see this problem suddenly and then it goes away. For example, when it happened yesterday, we saw the following logs in the ECS agent:

2019-03-14T05:41:17Z [ERROR] acs: unable to discover poll endpoint, err: RequestError: send request failed caused by: Post https://ecs.us-east-1.amazonaws.com/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2019-03-14T05:41:17Z [INFO] Reconnecting to ACS in: 260.54724ms
2019-03-14T05:41:18Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-14T05:41:35Z [WARN] Error creating a websocket client: unexpected EOF
2019-03-14T05:41:35Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : unexpected EOF
2019-03-14T05:41:35Z [INFO] Reconnecting to ACS in: 429.920227ms
2019-03-14T05:41:35Z [INFO] TCS Websocket connection closed for a valid reason
2019-03-14T05:41:35Z [INFO] Establishing a Websocket connection to https://ecs-t-8.us-east-1.amazonaws.com/ws?cluster=DES-G1-USE4&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e
2019-03-14T05:41:36Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1

Please let us know the root cause and how to resolve it. This issue is disrupting our production setup quite often.

dineshkm commented 5 years ago

@dineshkm you can start a task with the same container definition as c7f913031c232596a7d16f7851ea1c726ad1744471ee44c7d50c1ae6eeaae76b, kill the task from the console and check the logs. You should then be able to debug the root cause.

Actually my problem is: why is the task killed by ECS?

dineshkm commented 5 years ago

Again today at the same time, 15:41, we have seen the issue where our tasks are killed by ECS.

2019-03-16T05:40:53Z [INFO] ACS Websocket connection closed for a valid reason
2019-03-16T05:40:53Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2Fd6201673-f18f-4e13-8620-5bc5c8479c0a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-16T05:41:08Z [INFO] TCS Websocket connection closed for a valid reason
2019-03-16T05:41:08Z [INFO] Establishing a Websocket connection to https://ecs-t-8.us-east-1.amazonaws.com/ws?cluster=DES-G1-USE4&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2Fd6201673-f18f-4e13-8620-5bc5c8479c0a
2019-03-16T05:41:08Z [INFO] Connected to TCS endpoint
2019-03-16T05:41:23Z [WARN] Error creating a websocket client: read tcp 10.72.72.180:33468->10.72.78.46:8080: i/o timeout
2019-03-16T05:41:23Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : read tcp 10.72.72.180:33468->10.72.78.46:8080: i/o timeout
2019-03-16T05:41:23Z [INFO] Reconnecting to ACS in: 252.379344ms
2019-03-16T05:41:23Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2Fd6201673-f18f-4e13-8620-5bc5c8479c0a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1

This is the error we are seeing from the ECS agent. Can anyone on the support team take a quick look and give us a response we can act on in our setup? Any help is greatly appreciated.

dineshkm commented 5 years ago

@shubham2892 @Carlos-E @aaithal

And again

2019-03-16T05:41:20Z [WARN] Error creating a websocket client: unexpected EOF
2019-03-16T05:41:20Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : unexpected EOF
2019-03-16T05:41:20Z [INFO] Reconnecting to ACS in: 257.219932ms
2019-03-16T05:41:20Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-16T05:41:50Z [WARN] Error creating a websocket client: read tcp 10.72.72.43:38784->10.72.78.46:8080: i/o timeout
2019-03-16T05:41:50Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : read tcp 10.72.72.43:38784->10.72.78.46:8080: i/o timeout
2019-03-16T05:41:50Z [INFO] Reconnecting to ACS in: 418.959179ms
2019-03-16T05:41:51Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-16T05:41:53Z [WARN] Error creating a websocket client: unexpected EOF
2019-03-16T05:41:53Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : unexpected EOF
2019-03-16T05:41:53Z [INFO] Reconnecting to ACS in: 661.636566ms
2019-03-16T05:41:53Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-16T05:42:00Z [INFO] Connected to ACS endpoint
2019-03-16T05:42:02Z [INFO] Saving state! module="statemanager"
2019-03-16T05:42:02Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/abd961b8-47b4-4110-a9d2-85edca5798d7]: stopping container [DES-desg]
2019-03-16T05:42:02Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/288a8ab7-8d41-4f6a-b513-d4e93faf0335]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2019-03-16T05:42:02Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/288a8ab7-8d41-4f6a-b513-d4e93faf0335]: waiting for any previous stops to complete. Sequence number: 21

As you can see, right after the error connecting to ECS, the task is stopped by ECS.

dineshkm commented 5 years ago

Again today at the same time this issue happened and the container was stopped:

2019-03-18T05:40:38Z [INFO] ACS Websocket connection closed for a valid reason
2019-03-18T05:40:54Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2Fd6201673-f18f-4e13-8620-5bc5c8479c0a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-18T05:41:06Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2019-03-18T05:41:06Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2019-03-18T05:41:07Z [INFO] Establishing a Websocket connection to https://ecs-t-8.us-east-1.amazonaws.com/ws?cluster=DES-G1-USE4&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2Fd6201673-f18f-4e13-8620-5bc5c8479c0a
2019-03-18T05:41:08Z [INFO] Connected to ACS endpoint
2019-03-18T05:41:08Z [INFO] Connected to TCS endpoint
2019-03-18T05:41:10Z [INFO] Saving state! module="statemanager"
2019-03-18T05:41:10Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/6f42390d-955f-4e46-a2a3-d8c37790fc22]: stopping container [DES-desg]
2019-03-18T05:41:16Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/74093d4e-5b31-4f68-9c89-7398ff7c9673]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2019-03-18T05:41:16Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/74093d4e-5b31-4f68-9c89-7398ff7c9673]: waiting for any previous stops to complete. Sequence number: 20

hampsterx commented 5 years ago

We are also experiencing this as of about two hours ago. I terminated the instance, and for about half an hour afterwards it was running and stable; then it got into the same state.

2019-03-19T03:34:31Z [INFO] Container 5d6a710d4026cd0055088cd7314ef30075234ab6b43e7dc879838cc9f9a89235 is terminal, stopping stats collection
2019-03-19T03:34:31Z [INFO] Managed task [arn:aws:ecs:ap-southeast-2:030240982787:task/1c5fba77-053e-44ad-b6a8-226405286f76]: wait over; ready to move towards status: RUNNING
2019-03-19T03:34:31Z [INFO] Creating cgroup /ecs/1c5fba77-053e-44ad-b6a8-226405286f76
2019-03-19T03:34:31Z [INFO] Task engine [arn:aws:ecs:ap-southeast-2:030240982787:task/1c5fba77-053e-44ad-b6a8-226405286f76]: pulling container numbers-ipms-approved-ports-sync concurrently
2019-03-19T03:34:31Z [INFO] Task engine [arn:aws:ecs:ap-southeast-2:030240982787:task/1c5fba77-053e-44ad-b6a8-226405286f76]: Recording timestamp for starting image pulltime: 2019-03-19 03:34:31.10855976 +0000 UTC m=+4494.801563696
2019-03-19T03:34:31Z [INFO] Managed task [arn:aws:ecs:ap-southeast-2:030240982787:task/36a414d5-d8b5-42a2-a43d-3985645fa9da]: redundant container state change. sqs-webhook-server to STOPPED, but already STOPPED
2019-03-19T03:34:31Z [INFO] Managed task [arn:aws:ecs:ap-southeast-2:030240982787:task/36a414d5-d8b5-42a2-a43d-3985645fa9da]: redundant container state change. sqs-webhook-server to STOPPED, but already STOPPED
2019-03-19T03:34:32Z [INFO] DockerGoClient: Unable to retrieve stats for container fcd5ffa1e001d47fcbc210becbe49fef3d7d6ff75de53eb34953b4f26d06ca96: inactivity time exceeded timeout
2019-03-19T03:34:32Z [INFO] Updating container reference numbers-ipms-approved-ports-sync in Image State - sha256:6f6841fe7da5af16afc219a634db4d607fd251a2ed4553c6326ea4bbbc15d315
2019-03-19T03:34:32Z [INFO] Saving state! module="statemanager"
2019-03-19T03:34:35Z [INFO] DockerGoClient: Unable to retrieve stats for container d980c33c869087d1fa301f3449733cfe81db064f0dca312386bb710bdd2be272: inactivity time exceeded timeout
2019-03-19T03:34:35Z [INFO] DockerGoClient: Unable to retrieve stats for container 423ddc5f777ec748202c80db044450e6308f20025f8c2d72b9e0136aa8f52bf9: inactivity time exceeded timeout
2019-03-19T03:34:35Z [INFO] Task engine [arn:aws:ecs:ap-southeast-2:030240982787:task/1c5fba77-053e-44ad-b6a8-226405286f76]: Finished pulling container 030240982787.dkr.ecr.ap-southeast-2.amazonaws.com/devoli/numbers in 4.175102935s
2019-03-19T03:34:35Z [INFO] Task engine [arn:aws:ecs:ap-southeast-2:030240982787:task/1c5fba77-053e-44ad-b6a8-226405286f76]: creating container: numbers-ipms-approved-ports-sync
2019-03-19T03:34:35Z [INFO] Task engine [arn:aws:ecs:ap-southeast-2:030240982787:task/1c5fba77-053e-44ad-b6a8-226405286f76]: created container name mapping for task:  numbers-ipms-approved-ports-sync -> ecs-numbers-ipms-approved-ports-sync-9-num
 ./ecs-logs-collector.sh 
Trying to check if the script is running as root ... ok
Trying to resolve instance-id ... ok
Trying to collect system information ... ok
Trying to check disk space usage ... ok
Trying to collect common operating system logs ... ok
Trying to collect kernel logs ... ok
Trying to get mount points and volume information ... ok
Trying to check SELinux status ... ok
Trying to get iptables list ... ok
Trying to detect installed packages ... ok
Trying to detect active system services list ... ok
Trying to gather Docker daemon information ... Timed out, ignoring "docker ps --all --no-truc output " 
ok
Trying to inspect all Docker containers ... 

seems to be timing out sigh

dineshkm commented 5 years ago

@shubham2892 @Carlos-E @aaithal Why is no one responding to my comments? This is really critical in our environment and is affecting production systems. It happened again today. Can someone here take a look at my problem and help me debug this issue?

Today's log:

2019-03-19T05:40:40Z [INFO] Connected to ACS endpoint
2019-03-19T05:41:09Z [INFO] TCS Connection hasn't had any activity for too long; disconnecting
2019-03-19T05:41:09Z [ERROR] Error getting message from ws backend: error: [read tcp 10.72.72.180:50924->10.72.78.46:8080: use of closed network connection], messageType: [-1]
2019-03-19T05:41:09Z [INFO] Error from tcs; backing off: read tcp 10.72.72.180:50924->10.72.78.46:8080: use of closed network connection
2019-03-19T05:41:10Z [INFO] Establishing a Websocket connection to https://ecs-t-8.us-east-1.amazonaws.com/ws?cluster=DES-G1-USE4&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2Fd6201673-f18f-4e13-8620-5bc5c8479c0a
2019-03-19T05:41:11Z [INFO] Connected to TCS endpoint
2019-03-19T05:42:22Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/ec0fe35b-2bcf-42ed-b0a9-b0a7a5766df8]: task at steady state: RUNNING
2019-03-19T05:42:22Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/ec0fe35b-2bcf-42ed-b0a9-b0a7a5766df8]: redundant container state change. DES-desg to RUNNING, but already RUNNING
2019-03-19T05:42:22Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/ec0fe35b-2bcf-42ed-b0a9-b0a7a5766df8]: task at steady state: RUNNING
2019-03-19T05:43:25Z [INFO] Saving state! module="statemanager"
2019-03-19T05:43:25Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/ec0fe35b-2bcf-42ed-b0a9-b0a7a5766df8]: stopping container [DES-desg]
2019-03-19T05:43:35Z [INFO] Saving state! module="statemanager"
2019-03-1

Carlos-E commented 5 years ago

I really don't know what's happening. Have you tried updating the ECS agent to the latest version? You can do it from the console. Another option is to try Fargate and see if it handles things differently. Also, what about the container logs? What do they say? And someone had a similar problem in, I think, #1236; he solved it by adding more CPU or RAM.

petderek commented 5 years ago

@dineshkm It's tricky to determine what is happening from the logs you are providing. There isn't anything regarding the task lifecycle.

2019-03-19T05:41:09Z [ERROR] Error getting message from ws backend: error: [read tcp 10.72.72.180:50924->10.72.78.46:8080: use of closed network connection], messageType: [-1]

Lines like this typically don't indicate that something is wrong. The websocket connections close and reset periodically. The agent itself won't terminate tasks/containers if it loses a connection.

The earlier logs you provided (the connection timeouts) typically mean that a security group is misconfigured. You said your config hasn't changed, so we can investigate on our end and see if there are other potential ways to get that same error message.

Also, what about the container logs? What do they say?

This is another place you should look. See if your application is exiting with any errors. Also, what is the reported exit code when you call DescribeTasks on a stopped task?
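
A minimal boto3 sketch of that check, pulling the stopped reason and per-container exit codes for recently stopped tasks (the cluster name is taken from the logs above; adjust it to yours):

import boto3

# Minimal sketch: list recently stopped tasks and print why ECS stopped them
# and what exit code each container reported.
ecs = boto3.client("ecs", region_name="us-east-1")

stopped = ecs.list_tasks(cluster="DES-G1-USE4", desiredStatus="STOPPED")["taskArns"]

if stopped:
    for task in ecs.describe_tasks(cluster="DES-G1-USE4", tasks=stopped)["tasks"]:
        print(task["taskArn"])
        print("  stoppedReason:", task.get("stoppedReason"))
        for container in task["containers"]:
            print("  container:", container["name"],
                  "exitCode:", container.get("exitCode"),
                  "reason:", container.get("reason"))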

dineshkm commented 5 years ago

@Carlos-E

I really don't know what's happening. Have you tried updating the ECS agent to the latest version? You can do it from the console. Another option is to try Fargate and see if it handles things differently.

No, I haven't tried updating the ECS agent. This is the same ECS agent version we are using in five other regions, but the problem is in this region only.

Also, what about the container logs? What do they say?

I don't see any errors in the container logs. I can see this error in the ECS agent log just before ECS decides to stop the container.

Also Also, someone had a similar problem in i think #1236 he solved it adding more CPU or RAM.

We don't see high CPU or memory utilisation on this instance. Do you want me to add more CPU or RAM to the EC2 instance, or increase the soft limit in the task definition?

dineshkm commented 5 years ago

@petderek Thanks for the response.

It's tricky to determine what is happening from the logs you are providing. There isn't anything regarding the task lifecycle.

2019-03-19T05:41:09Z [ERROR] Error getting message from ws backend: error: [read tcp 10.72.72.180:50924->10.72.78.46:8080: use of closed network connection], messageType: [-1]

Lines like this typically don't indicate that something is wrong. The websocket connections close and reset periodically. Agent itself won't terminate tasks / containers if it loses a connection.

But this is the only error we can see in the logs before the task gets stopped.

The earlier logs you provided (the connection timeouts) typically mean that a security group is misconfigured. You said your config hasn't changed, so we can investigate on our end and see if there are other potential ways to get that same error message.

I have logs collected by the ECS logs collector right after this issue happened. Would that help you narrow down the issue?

Also, what about the container logs? What do they say?

I don't see any error in the container logs, only that it stopped suddenly.

This is another place you should look. See if your application is exiting with any errors. Also, what is the reported exitCode when you call describe task on a stopped task?

In the stopped task console I can see Stopped Reason: Task failed ELB health checks (target group id). This is the same configuration we have had all along, yet the issue happens daily at 5:41 UTC. Is there any other place I should look?

Carlos-E commented 5 years ago

If you don't have a hard limit in your task definition, I think it won't make any difference; if you do, try increasing it.

If the ECS agent is not up to date, try updating it and see if something changes.

Is there a spike in the ELB requests at that hour?

I bring up Fargate once again; it's pricier, but you eliminate instance troubleshooting.

Also, if the issue exists only in one region and you are certain that everything is configured identically in every region, that's on Amazon's side and you should open a support ticket.

But... I think one or several of your containers (if you have more than one) are just dying, not because of ECS but because of some sneaky misconfiguration or error.

The key thing is that it happens every day at the same hour, right? Try troubleshooting around that fact.

hampsterx commented 5 years ago

Containers are still failing this morning. Looking at the ecs-logs-collector.sh output (it did eventually complete):

In the Docker daemon log I see lots of these messages:

Mar 19 03:46:50 ip-172-16-128-154 dockerd[3706]: time="2019-03-19T03:46:50.636602781Z" level=error msg="Handler for GET /v1.21/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
...
Mar 19 03:50:09 ip-172-16-128-154 dockerd[3706]: time="2019-03-19T03:50:09.009703188Z" level=warning msg="Health check for container 3cbf82562afd40a34e8e13ca366f922031685d171d53b98adf8149875fc23171 error: context deadline exceeded: unknown"

My container health checks for the containers on this cluster are simply:

[ "CMD-SHELL", "exit 0" ]

and the ECS agent logs:

2019-03-19T03:00:55Z [INFO] Managed task [arn:aws:ecs:ap-southeast-2:030240982787:task/506c5494-e7fb-4b8c-af6e-1d463a72a9c6]: task at steady state: RUNNING
2019-03-19T03:00:56Z [INFO] Saving state! module="statemanager"
2019-03-19T03:00:58Z [WARN] DockerGoClient: no container healthcheck data returned by Docker
 docker version
Client:
 Version:           18.06.1-ce

Note also that this cluster (scale=1) has been running fine for the past couple of months; this only started happening late yesterday. I terminated the instance so the scaling group would create another one and ensure it runs the latest ECS agent, but the issue is still happening. I just removed the health check on one service task definition and it has been up for 5 minutes now, whereas the other 4 containers are "Up About a minute (unhealthy)" or "Up XX seconds (health: starting)".

@Carlos-E any ideas or suggestions for further diagnosis? Thanks.

Seems like the same issue as #1489.
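
Since Docker itself reports "context deadline exceeded" for the health checks, it can help to look at what Docker records for them. A minimal sketch using the Docker SDK for Python; the container name below is just one of the mapped names from the agent log and may need adjusting:

import docker

# Minimal sketch: dump Docker's own health-check history for one container to see
# whether the failing probes and their output show up there.
client = docker.from_env()
container = client.containers.get("ecs-numbers-ipms-approved-ports-sync-9-num")

health = container.attrs["State"].get("Health")
if health is None:
    print("no health check configured for this container")
else:
    print("status:", health["Status"], "failing streak:", health["FailingStreak"])
    for probe in health["Log"]:
        # Each entry records when the probe ran, its exit code and its output.
        print(probe["Start"], probe["ExitCode"], repr(probe["Output"])[:120])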

Carlos-E commented 5 years ago

Actually, I just had an issue myself: all my tasks died (an error that we are trying to fix) and wouldn't start up again, so I had to spin up an additional EC2 instance. Anyway.

@hampsterx Are you using CloudWatch Logs, or some external logging application? It is easier to just look at the logs from one of those. If you are not, see the awslogs driver.

Man, I think ECS is just being suicidal or something. But seriously, I'd have to take a good look to give you an educated answer or guess. I'm not an expert in this matter, but I'd gladly lend you a hand.

hampsterx commented 5 years ago

@Carlos-E I am using the CloudWatch agent to get file system logs (e.g. cloud-init) and the awslogs driver for containers. I have a Lambda (via Kinesis subscriptions on the log groups) that puts those into Elasticsearch and Kibana for searching, which works great. I have now updated the CloudWatch agent config to include the ecs-agent log as well so I can keep an eye on it.

https://github.com/moby/moby/issues/22221

One thing that is different on this cluster (compared to my other ones, which are running fine) is that I have a couple of the containers in DAEMON mode, which is also mentioned in #1489.

The load on the EC2 instance is about 45% CPU, 40% MEM used.

One poster (2018 - Dec) says:

I am getting this issue under heavy stdout loads

However, the stdout of the containers is minimal, just a small amount of structured logging, so it can't be that issue.

On #1489, a poster (Aug 2018) says:

Seems to happen with older versions of the ECS AMI too. I wonder if there's something weird going on with DataDog with scheduling=DAEMON. I'm going to try just running it outside of Docker on the EC2 instance

I just upgraded to the latest AMI (ap-southeast-2, ami-0d28e5e0f13248294).

Same situation.

However, deploying the containers without using DAEMON, they are fine!

So, seems like something has changed recently with ecs-agent when using DAEMON + Health Checks?

@dineshkm does this apply to you?

The only reason I used DAEMON is basically that these containers are doing sync tasks etc., so I only need one instance running. Since this cluster has scale=1, using DAEMON is not strictly necessary, I guess, as I can just set desired=1 and minimumHealthyPercent=0% so that on update it kills the old container first before spinning up the new one?

Ugh, spoke too soon. Still failing; back to disabling health checks.
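
For what it's worth, the non-DAEMON variant described above (one task, stop the old one before starting the new one on deploys) would look roughly like this with boto3; the cluster and service names are placeholders:

import boto3

# Minimal sketch: run a single replica and let deployments stop the old task before
# starting the new one, instead of using DAEMON scheduling.
ecs = boto3.client("ecs", region_name="ap-southeast-2")

ecs.update_service(
    cluster="my-cluster",                        # placeholder
    service="numbers-ipms-approved-ports-sync",  # placeholder service name
    desiredCount=1,
    deploymentConfiguration={
        "minimumHealthyPercent": 0,   # allow the old task to stop first
        "maximumPercent": 100,        # never run two copies at once
    },
)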

hampsterx commented 5 years ago

ok further update before I do some actual work for the day :)

Turns out I had one container that was exiting (raise NotImplementedError, haha) shortly after startup, which I think is the trigger for everything. The container instance is a t2.micro (we only need a small amount of CPU/memory). I notice the CPU credit balance is at zero; I am thinking perhaps this is causing Docker to fail those health checks?

I have to say, the ecs-agent needs some backoff behaviour. Container restarts over the last 24 hours (see screenshot below); ugh, no wonder the CPU credits drained.

(screenshot: ecs_starts_24h, container restarts over the last 24 hours)
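
To confirm whether the credit balance really bottomed out around those restarts, the burst-credit metric can be pulled from CloudWatch. A minimal boto3 sketch; the instance ID is a placeholder:

import boto3
from datetime import datetime, timedelta, timezone

# Minimal sketch: read the instance's CPUCreditBalance for the last 24 hours to see
# whether it hit zero around the restart storm.
cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")
end = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=end - timedelta(hours=24),
    EndTime=end,
    Period=300,                 # 5-minute buckets
    Statistics=["Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])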

Carlos-E commented 5 years ago

I'm pretty sure it has nothing to do with the CPU credits themselves, although the instance would be painfully slow, and that might be an issue: maybe an instance that is too slow has difficulty responding to the health checks. But I don't think so.

dineshkm commented 5 years ago

@Carlos-E

If you don't have a hard limit in your task definition, I think it won't make any difference; if you do, try increasing it.

Yes, we don't have a hard limit set.

If the ECS agent is not updated, then try updating it, see if something changes.

We don't want to update the ECS agent, since the same version works without any issues in the other regions.

Is there a spike in the ELB requests at that hour?

No, nothing unusual in the requests.

I bring the topic of Fargate once again, it's pricier but you discard instance troubleshooting.

Currently it is not an option for us.

Also if the issue exist only in one region and you are certain that everything is configured identically in every region, that's Amazon's fault and you should just send a ticket or something.

Do we need to raise a ticket with AWS Support from our account?

But...i think one or several of your containers(if you have more than one) are just dying not cuz of ECS but some sneaking misconfiguration or error.

But I don't see any errors in the container logs. Also, we have 3 EC2 instances and 3 tasks of the same Docker image running in total, one on each. Two instances are in the us-east-1a AZ and one is in us-east-1c. The problem is with the two instances in us-east-1a; the other instance is not facing this issue and works as expected. Is that something I need to check?

The key thing in there is the fact that it happens every day at the same hour right? Try troubleshooting around that fact.

Yes, I get only this error in the ECS agent log before ECS decides to stop the container. Also, this error occurs only at the specific time of 5:41 UTC daily.

Instance 1:

2019-03-20T05:41:25Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : unexpected EOF
2019-03-20T05:41:25Z [INFO] Reconnecting to ACS in: 282.334782ms

Instance 2:

2019-03-20T05:40:20Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-20T05:40:50Z [WARN] Error creating a websocket client: read tcp 10.72.72.43:56398->10.72.78.46:8080: i/o timeout
2019-03-20T05:40:50Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : read tcp 10.72.72.43:56398->10.72.78.46:8080: i/o timeout
2019-03-20T05:40:50Z [INFO] Reconnecting to ACS in: 254.521257ms
2019-03-20T05:40:50Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-20T05:41:20Z [WARN] Error creating a websocket client: read tcp 10.72.72.43:56736->10.72.79.241:8080: i/o timeout
2019-03-20T05:41:20Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-8.us-east-1.amazonaws.com response: : read tcp 10.72.72.43:56736->10.72.79.241:8080: i/o timeout
2019-03-20T05:41:20Z [INFO] Reconnecting to ACS in: 393.574639ms
2019-03-20T05:41:20Z [INFO] Establishing a Websocket connection to https://ecs-a-8.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-G1-USE4&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F48aa0d04-602a-45e4-a718-e8aa5a75d46e&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-03-20T05:41:24Z [INFO] Connected to ACS endpoint
2019-03-20T05:43:25Z [INFO] Saving state! module="statemanager"
2019-03-20T05:43:25Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/3c87a313-25f1-445e-929c-7af826bbfc4f]: stopping container [DES-desg]
2019-03-20T05:43:34Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/f32f12f2-aff3-4f0d-895a-81c339dda4ed]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2019-03-20T05:43:34Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/f32f12f2-aff3-4f0d-895a-81c339dda4ed]: waiting for any previous stops to complete. Sequence number: 33

Can you elaborate on why this error happens in ECS? I am not sure what else I should check around this time. Please help.

dineshkm commented 5 years ago

Hi guys, @shubham2892 @Carlos-E @aaithal

We are getting the issue again

The log snippet from the ECS agent (the same one as in the Supporting Log Snippets section above) shows the container being stopped by ECS.


Any update or help?

vfa-thanhld commented 5 years ago

@dineshkm did you fix the issue? I have the same issue too.

dineshkm commented 5 years ago

@vfa-thanhld We suspect this is related to low memory. It looks like our application is consuming more memory and disrupting the instance. We switched from a c5.xlarge to an m5.xlarge instance with 16 GB; so far the issue has not recurred.

mhuebner commented 5 years ago

Same issue here. I already raised a support ticket (not answered yet). My assumption, as already suggested by @dineshkm, is that it must be related to low memory on the EC2 instance. Our very small clustered frontend runs on t3.small instances. When we deployed a new release of our app (Java), we were suddenly hit by this problem. The new release had a bunch of new features and seems to need more memory. Mysteriously, docker stats shows enough free memory for the container (700 MB free). As already mentioned, I suspect that the EC2 instance itself might need more memory.

UPDATE 1: I just reduced the memory usage of the container task; the EC2 instance running the Docker service and the ECS agent now has about 250 MB of memory left for system-critical processes. This did not solve the issue. The EC2 instance is running ECS agent version 1.29.1. Failing ELB health checks keep restarting the task with the reason Request timed out.

UPDATE 2: I resolved this issue by increasing the target group health check "Timeout" setting to 20 seconds. A reasonable approach is to set the timeout to a very high value, check the restart behaviour of the agent, and then decrease the timeout until you get the expected results.
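
For reference, the same timeout change can be made through the API. A minimal boto3 sketch; the target group ARN is a placeholder, and the interval has to stay larger than the timeout:

import boto3

# Minimal sketch: raise the target group health check timeout, as described above.
elbv2 = boto3.client("elbv2")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:eu-west-1:111122223333:targetgroup/my-tg/0123456789abcdef",  # placeholder
    HealthCheckTimeoutSeconds=20,
    HealthCheckIntervalSeconds=30,   # must remain greater than the timeout
)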

ArpitaSheelvant commented 5 years ago

Same issue here. The task gets stopped and started every half hour, with exit code 1 and no failure reason.

yhlee-aws commented 5 years ago

If you are seeing the same issue, could you please send the following information to yhlee at amazon dot com?

ArpitaSheelvant commented 5 years ago

@yunhee-l Replied to the mail.

yhlee-aws commented 5 years ago

@ArpitaSheelvant Please provide logs, otherwise we cannot debug in any meaningful manner.

Villa41 commented 5 years ago

Is there any solution? I am having the same error.

mohit3081989 commented 5 years ago

I am also facing the same problem; please look at https://stackoverflow.com/questions/58059851/aws-ecs-service-tasks-getting-replaced-with-reason-request-timed-out for more details.

I am pasting it here again :

We have been running ECS as our container orchestration layer for more than two years, but there is one problem we have not been able to figure out the reason for. In a few of our (Node.js) services we have started observing errors in the ECS events such as:

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

This causes our dependent services to start experiencing 504 gateway timeouts, which impacts them in a big way.

Things we have tried so far:

  1. Upgraded the Docker storage driver from devicemapper to overlay2

  2. Increased the resources (CPU, RAM and EBS storage) for all ECS instances, based on what we saw in a few containers

  3. Increased the health check grace period for the service from 0 to 240 seconds (see the sketch further below)

  4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds

  5. Enabled awslogs on containers instead of stdout, but there was no unusual behavior

  6. Enabled ECSMetaData at the container level and pipelined all of that information into our application logs. This helped us look at the logs for the problematic container only

  7. Enabled Container Insights for better container-level debugging

Of these, the things that helped the most were upgrading from the devicemapper to the overlay2 storage driver and increasing the health check grace period.

The number of errors has come down dramatically with these two changes, but we still get this issue once in a while.
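
For reference, the grace-period change from item 3 above can be applied per service through the API. A minimal boto3 sketch; the cluster and service names are the anonymised ones used in this post:

import boto3

# Minimal sketch: give newly started tasks a 240-second window before ELB health
# checks can mark them unhealthy and trigger replacement.
ecs = boto3.client("ecs")

ecs.update_service(
    cluster="example-service-cluster",
    service="example-service",
    healthCheckGracePeriodSeconds=240,
)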

We have looked at all the graphs for the instance and the container that went down; below are the logs for it.

ECS Container Insights logs for the victim container:

Query :

fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"

Example log entry returned:

{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}

None of the logs showed ridiculously high CPU or memory utilisation.
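
The same Container Insights query can also be run programmatically over a longer window around the incident. A minimal boto3 sketch, assuming the standard /aws/ecs/containerinsights/<cluster>/performance log group name:

import time
import boto3
from datetime import datetime, timedelta, timezone

# Minimal sketch: run the Container Insights query above via the CloudWatch Logs
# Insights API; the log group name follows the standard Container Insights layout.
logs = boto3.client("logs")
end = datetime.now(timezone.utc)

query_id = logs.start_query(
    logGroupName="/aws/ecs/containerinsights/example-service-cluster/performance",
    startTime=int((end - timedelta(hours=3)).timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        'fields CpuUtilized, MemoryUtilized, @message '
        '| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" '
        'and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"'
    ),
)["queryId"]

# Poll until the query finishes, then print each matching row.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})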

We stopped getting responses from the victim container at, let's say, t1; we got errors in dependent services at t1+2 minutes; and the container was taken away by ECS at t1+3 minutes.

Our health check configuration is below:

Protocol: HTTP
Path: /healthcheck
Port: traffic port
Healthy threshold: 10
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 10 seconds
Success codes: 200

Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, we found nothing we could identify as the cause.

mohit3081989 commented 5 years ago

There is also a similar issue, https://github.com/aws/amazon-ecs-init/issues/258, which no one seems to be replying to.

cyastella commented 5 years ago

Hi, the error message “service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)” means that the task was stopped by the ECS service scheduler because of an ELB issue. To investigate further, when you can reproduce the error, can you provide the debug-level agent logs and Docker logs (see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html) to ecs-agent-external AT amazon dot com? When the error happens, can you also provide the timestamp, region and error message (see https://docs.aws.amazon.com/AmazonECS/latest/userguide/service-event-messages.html)?

Thanks

cyastella commented 5 years ago

Closing this now since this needs more info in order for us to troubleshoot. Feel free to reopen if you have a chance to reproduce and send us the logs.

Thanks,

mohit3081989 commented 4 years ago

Hi @cyastella ,

Can you please reopen this? There was a month's delay in response on your end, due to which I couldn't actively communicate here. I got back to this issue, and here are the logs I found for the task that faced the problem:

t.txt

fierlion commented 4 years ago

I'm closing this as a duplicate; we are working on this in mohit3081989's ecs-init issue here: https://github.com/aws/amazon-ecs-init/issues/258

Zogoo commented 3 years ago

@fierlion I could not see any proper answer or explanation, and I don't think this is really a duplicate of the other issue, because I have examples where the ecs-agent freezes for a few minutes, and only during production deployments.

  1. Our new ECS tasks freeze during startup:
    I, [2021-10-01T18:19:42.159506 #8] INFO -- : Refreshing Gem list
    2021-10-01T18:22:54.020+09:00
    W, [2021-10-01T18:22:54.020073 #8] WARN -- : Creating scope :invited. Overwriting existing method User.invited.

    Normally this takes about 20 seconds:

    2021-10-05T08:57:08.307+09:00 I, [2021-10-05T08:57:08.304768 #10] INFO -- : Refreshing Gem list
    2021-10-05T08:57:26.728+09:00 W, [2021-10-05T08:57:26.728389 #10] WARN -- : Creating scope :invited. Overwriting existing method User.invited.
  2. The ECS event logs disappeared during that time, and there is no error here:
    2021-10-01 18:24:07 +0900
    service prod-cluster-ecs-xxx-service has stopped 2 running tasks: task 2283435b6b4e429dbd4f290de1f3a165 task 571a86d8a0604bf78522caa83934656d.
    3d4333fc-1e4b-4ed2-b6ec-0078609836dd
    2021-10-01 18:19:31 +0900
    service prod-cluster-ecs-xxx-service registered 1 targets in target-group tl-pro20201224052257744700000002
    e04bbcd0-914a-4605-a539-41f2a9a4ebc3
  3. Meanwhile, the logs of other existing ECS tasks simply blacked out during exactly that time.
  4. In CloudWatch we don't see CPU or memory utilization high enough to affect the containers.
  5. The exact same environment (created by the same Terraform script) works fine in another AWS account.