aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Task stuck in pending state #731

Closed coryodaniel closed 6 years ago

coryodaniel commented 7 years ago

Whenever I push a new task to ECS, the new tasks hang in a PENDING state. If I SSH in and restart the ECS agent, they transition to RUNNING and the previous tasks transition to STOPPED. Subsequent deploys work, but after some period of time deploys cease to work again and the process has to be repeated. In the ECS agent docker logs I see the following warnings:

Anything relevant in the ECS Agent logs (located at /var/log/ecs)

2017-03-13T17:56:06.823635005Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: 6aac647a4737cc592852b4fa3dea0e2670eba47cac12e6783a8225093be66876
2017-03-13T17:56:06.823646905Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: b367545a0e99cf3e834b446772e2fe3828c1e09150c0b6645dcb15278615a64c
2017-03-13T17:56:06.823650772Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: c723a3bc0745f908e330ac04f6cfe35c5e3dbb25d3706a3a9ac0b94f223de178
2017-03-13T17:56:06.823654194Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: 08f87808b72a246c154b271b3309240abda623330b289f409aa16fafe468e440
2017-03-13T17:56:06.823657545Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: c3ab3d3cadbed5587ffb2a3a31da89266cbc1dfb74ff211431baa93afc244159
2017-03-13T17:56:06.823660881Z 2017-03-13T17:56:06Z [WARN] Error getting instance metrics: No task metrics to report
2017-03-13T17:56:40.274042585Z 2017-03-13T17:56:40Z [INFO] No eligible images for deletion for this cleanup cycle
2017-03-13T17:56:40.276963819Z 2017-03-13T17:56:40Z [INFO] End of eligible images for deletion

Version Info

{
  "agentVersion": "1.14.0",
  "agentHash": "f88e52e",
  "dockerVersion": "DockerVersion: 1.12.6"
}

The agentConnected value when you call DescribeContainerInstances for that instance: "agentConnected": true,
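
(For reference, I checked this with the AWS CLI along these lines; the cluster name and container-instance ARN below are placeholders.)

# Check whether the ECS agent is reported as connected for this instance
aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances arn:aws:ecs:us-west-2:123456789012:container-instance/example \
  --query 'containerInstances[].agentConnected' \
  --output text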

Output of docker info

Containers: 22
 Running: 11
 Paused: 0
 Stopped: 11
Images: 8
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file:
 Metadata file:
 Data Space Used: 6.992 GB
 Data Space Total: 23.35 GB
 Data Space Available: 16.36 GB
 Metadata Space Used: 2.765 MB
 Metadata Space Total: 62.91 MB
 Metadata Space Available: 60.15 MB
 Thin Pool Minimum Free Space: 2.335 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.19-29.55.amzn1.x86_64
Operating System: Amazon Linux AMI 2016.09
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: ip-192-168-8-6
ID: 335R:7NKS:FOKJ:ESGT:B3WV:SS5P:P5G6:NPCP:65T3:EZF5:YYH6:WF33
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Anything relevant in the Docker daemon log (located at /var/log/docker):

time="2017-03-13T18:17:27.851502907Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:34:48.525604354Z" level=info msg="Container b20a0a06cf43ea625ae2e92b0fcf580a9bca4b163fd519300fde7d4542b99a13 failed to exit within 1200 seconds of signal 15 - using the force"
time="2017-03-13T18:34:48.527962030Z" level=info msg="Container e092600d8e66de0da1255c681fc79827f9439897681ec4178622ab2b2182904f failed to exit within 1200 seconds of signal 15 - using the force"
time="2017-03-13T18:34:48.534819749Z" level=info msg="Container 9a15d3f8504c331af47be01a0af8477acb3d86487694ae2f299b58f275f9ab42 failed to exit within 1200 seconds of signal 15 - using the force"
time="2017-03-13T18:34:48.977125171Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:34:48.977972016Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:34:48.979214737Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:53:39.928005206Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:54:55.906842370Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:54:59.642429960Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"

Output of docker inspect for the container that appears to be stuck: the Docker container does not show up at all. If I call curl 127.0.0.1:51678/v1/tasks I see the task in PENDING with a desired state of RUNNING, and the dockerId is null.
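
(As a quick way to see the same thing, the agent introspection endpoint can be pretty-printed; this is just the call above piped through a JSON formatter.)

# Pretty-print the agent introspection API response (task/container state as the agent sees it)
curl -s http://127.0.0.1:51678/v1/tasks | python -m json.tool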

liwenwu-amazon commented 7 years ago

I am sorry that you're experiencing problems using ECS. Can you provide the following information? If you are not comfortable posting it publicly on GitHub, you can email it directly to me at liwenwu at amazon.com.

coryodaniel commented 7 years ago

@liwenwu-amazon Yes! I'll email you them in the morning. Thanks!

coryodaniel commented 7 years ago

Thanks @liwenwu-amazon. I just sent you an email.

richardpen commented 7 years ago

@coryodaniel Thanks for sending the logs. After taking a look at them, I think the task was stuck waiting for the previous task to stop. Since the logs aren't at debug level, we don't have enough information to find the root cause. To help us find it, could you set the log level to debug by adding ECS_LOGLEVEL=debug to /etc/ecs/ecs.config, then reproduce this issue and collect the logs with this tool?
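
A rough sketch of that change, assuming the Amazon Linux ECS-optimized AMI where the agent runs under upstart:

# Turn on debug logging for the ECS agent, then restart it
echo 'ECS_LOGLEVEL=debug' | sudo tee -a /etc/ecs/ecs.config
sudo stop ecs
sudo start ecs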

Also, could you provide the following information regarding the previous logs:

Thanks, Richard

coryodaniel commented 7 years ago

We did set ECS_CONTAINER_STOP_TIMEOUT and we are trapping SIGTERM and exiting 0.
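
(For context, our entrypoint does roughly the following; this is a simplified sketch and the application path is a placeholder.)

#!/bin/sh
# Forward SIGTERM to the app, wait for it to finish, then exit 0
term_handler() {
  kill -TERM "$child" 2>/dev/null
  wait "$child"
  exit 0
}
trap term_handler TERM
/usr/local/bin/my-app &   # placeholder for the real application process
child=$!
wait "$child"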

I don't recall seeing any errors in the docker logs, but I'll look when I reproduce. Where should I send the logs?

richardpen commented 7 years ago

@coryodaniel Please send to my email: penyin (at) amazon.com

Thanks, Richard

coryodaniel commented 7 years ago

This is still an issue; someone nuked the instances that I had the debug logs on. I'll send them as soon as it hangs again.

aaithal commented 7 years ago

Hi @coryodaniel, apologies for the late response here. Thank you for sending these logs. To explain in detail what's happening here: the behavior that you described in your first comment when you created the issue is how the ECS Agent is supposed to function:

Whenever I push a new task to ECS the new tasks will hanging in a PENDING state.

Whenever the ECS Agent receives new tasks to start from the ECS backend (PENDING->RUNNING), and it also has other tasks to stop, it won't start the new tasks until those old tasks are stopped. This ensures that resources are actually available to the containers in the new task when they come up; not waiting for the containers of tasks in the RUNNING -> STOPPED transition can lead to incorrect allocation of resources on the host.

In the logs that you sent, the agent was told to stop 4 running/pending tasks and start 4 new tasks. It was waiting for the running tasks to stop before starting the new ones. The real issue is that the docker daemon is taking a very long time to stop these containers, which is why there's a delay in stopping old tasks and subsequently a delay in starting new tasks. It's not clear from the docker logs why stopping these containers is slow. The most common reason for such slowness that we've seen is the EBS volume running out of IO credits. You can check the BurstBalance metric, which provides visibility into the credit balance of gp2 volumes. If that's not the root cause, then running the docker daemon in debug mode and sending us those logs when this happens again would be the best course of action to help us further diagnose what's causing this.
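
(A sketch of how to pull that metric with the AWS CLI; the volume ID and time range below are placeholders.)

# Minimum BurstBalance (%) for the instance's gp2 volume over the window of interest
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2017-03-13T00:00:00Z \
  --end-time 2017-03-14T00:00:00Z \
  --period 300 \
  --statistics Minimum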

If I SSH in and restart the ECS agent, they transition to RUNNING and the previous tasks transition to STOPPED.

This, however, is actually a bug in the ECS Agent: after a restart, new tasks are started before the tasks that are supposed to be stopped have actually stopped.

coryodaniel commented 7 years ago

I can check the BurstBalance metric next time we deploy. We were logging inside our application, and the tasks never receive a SIGTERM; by never I mean we let them run for hours and they never received it. Not sure if that helps point us in any other direction?

coryodaniel commented 7 years ago

@aaithal checked the burst metrics and it was 99.998% available at the time of the issue.

reedflinch commented 7 years ago

Just wanted to drop in and say that I'm having a similar issue with tasks stuck in PENDING state. I'm unable to get logs now, but can report back if necessary.

I've set ECS_CONTAINER_STOP_TIMEOUT=5m and I'm running ecs-agent v1.14.2 and Docker v17.03.1-ce.

vsiddharth commented 7 years ago

ecs-agent v1.14.2 suffers from #833, which was resolved recently. Since then we have released a newer agent, v1.14.3, and published new AMIs.

reedflinch commented 7 years ago

Perfect, thank you @vsiddharth. I have actually prepared to update to this newest ECS-optimized AMI this weekend 🙂

Side note: I was surprised to see this AMI update because I usually get notifications/emails of newly released versions and had not received one. I understand that this was pushed out very quickly though, so not a problem now that I know!

coryodaniel commented 7 years ago

We updated to the new agent and are still experiencing the issue. Are there any other logs we can get to you to troubleshoot?

adnxn commented 7 years ago

hey @coryodaniel - just to clarify: which agent version are you running? Are you on the latest ECS-optimized AMI as well?

@aaithal checked the burst metrics and it was 99.998% available at the time of the issue.

Thanks for checking this. Since this doesn't seem to be the problem, we can move forward and start investigating debug logs. It would be very helpful if you could reproduce the problem with the agent running with debug-level logs and Docker running in debug mode.
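
A sketch of how to enable daemon debug logging, assuming the stock /etc/sysconfig/docker layout on the Amazon Linux ECS-optimized AMI:

# Add -D to the Docker daemon options, then restart Docker and the agent
sudo sed -i 's/^OPTIONS="/OPTIONS="-D /' /etc/sysconfig/docker
sudo service docker restart
sudo start ecs   # the agent container stops when Docker restarts; start it again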

Are there any other logs we can get to you to troubleshoot?

If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send them to me directly at adnkha at amazon dot com, and please feel free to comment on this issue when you've sent it out. Thanks! =]

curena commented 7 years ago

We're experiencing a very similar issue at my company. We have the latest AMI (as of today, amzn-ami-2017.03.d-amazon-ecs-optimized, ami-57d9cd2e.)

Whenever we deploy a new task definition, the ECS Agent attempts to stop the previously running containers, but the Docker daemon seems to just spin its wheels. On the agent side, 1 minute after the first time it issues a STOP request, we get the error:

2017-07-13T18:14:45Z [INFO] DockerTimeoutError for 'docker stop' of container; ignoring state change;  task: rcx-director-task-dev-us-west-2:422 arn:aws:ecs:us-west-2:969204706979:task/0878e08e-c28c-4337-bc94-d13eeda5abf3, Status: (RUNNING->STOPPED) Containers: [rcx-director-task-dev-us-west-2 (RUNNING->STOPPED),], container: rcx-director-task-dev-us-west-2(304872096477.dkr.ecr.us-west-2.amazonaws.com/krp/rcx-director:3.2467) (RUNNING->STOPPED), error: Could not transition to stopped; timed out after waiting 1m0s

And this every 5 seconds:

2017-07-13T18:29:53Z [INFO] Error retrieving stats for container b8f7c535b706d45a16e946c68c42b615927d8ab2fa0c8a1dd374d8bac712351b: inactivity time exceeded timeout

This happens every minute afterwards until about 15 minutes later, at which point we finally get the DIE event from Docker for that particular container:

2017-07-13T18:30:12Z [DEBUG] Got event from docker daemon module="TaskEngine" event="&{Action:die Type:container Actor:{ID:c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35 Attributes:map[name:ecs-rcx-director-task-dev-us-west-2-422-rcx-director-task-dev-us-west-2-a6fee8eaeee2fa875e00 com.amazonaws.ecs.cluster:krp-service-ecs-us-west-2 com.amazonaws.ecs.container-name:rcx-director-task-dev-us-west-2 com.amazonaws.ecs.task-arn:arn:aws:ecs:us-west-2:969204706979:task/0878e08e-c28c-4337-bc94-d13eeda5abf3 com.amazonaws.ecs.task-definition-family:rcx-director-task-dev-us-west-2 com.amazonaws.ecs.task-definition-version:422 exitCode:143 image:304872096477.dkr.ecr.us-west-2.amazonaws.com/krp/rcx-director:3.2467]} Status:die ID:c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35 From:304872096477.dkr.ecr.us-west-2.amazonaws.com/krp/rcx-director:3.2467 Time:1499970612 TimeNano:1499970612661283113}"

On the Docker side, at the same time:

time="2017-07-13T18:14:15.791176524Z" level=info msg="Container c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35 failed to exit within 30 seconds of signal 15 - using the force" 
time="2017-07-13T18:14:45.791652199Z" level=debug msg="Calling POST /v1.17/containers/c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35/stop?t=30" 
time="2017-07-13T18:15:45.793161798Z" level=debug msg="Calling POST /v1.17/containers/c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35/stop?t=30" 

... Every minute from here on, until 15 minutes later. After 15 minutes,

time="2017-07-13T18:30:12.661686530Z" level=debug msg="Revoking external connectivity on endpoint ecs-rcx-director-task-dev-us-west-2-422-rcx-director-task-dev-us-west-2-a6fee8eaeee2fa875e00 (07267023da6183112ccb933721fae9dff68fef1ddc4054d30af1b1ebb04f4b4f)" 
time="2017-07-13T18:30:12.662042563Z" level=debug msg="Calling GET /v1.17/containers/c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35/json" 
time="2017-07-13T18:30:12.662454284Z" level=debug msg="/sbin/iptables, [-t nat -C DOCKER -p tcp -d 0/0 --dport 32787 -j DNAT --to-destination 172.17.0.2:8080 ! -i docker0]" 
time="2017-07-13T18:30:12.664710386Z" level=debug msg="/sbin/iptables, [-t nat -D DOCKER -p tcp -d 0/0 --dport 32787 -j DNAT --to-destination 172.17.0.2:8080 ! -i docker0]" 
time="2017-07-13T18:30:12.666424400Z" level=debug msg="/sbin/iptables, [-t filter -C DOCKER ! -i docker0 -o docker0 -p tcp -d 172.17.0.2 --dport 8080 -j ACCEPT]" 
time="2017-07-13T18:30:12.668299534Z" level=debug msg="/sbin/iptables, [-t filter -D DOCKER ! -i docker0 -o docker0 -p tcp -d 172.17.0.2 --dport 8080 -j ACCEPT]" 
time="2017-07-13T18:30:12.670087608Z" level=debug msg="/sbin/iptables, [-t nat -C POSTROUTING -p tcp -s 172.17.0.2 -d 172.17.0.2 --dport 8080 -j MASQUERADE]" 
time="2017-07-13T18:30:12.671832129Z" level=debug msg="/sbin/iptables, [-t nat -D POSTROUTING -p tcp -s 172.17.0.2 -d 172.17.0.2 --dport 8080 -j MASQUERADE]" 
time="2017-07-13T18:30:12.818506002Z" level=debug msg="Releasing addresses for endpoint ecs-rcx-director-task-dev-us-west-2-422-rcx-director-task-dev-us-west-2-a6fee8eaeee2fa875e00's interface on network bridge" 
time="2017-07-13T18:30:12.818537240Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.2)" 
time="2017-07-13T18:30:12.865082903Z" level=debug msg="devmapper: UnmountDevice START(hash=417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.865120164Z" level=debug msg="devmapper: Unmount(/var/lib/docker/devicemapper/mnt/417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.906867615Z" level=debug msg="devmapper: Unmount done" 
time="2017-07-13T18:30:12.906894245Z" level=debug msg="devmapper: deactivateDevice START(417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.906968023Z" level=debug msg="devicemapper: RemoveDeviceDeferred START(docker-202:1-263222-417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.985097971Z" level=debug msg="devicemapper: RemoveDeviceDeferred END(docker-202:1-263222-417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.985129480Z" level=debug msg="devmapper: deactivateDevice END(417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.985144162Z" level=debug msg="devmapper: UnmountDevice END(hash=417b59c1f7de2b59372f12a8b3b488a45f8eea39fee5201a1e37a05a48ebade4)" 
time="2017-07-13T18:30:12.990187397Z" level=warning msg="Container c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35 is not running" 
time="2017-07-13T18:30:12.990466755Z" level=error msg="Handler for GET /v1.17/containers/c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

We've enabled debug mode in our Docker daemon and collected logs via the ECS log collector. We'd really appreciate some pointers here! Let me know if (and where) you'd like me to send you the full log zip.

adnxn commented 7 years ago

hey @curena - given the logs you provided, we suspect the underlying issue is that docker is failing to stop the container. we can see this in the following log line: the agent receives a DockerTimeoutError and does not transition the task to stopped, which is the expected behavior.

2017-07-13T18:14:45Z [INFO] DockerTimeoutError for 'docker stop' of container; ignoring state change;  task: rcx-director-task-dev-us-west-2:422 arn:aws:ecs:us-west-2:969204706979:task/0878e08e-c28c-4337-bc94-d13eeda5abf3, Status: (RUNNING->STOPPED) Containers: [rcx-director-task-dev-us-west-2 (RUNNING->STOPPED),], container: rcx-director-task-dev-us-west-2(304872096477.dkr.ecr.us-west-2.amazonaws.com/krp/rcx-director:3.2467) (RUNNING->STOPPED), error: Could not transition to stopped; timed out after waiting 1m0s

the following line in the docker debug logs is interesting and again indicative of something strange going on with docker.

time="2017-07-13T18:30:12.990466755Z" level=error msg="Handler for GET /v1.17/containers/c0074caf7975201907fe25f0363cf945e71da7d19455aa6b3db973462a877c35/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

the following line is likely the result of docker being in an unstable state; we have seen this behavior in the past when docker fails to stop containers.

2017-07-13T18:29:53Z [INFO] Error retrieving stats for container b8f7c535b706d45a16e946c68c42b615927d8ab2fa0c8a1dd374d8bac712351b: inactivity time exceeded timeout

Let me know if (and where) you'd like me to send you the full log zip.

would you be able to repro this on a test stack and force a docker stack trace to be logged (see the note below)? you can send all this information to me directly at adnkha at amazon dot com. thanks!
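
for reference, one common way to get that stack trace (assuming the daemon process on your hosts is dockerd) is to send it SIGUSR1 and then check the daemon log for the goroutine dump or its location:

# Ask the Docker daemon to dump its goroutine stacks
sudo kill -USR1 "$(pidof dockerd)"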

curena commented 7 years ago

@adnxn I've sent you the requested items. Thanks in advance!

miskiw commented 7 years ago

I am also seeing this issue fairly consistently now with ECS agent v1.14.3. Docker v17.03.1-ce

I see the agent is able to pull the new image, but it doesn't start the new container and will not kill the older one.

rthrfrd commented 7 years ago

We're seeing this an awful lot! For services running >30 containers simultaneously per instance in the cluster, it may only take an hour or two before an instance becomes completely unusable and all newly started service-managed tasks are stuck in pending. Eventually there are no tasks running at all and the instance has to be terminated. We haven't seen any obvious errors in any logs, just a lot of calls to the docker API for non-existent containers.

Happy to provide access to a broken instance to someone at AWS if it helps debug the issue (it usually only takes an hour or two to create one! :wink:); this is becoming quite critical for us.

aaithal commented 7 years ago

@rthrfrd We're sorry that you're running into this issue. Can you please share the log files from the instance (instructions can be found here) with us? You can email them to aithal at amazon dot com. You can also check if you're running out of EBS Credits (here's an issue with instructions on how to do it).

Thanks, Anirudh

rthrfrd commented 7 years ago

@aaithal Thank you for the response! I'm sorry I missed that part; burst balance is indeed exhausted on that volume. We use quite a large gp2 volume at the moment to get the throughput, so I guess it's better to try dedicated IOPS. We've experimented with various storage drivers over the last few years, all with different issues, and now use overlay2 to avoid inode exhaustion. Thanks again.

aaithal commented 7 years ago

Hi @curena, apologies for getting back to you late about the logs you sent us. The docker daemon logs and the stack trace indicate some slowness/issues with the docker daemon and the splunk logging setup on the instance, because of which the Agent is unable to stop containers. I'm not sure whether the slowness in the daemon is related to the splunk issue, but ensuring that splunk is set up correctly and behaving the way it should would be a good place to start:

Sample splunk errors:

time="2017-07-12T20:37:57.301246818Z" level=error msg="Post https://splunk11.xxx.net:8088/services/collector/event/1.0: read tcp xx.xx.xx.xx:57504->yy.yy.yy.yy:8088: read: connection timed out" 
time="2017-07-12T20:37:57.301404478Z" level=error msg="Failed to send a message '{\"event\":{\"line\":{\"datetime\":\"2017-07-12 20:21:30.891\",\"message\":\"Unexpected error occurred in scheduled task.com.amazonaws.AbortedException: \\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:791)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)\\n\\tat com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511)\\n\\tat com.amazonaws.services.sqs.AmazonSQSClient.doInvoke(AmazonSQSClient.java:1740)\\n\\tat com.amazonaws.services.sqs.AmazonSQSClient.invoke(AmazonSQSClient.java:1716)\\n\\tat com.amazonaws.services.sqs.AmazonSQSClient.executeReceiveMessage(AmazonSQSClient.java:1380)\\n\\tat com.amazonaws.services.sqs.AmazonSQSClient.receiveMessage(AmazonSQSClient.java:1356)\\n\\tat com.kasasa.aws.sqs.SqsUtils.receiveMessages(SqsUtils.java:55)\\n\\tat com.bancvue.rcx.director.SourceFileSqsSubscriber.receiveMessages(SourceFileSqsSubscriber.java:55)\\n\\tat sun.reflect.GeneratedMethodAccessor131.invoke(Unknown Source)\\n\\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)\\n\\tat org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)\\n\\tat org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:81)\\n\\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\\n\\tat java.lang.Thread.run(Thread.java:748)\\nCaused by: com.amazonaws.http.timers.client.SdkInterruptedException: null\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:835)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:821)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1000)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)\\n\\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)\\n\\t... 22 common frames omitted\\n\",\"platform\":\"krp\",\"service\":\"rcx-director\"},\"source\":\"stdout\",\"tag\":\"rcx-director\"},\"time\":\"1499890890.892801\",\"host\":\"xx-xx-xx-xx-xx\"}'" 
time="2017-07-12T20:37:57.301578540Z" level=error msg="Failed to send a message '{\"event\":{\"line\":{\"datetime\":\"2017-07-12 20:21:30.908\",\"message\":\"Cleaning up transfer manager.\",\"platform\":\"krp\",\"service\":\"rcx-director\"},\"source\":\"stdout\",\"tag\":\"rcx-director\"},\"time\":\"1499890890.908681\",\"host\":\"xx-xx-xx-xx-xx\"}'" 

Create Container call that seems to be stuck (for 1002 minutes):

goroutine 2511902 [syscall, 1002 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x714d, 0xc420157800, 0x1000004, 0x0, 0x0, 0x7f8700000001, 0x0, 0xc4200255f8)
    /usr/lib/golang/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc420140cc0, 0xc4200256a8, 0x2, 0x2)
    /usr/lib/golang/src/os/wait_waitid.go:28 +0xbc
os.(*Process).wait(0xc420140cc0, 0xc420375260, 0x458c40, 0xc420025688)
    /usr/lib/golang/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc420140cc0, 0x415ec5, 0xc420394df0, 0x413669)
    /usr/lib/golang/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4203c8dc0, 0x180001, 0x0)
    /usr/lib/golang/src/os/exec/exec.go:434 +0x6d
github.com/docker/containerd/runtime.(*container).createCmd.func1.1(0xc4203bc680)
    /builddir/build/BUILD/docker-17.03.1-ce/containerd-422e31ce907fd9c3833a38d7b8fdd023e5a76e73/_build/src/github.com/docker/containerd/runtime/container.go:492 +0x62
created by github.com/docker/containerd/runtime.(*container).createCmd.func1
    /builddir/build/BUILD/docker-17.03.1-ce/containerd-422e31ce907fd9c3833a38d7b8fdd023e5a76e73/_build/src/github.com/docker/containerd/runtime/container.go:515 +0x3f

Docker socket file in broken state:

time="2017-07-13T16:02:35.354284088Z" level=error msg="Handler for GET /v1.17/containers/a7eae0db4e6d37223629251e3431476aa52c53404c4542751392c189ea267a2f/json returned error: write unix /var/run/docker.sock->@: write: broken pipe" 
2017-07-13 16:02:35.354317 I | http: multiple response.WriteHeader calls
time="2017-07-13T16:02:35.354345323Z" level=error msg="Handler for POST /v1.17/containers/a7eae0db4e6d37223629251e3431476aa52c53404c4542751392c189ea267a2f/stop returned error: Container a7eae0db4e6d37223629251e3431476aa52c53404c4542751392c189ea267a2f is already stopped" 
time="2017-07-13T16:02:35.354525370Z" level=error msg="Handler for GET /v1.17/containers/a7eae0db4e6d37223629251e3431476aa52c53404c4542751392c189ea267a2f/json returned error: write unix /var/run/docker.sock->@: write: broken pipe" 

Please let us know if that helps and if you're still seeing this issue. As per @adnxn's earlier comment, the inability of the docker daemon to stop the container, and the subsequent inability of the ECS Agent to stop the task, is leading to new tasks not being launched on this instance.

Thanks, Anirudh

benjyw commented 6 years ago

Hi, I'm seeing this behavior when updating a task definition without changing the number of tasks:

I see two deployments in the Deployments tab of the service, the existing ACTIVE one with desired count=1, pending count=0, running count=1, and the new PENDING one with desired count=1, pending count=0, running count=0. It stays in this state forever.

However if I first set the desired count to 0, the running count does go to 0 (indicating that the agent is generally capable of terminating my tasks), and I can then up the count to 1 with the new task definition.

This is relatively new behavior, and only happens on some services, not others. I can't put my finger on what the salient difference is.

Agent version is v1.14.3. I see nothing obviously relevant in the ecs or docker logs. Any more information I can provide? Thanks!

benjyw commented 6 years ago

Update: I am seeing lines like this in the ecs-agent logs:

`Status: (RUNNING->RUNNING) Containers: [foo (RUNNING->RUNNING),bar (RUNNING->RUNNING),]: foo(XXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/foo:tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING`

But they don't appear to be correlated with any of my deployment actions.

aaithal commented 6 years ago

I see two deployments in the Deployments tab of the service, the existing ACTIVE one with desired count=1, pending count=0, running count=1, and the new PENDING one with desired count=1, pending count=0, running count=0. It stays in this state forever.

However if I first set the desired count to 0, the running count does go to 0 (indicating that the agent is generally capable of terminating my tasks), and I can then up the count to 1 with the new task definition.

@benjyw it seems like you don't have enough resources in your cluster. Can you please confirm that you have enough capacity in your cluster and that deployments are stuck even in that case?

Thanks, Anirudh

benjyw commented 6 years ago

@aaithal I can confirm that there are more than 3X the required resources in the cluster.

But your question helped me figure out the issue: I had "Minimum healthy percent" set to 50% and "Maximum percent" set to 150%. That works fine when you have many tasks in the service, but when you're testing things out with just a single task, those percentages actually work against you, because 0 tasks is too few and 2 is too many... Lesson learned.

Thanks!

aaithal commented 6 years ago

@benjyw We're glad that you were able to resolve that! Feel free to reach out to us if you have other questions/comments. Thanks!

samuelkarp commented 6 years ago

We believe that https://github.com/aws/amazon-ecs-agent/pull/1036 should fix this. It still needs to go through code review, so I'm going to mark this for our next patch release after the upcoming minor release.

haynescw commented 6 years ago

Do you know when that patch is due to release?

praveen070290 commented 6 years ago

Same thing for me too: the status is stuck in PENDING. I am using Windows on AWS, and I am trying to deploy a .NET application to ECS.

**Following is the error log:**
2018-01-22T07:17:03Z [INFO] Loading configuration
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.17: version detection using MinAPIVersion: unsupported version: 1.17
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.18: version detection using MinAPIVersion: unsupported version: 1.18
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.19: version detection using MinAPIVersion: unsupported version: 1.19
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.20: version detection using MinAPIVersion: unsupported version: 1.20
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.21: version detection using MinAPIVersion: unsupported version: 1.21
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.22: version detection using MinAPIVersion: unsupported version: 1.22
2018-01-22T07:17:04Z [INFO] Unable to get Docker client for version 1.23: version detection using MinAPIVersion: unsupported version: 1.23
2018-01-22T07:17:06Z [INFO] Loading state! module="statemanager"
2018-01-22T07:17:06Z [INFO] Event stream ContainerChange start listening...
2018-01-22T07:17:06Z [INFO] Restored cluster 'cluster-test'
2018-01-22T07:17:07Z [INFO] Restored from checkpoint file. I am running as 'arn:aws:ecs:ap-southeast-1:667464071995:container-instance/9f934c10-4ff2-4195-9387-3ee23deedc26' in cluster 'cluster-test'
2018-01-22T07:17:08Z [INFO] Registered container instance with cluster!
2018-01-22T07:17:09Z [INFO] Saving state! module="statemanager"
2018-01-22T07:17:09Z [INFO] Beginning Polling for updates
2018-01-22T07:17:09Z [INFO] Waiting for any previous stops to complete module="TaskEngine" task="task-test-cluster1:1 arn:aws:ecs:ap-southeast-1:667464071995:task/bd579187-1017-436c-9bdc-4bb178e6be9a, TaskStatus: (NONE->RUNNING) Containers: [container-test (NONE->RUNNING),]" seqnum="2"
2018-01-22T07:17:09Z [INFO] Wait over; ready to move towards status: RUNNING module="TaskEngine" task="task-test-cluster1:1 arn:aws:ecs:ap-southeast-1:667464071995:task/bd579187-1017-436c-9bdc-4bb178e6be9a, TaskStatus: (NONE->RUNNING) Containers: [container-test (NONE->RUNNING),]"
2018-01-22T07:17:09Z [INFO] Pulling container container-test(667464071995.dkr.ecr.ap-southeast-1.amazonaws.com/test:1.0) (NONE->RUNNING) concurrently. Task: task-test-cluster1:1 arn:aws:ecs:ap-southeast-1:667464071995:task/bd579187-1017-436c-9bdc-4bb178e6be9a, TaskStatus: (NONE->RUNNING) Containers: [container-test (NONE->RUNNING),]
2018-01-22T07:17:09Z [INFO] Event stream DeregisterContainerInstance start listening...
2018-01-22T07:17:09Z [INFO] Initializing stats engine
2018-01-22T07:17:10Z [ERROR] Error running http api: listen tcp 127.0.0.1:51679: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
2018-01-22T07:17:10Z [ERROR] Error running http api module="Handlers" err="listen tcp :51678: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted."
2018-01-22T07:17:10Z [INFO] NO_PROXY set:XXXXXXX,XXXXXXXX2,//./pipe/docker_engine
2018-01-22T07:17:11Z [INFO] Connected to ACS endpoint
2018-01-22T07:17:11Z [INFO] Connected to TCS endpoint
2018-01-22T07:17:11Z [ERROR] Error running http api: listen tcp 127.0.0.1:51679: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
2018-01-22T07:17:13Z [INFO] ACS Websocket connection closed for a valid reason
2018-01-22T07:17:13Z [INFO] Connected to ACS endpoint
2018-01-22T07:17:14Z [ERROR] Error running http api: listen tcp 127.0.0.1:51679: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted. 

The same log keeps repeating. Can anyone help with this?

aaithal commented 6 years ago

Hi @praveen070290, we're sorry to hear that you're running into this issue. Tasks can remain in the PENDING state for a number of reasons. In this case, it seems to be because the ECS agent is failing during initialization, as it needs to bind to port 51679 on the instance and cannot. Are you running any additional scripts/processes which might be bound to that port, thus preventing the ECS agent from binding to it? If not, it might be an issue with how the ECS agent is being restarted. In either case, I'd encourage you to create a new issue with details of your environment (how the ECS agent is installed, which AMI you are using, which ECS agent version you are using, etc.). You can find more details about setting up a Windows container instance here. Please note that our instructions have changed recently as Windows ECS support moved out of beta into general availability.

Thanks, Anirudh

praveen070290 commented 6 years ago

Hi @aaithal, I have deleted my old instances and created a fresh one, checked the log console, and it is still showing the same thing. I am not using any extra scripts on it. I also noticed that the instance doesn't have a folder for EC2ConfigService and is missing the EC2ConfigService software. I am planning to install that and add CloudWatch to check the exact issue.

I checked the port availability and this is what it shows (screenshot attached).

praveen070290 commented 6 years ago

Hi @aaithal, I enabled all the logs, and in the Ec2Config logs I saw some errors as follows:


> 2018-01-23T06:15:36.384Z: Failed to fetch instance metadata http://169.254.169.254/latest/meta-data/elastic-gpus/associations with exception The remote server returned an error: (404) Not Found.
> 2018-01-23T06:15:36.393Z: Starting execution of script 'C:\Program Files\Amazon\Ec2ConfigService\Scripts\Send-FeatureStatus.ps1'.
> 2018-01-23T06:15:36.403Z: Ec2ElasticGpuSetup: Access to ElasticGPU metadata failed with error: The remote server returned an error: (404) Not Found.
> 2018-01-23T06:15:36.415Z: Ec2ElasticGpuSetup: ElasticGPUs not available on this instance

Based on these errors I found a solution, which is to run the following commands in PowerShell, and finally it works:

Start-Transcript -Path "C:\egpu_install.log" -Append
(new-object net.webclient).DownloadFile('http://ec2-elasticgpus.s3-website-us-east-1.amazonaws.com/latest', 'C:\egpu.msi')
Start-Process "msiexec.exe" -Wait -ArgumentList "/i C:\egpu.msi /qn /L*v C:\egpu_msi_install.log"
[Environment]::SetEnvironmentVariable("Path", $env:Path + ";C:\Program Files\Amazon\EC2ElasticGPUs\manager\", [EnvironmentVariableTarget]::Machine)
Restart-Computer -Force

praveen070290 commented 6 years ago

Hi @aaithal, I am facing the same port issue today and I don't know why this is happening. It would be great if you could help me out with some effective solutions...

For this port issue, I also made a registry edit to adjust the TCP TIME_WAIT delay, but that did not fix it either.

Error running http api: listen tcp 127.0.0.1:51679: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

aaithal commented 6 years ago

Hi @praveen070290, based on what you have posted I am not sure what your setup is or where you're getting this error from. The only related information I can offer is that the ECS agent binds to ports 51678 and 51679 on start-up. I'd also suggest getting in touch with AWS support if you continue running into issues. Thanks!

wivarn commented 6 years ago

Hi @aaithal, we're having the same problem. We're using an image based off of the amazon-ecs-optimized AMI for our instances. The only real differences are that our image is encrypted and we have 100 GB of storage instead of 22 GB. The ECS agent version is 1.17.1. With the default ECS_CONTAINER_STOP_TIMEOUT of 30 seconds, we usually have between 0 and 1 pending tasks in our cluster, and any new task starts fairly quickly. When we bump ECS_CONTAINER_STOP_TIMEOUT to 30 minutes, we have several pending tasks most of the time and new tasks take a long time to transition from pending to running. The new tasks aren't permanently stuck in pending, but they take an unreasonably long time to start.

The reason we need a high ECS_CONTAINER_STOP_TIMEOUT is that we have some tasks processing background jobs that can take a while. We've programmed these tasks to stop taking on new jobs when they receive a SIGTERM, but each task will still finish a job in progress and then gracefully terminate. All the other tasks in the cluster behave the same way. We have tested each task to ensure they do indeed gracefully terminate as soon as possible.

I can rule out some of the causes already posted in this issue report.

It also doesn't seem to matter how the task is started. We have tasks that are started with the "Force new deployment" checkbox in the service UI, scheduled tasks, and manually started tasks.

I've just turned on debug level logging on the ec2 instance. I can send you these logs if they help.

aaithal commented 6 years ago

Hi @Wivarn, setting ECS_CONTAINER_STOP_TIMEOUT to something like 30m means that there'll be a 30m delay between sending SIGTERM and SIGKILL (see the documentation for details). The ECS agent waits for all STOPPING tasks to be STOPPED before starting new ones, which is the likely reason you're seeing this delay. We have an open feature request for setting this on a per-task basis as well: #855. We're also exploring whether we can avoid doing this in the ECS agent altogether with some changes to the ECS backend. If you really do need this 30m grace period, you could also look at separating the other tasks and placing them on instances where this setting is not set.
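
For example, the longer grace period could live only in /etc/ecs/ecs.config on the instances dedicated to the long-running background jobs (a sketch; the cluster name here is illustrative):

# /etc/ecs/ecs.config on the background-job instances only
ECS_CLUSTER=background-jobs
ECS_CONTAINER_STOP_TIMEOUT=30m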

I'd also encourage you to create a new issue for any subsequent discussion as this issue has become a catch-all thread for multiple issues.

neilpalima commented 4 years ago

I encountered this issue. I have 4 tasks that together allocate all of the CPU and ~90% of the memory of the cluster. I stopped all of these tasks. The tasks were stopped, but my new task was stuck in PENDING.

It became RUNNING when I rebooted the EC2 cluster instance.