aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

1.29.0 CannotPullContainerError pull access denied requiring multiple host destruction #2137

Closed · eedwards-sk closed this issue 3 years ago

eedwards-sk commented 5 years ago

Summary

After pushing a new image to ECR and attempting to deploy, instances started failing to pull the image from ECR, even with no permissions changes.

Description

Status reason | CannotPullContainerError: Error response from daemon: pull access denied for FOO.dkr.ecr.us-east-1.amazonaws.com/foo/app, repository does not exist or may require 'docker login'

3 different ECS services, across 2 different hosts, all started showing this error upon attempting to launch new tasks during a deployment.

I confirmed I could pull the image locally on my own workstation when performing a docker login to ECR.
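
For reference, a minimal sketch of that local check, assuming the AWS CLI ECR login flow and the redacted registry/region from the error above (the :latest tag is an assumption):

  $ aws ecr get-login-password --region us-east-1 \
      | docker login --username AWS --password-stdin FOO.dkr.ecr.us-east-1.amazonaws.com
  $ docker pull FOO.dkr.ecr.us-east-1.amazonaws.com/foo/app:latest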

No permissions changes were made to the instance.

The ECS tasks all have task execution roles with the proper ECS role policy attached.

The ECS hosts themselves have the proper ECS permissions in their instance roles.

After re-creating the hosts, they pulled fine.

I'm trying to figure out what state the instance got into and how to resolve it in the future. It should not be failing to pull ECR images.

I'm opening this as a point for tracking these issues in the future, as extensive searching did not surface anyone else having this same issue.

Environment Details

Ubuntu 18.04 (Bionic), ECS agent 1.29.0

shubham2892 commented 5 years ago

@eedwards-sk Is it possible that Docker logged out on the instance? Do you have Docker logs from the hosts on which this happened? Also, can you email me the date/time of when the issue occurred at shugy at amazon dot com?
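
For anyone collecting these later, a hedged sketch of where such logs usually live on a systemd host (the date range below is a placeholder):

  $ journalctl -u docker --since "2019-07-18" --until "2019-07-19" > docker-daemon.log
  $ sudo cat /var/log/ecs/ecs-agent.log*      # default ECS agent log location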

eedwards-sk commented 5 years ago

@shubham2892 Thanks for your response.

Unfortunately I don't persist docker agent logs from the hosts to any external store, so any of that detail was lost the moment I re-created the hosts (prior to opening the issue).

During investigation I did check the logs (syslog and the docker daemon as well as ecs-agent) and nothing obvious was surfaced, but I was also focused on recovering services more than the root cause.

Considering this was happening at the same time across multiple hosts, I'm not sure it was an issue specific to one host.

For the time / date, the closest I can approximate is when the replacement hosts came online, which was July 18th 6:56 PM CDT (July 18th 11:56 PM UTC)

pasanw commented 5 years ago

@shubham2892 I also ran into this issue. Without any change to the cluster instances, pulls began to fail across all instances with the exact same error reported above. These instances were online and pulling images fine for just over a month.

We are running the containerized agent (1.29.1) on our cluster instances. The issue can be resolved by restarting the agent container on each instance.
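
For anyone looking for the exact commands, a hedged sketch of that restart, assuming the agent container uses the default name ecs-agent:

  $ docker restart ecs-agent       # containerized agent, as used here
  $ sudo systemctl restart ecs     # on the ECS-optimized AMI, where ecs-init manages the agent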

yumex93 commented 5 years ago

I will try to see whether I can reproduce the issue on my side. However, @pasanw, do you have agent debug-level logs from the time period when the issue occurred? If you still have them, could you send the logs to yumex at amazon dot com? It would help a lot in debugging the issue.

pasanw commented 5 years ago

Thanks @yumex93, I've sent you logs via e-mail. Unfortunately the log level was INFO, hope it can still be of use. Let me know if it successfully reached your inbox.

yumex93 commented 5 years ago

@pasanw Got your email. Thanks!

yumex93 commented 5 years ago

@pasanw Sorry, the log did not provide enough information to help me figure out whether this is a credential issue, a Docker issue, or an issue on the ECR side. Just to collect more information: before the issue happened, did you update your image on ECR? Also, would you be able to turn on debug-level logging and use the ECS logs collector to collect logs and send them to us when it happens again, as sketched below? That will give us more information to dig into the problem.
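
A sketch of those two steps, assuming the default /etc/ecs/ecs.config location and the awslabs ecs-logs-collector script:

  # turn on debug-level agent logging
  $ echo "ECS_LOGLEVEL=debug" | sudo tee -a /etc/ecs/ecs.config
  $ sudo systemctl restart ecs               # or: docker restart ecs-agent
  # collect logs when the issue recurs
  $ curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
  $ sudo bash ecs-logs-collector.sh          # bundles agent/Docker/system logs into an archive to send along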

eedwards-sk commented 5 years ago

@yumex93 Unfortunately this issue does not recur often enough to make enabling debug logging all the time feasible (it hasn't yet recurred for us).

When the issue occurred for us, we had just pushed a new image, task definition, and service, and ECS was updating them across some hosts.

fcoelho commented 5 years ago

@yumex93 I'm also having issues with this. I had to recreate the instance to get things going again, and this just happened one more time. There are a few messages from Docker that look like this:

Aug 29 19:30:44 ip-xx-xx-xx-xx dockerd: time="2019-08-29T19:30:44.909653497Z" level=error msg="Not continuing with pull after error: denied: The security token included in the request is invalid."

I ran the ECS logs collector program, though that was before enabling debug mode. Can I send the results to you as well?

I've enabled debug mode on this instance and will report back with new logs if this happens again, but the log collector had to restart Docker to enable debug mode and things seem to be working as expected now. For reference, this instance is a c5.large with ENI trunking enabled, using AMI ami-02bf9e90a6e30dc74 in eu-west-1.

yhlee-aws commented 5 years ago

@fcoelho please feel free to send it to either @yumex93 or myself (yhlee at amazon dot com).

yhlee-aws commented 5 years ago

@fcoelho thank you for sending the logs, I've replied to your email!

radamisa commented 5 years ago

We've also encountered this multiple times with ECS Agent 1.32.0 running on Amazon Linux AMI 2.0.20191014 x86_64 ECS HVM GP2 in us-east-1.

hlarsen commented 5 years ago

We've also seen it happen a few times, though we're just now getting back to the projects involving ECS, so it's becoming more of a concern.

Is the workaround to restart the agent on each EC2 instance? Is there any more public information about how or why this is happening?

I'd obviously love to do nothing rather than have to worry about restarting the agent.

cyastella commented 5 years ago

Hi, sorry you're facing this issue. Our speculation is that when a CFN/Terraform stack is used to delete and recreate resources, the task execution role is deleted and recreated with the same role ARN. Currently, when the agent caches the credentials of the role, it uses the combination of region, roleARN, registryID, and endpointOverride as the cache key. This means that in such a case, the agent will use the credentials from the cache rather than the credentials of the newly created role. This is a known issue and we will work on it in the future.

For now, as a workaround, if such an error occurs we suggest you manually restart the agent on the instance after deleting the stack, which will clear the cache, and then recreate the CFN/Terraform stack (a sketch of this sequence is below). Another workaround is to not specify the execution role name when creating the role.
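
For completeness, a hedged sketch of that ordering (the stack name is a placeholder; the restart command depends on how the agent is run):

  $ aws cloudformation delete-stack --stack-name my-ecs-service   # deletes the execution role along with the stack
  # on each container instance, restart the agent so it drops the cached role credentials
  $ sudo systemctl restart ecs          # ECS-optimized AMI
  $ docker restart ecs-agent            # containerized agent
  # then re-create the stack
  $ aws cloudformation create-stack --stack-name my-ecs-service --template-body file://./template.yml --capabilities CAPABILITY_NAMED_IAM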

Thanks

mikalai-t commented 4 years ago

@cyastella Thank you so much!

jungtaek-raphael commented 4 years ago

@cyastella Thank you!! You saved my life

fierlion commented 4 years ago

Note: if you face this issue and are willing to help out here with more logs, please use the ECS logs collector and send the output to ecs-agent-external at amazon dot com.

We are actively looking into clarifying the root cause and repro of this scenario.

dzmitry-kankalovich commented 4 years ago

I am just going to add to the workaround described by @cyastella: if you have a named execution role and prefer to keep it that way, then even a slight change in the name also works; for example, you can append v2 to the role name.

UlrichEckhardt commented 3 years ago

I'm not sure if you still have problems reproducing this. However, I think the way is to simply create a CloudFormation stack that contains the task execution role, then delete the stack and re-create it with the same role name.

The second and third steps (deleting and re-creating the stack) are what I always did to make sure that the stack configuration was proper (I'm a beginner with CFN), and that is what then caused the issues when ECS tried to deploy on EC2 instances that already had the old role credentials cached. The pattern that only some instances were affected was actually what I noticed first.

kyleian commented 3 years ago

@fierlion Sent you some logs to that specified email address.

Conditions/workflow for reproducing:

Something I noticed in the ECS UI in my current working example is that the task actually does pull the image occasionally, e.g. state == "RUNNING"; however, the error is still manifested in the Status Reason:

CannotPullContainerError: Error response from daemon: pull access denied for $REDACTED_REGISTRY_ID, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid

It is unclear to me why it occasionally manages to pull the image and reach the RUNNING state while still displaying the CannotPullContainerError, whereas the majority of the time we're not able to pull the image when that error is displayed (perhaps that happens within the window where the token from the original role is still actually valid? In most of the cases I have seen, the token for the old role that was deleted is likely already expired; I am attempting to replicate that case currently and will send it separately as a follow-up).

Our workaround for now will be to separate out the IAM roles into a separate stack to prevent them from being destroyed on an ecs-service stack delete/rollback.

awons commented 3 years ago

@kyleian For us, the reason it managed to start the task was that the image already existed on the instance. So it looks as if it still uses the existing image but tries to pull the latest one anyway, hence the error. At least that's our assumption.

fierlion commented 3 years ago

@awons I imagine that if you were to test with ECS_IMAGE_PULL_BEHAVIOR=always it would force the image to be pulled, and likely force the behavior (while default tries to pull first and falls back to the cache, and prefer-cached does the opposite); see the sketch below. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
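
For reference, a minimal sketch of setting that behavior on an instance, assuming the default /etc/ecs/ecs.config location:

  $ echo "ECS_IMAGE_PULL_BEHAVIOR=always" | sudo tee -a /etc/ecs/ecs.config
  $ sudo systemctl restart ecs     # restart the agent so the new setting takes effect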

Note, I'm working on a repro now based on the above steps (thanks all).

fierlion commented 3 years ago

I am able to reproduce this -- I've attached my template below as cloudformation.txt (.yml upload not allowed)

repro steps:

Create a generic VPC using the console's VPC wizard. This will have a public and a private subnet; we'll use the public subnet for this test. Create an ECS cluster and launch an EC2 instance using the latest ECS-optimized AMI in the above VPC's public subnet.

Update the ECS agent config on the EC2 instance to add ECS_IMAGE_PULL_BEHAVIOR=always.

Also, you'll need to create an ECR busybox repo and upload the latest busybox image to it (a sketch of these steps is below). Replace the image in the supplied task definition with your busybox ECR image.
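
A hedged sketch of those ECR steps (the account ID and region below are placeholders):

  $ aws ecr create-repository --repository-name busybox
  $ aws ecr get-login-password --region us-east-1 \
      | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
  $ docker pull busybox:latest
  $ docker tag busybox:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/busybox:latest
  $ docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/busybox:latest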

The supplied CloudFormation stack template will create the task/service/log group, but more importantly it will create the task IAM role. When you tear down and re-create the stack without updating the task IAM role name, you'll hit the error.

Run the following command to create your service/task/etc:

  $ aws cloudformation create-stack --stack-name ecs-shell --template-body file://./ecs-shell.yml --capabilities CAPABILITY_NAMED_IAM --parameters 'ParameterKey=SubnetID,ParameterValue=<subnet-id>' 'ParameterKey=Cluster,ParameterValue=<cluster-id>' 'ParameterKey=SecurityGroup,ParameterValue=<subnet-security-group>'

cloudformation.txt

I'm now looking into what's happening with the task IAM credentials expiration between stack delete and create.

shubham2892 commented 3 years ago

Upon further investigation we have learned that when an IAM role is deleted and recreated with the same name, the EC2 instance associated with the role will no longer be able to use the permissions granted through that role; this is the expected behavior (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_delete.html). As mentioned in the comments above, it is recommended that each time you launch a CloudFormation template, you give your IAM role(s) a unique name, for example as sketched below.
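
One hedged way to do that from the CLI, assuming the template exposes a (hypothetical) ExecutionRoleName parameter; alternatively, omit the role name in the template so CloudFormation generates a unique one:

  $ aws cloudformation create-stack --stack-name ecs-shell \
      --template-body file://./ecs-shell.yml \
      --capabilities CAPABILITY_NAMED_IAM \
      --parameters "ParameterKey=ExecutionRoleName,ParameterValue=ecs-exec-role-$(date +%s)"   # unique name per launch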

hlarsen commented 3 years ago

Upon further investigation we have learned that when an IAM role is deleted and recreated with the same name, the EC2 instance associated with the role will no longer be able to use the permissions granted through that role; this is the expected behavior

@shubham2892 How is this accurate when we can simply restart the ecs-agent to fix the issue?

Currently, when the agent caches the credentials of the role, it uses the combination of region, roleARN, registryID, and endpointOverride as the cache key. This means that in such a case, the agent will use the credentials from the cache rather than the credentials of the newly created role. This is a known issue and we will work on it in the future.

The above explanation was given to us quite a while ago; is it no longer accurate?

Everything points to this being an issue with the ecs-agent caching IAM permissions, not with EC2/IAM. I'm happy to be corrected if I'm wrong here.

jpcope commented 1 year ago

For anyone else finding this: if one alters the IAM role on an EC2 instance such that the old role no longer exists, one MUST refresh the EC2 instances or restart the ecs-agent. The ecs-agent does not appear to automatically discover that a new role is being used on an existing EC2 instance.

If one runs into this using CloudFormation, note that the AWS::IAM::InstanceProfile resource type's Roles property updates without replacement. So if there is some other resource receiving a property from that resource type that "on update requires replacement" (launch configurations) or "on update performs an instance refresh" (launch templates), one will need to simultaneously update a property on the AWS::IAM::InstanceProfile resource that requires replacement on update to get the desired rotation effect (see the sketch below). Note that all the other properties of AWS::IAM::InstanceProfile trigger a physical resource replacement, just not Roles.
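
A hedged sketch of the two remediation options mentioned above (the Auto Scaling group name is a placeholder):

  $ sudo systemctl restart ecs     # restart the agent on each instance so cached credentials are dropped
  $ aws autoscaling start-instance-refresh --auto-scaling-group-name my-ecs-asg   # or cycle the instances themselves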

shubmanh commented 1 year ago

Hi team, is there any update on the fix for this issue?

I'm using the latest ECS agent version 1.69.0 and still facing this issue.

andrey-autofi commented 2 weeks ago

Still facing the same problem with ECS agent v1.86.3. Please do something!