Closed eedwards-sk closed 3 years ago
@eedwards-sk Is it possible that Docker logged out on the instance? Do you have Docker logs from the hosts on which this happened? Also, can you email me the date/time of when the issue occurred, at shugy at amazon dot com.
@shubham2892 Thanks for your response.
Unfortunately I don't persist docker agent logs from the hosts to any external store, so any of that detail was lost the moment I re-created the hosts (prior to opening the issue).
During investigation I did check the logs (syslog and the docker daemon as well as ecs-agent) and nothing obvious was surfaced, but I was also focused on recovering services more than the root cause.
Considering this was happening at the same time across multiple hosts, I'm not sure it was an issue specific to one host.
For the time / date, the closest I can approximate is when the replacement hosts came online, which was July 18th 6:56 PM CDT (July 18th 11:56 PM UTC)
@shubham2892 I also ran into this issue. Without any change to the cluster instances, pulls began to fail across all instances with the exact same error reported above. These instances were online and pulling images fine for just over a month.
We are running the containerized agent (1.29.1) on our cluster instances. The issue can be resolved by restarting the agent container on each instance.
I will try to see whether I can reproduce the issue on my side. However, @pasanw, do you have agent debug-level logs from the time period when the issue occurred? If you still have them, could you send the logs to yumex at amazon dot com? It would help a lot for debugging the issue.
Thanks @yumex93, I've sent you logs via e-mail. Unfortunately the log level was INFO, hope it can still be of use. Let me know if it successfully reached your inbox.
@pasanw Got your email. Thanks!
@pasanw Sorry. The log did not provide enough information to help me figure out whether this is a credential issue, a Docker issue, or an issue on the ECR side. Just to collect more information: before the issue happened, did you update your image on ECR? Also, would you be able to turn on debug-level logging and use the ECS log collector to gather logs and send them to us when it happens again? That will give us more info to dig into the problem.
@yumex93 unfortunately this issue does not recur often enough to make enabling debug logging all the time feasible (it hasn't yet recurred for us).
When the issue occurred for us, we had just pushed a new image, task definition, and service, and ECS was updating it across some hosts.
@yumex93 I'm also having issues with this. I had to recreate the instance to get things going again, and it just happened one more time. There are a few messages from docker that look like this:
Aug 29 19:30:44 ip-xx-xx-xx-xx dockerd: time="2019-08-29T19:30:44.909653497Z" level=error msg="Not continuing with pull after error: denied: The security token included in the request is invalid."
I ran the ECS log collector, though, before enabling debug mode. Can I send the results to you as well?
I've enabled debug mode on this instance and will report back with new logs if this happens again, but the log collector had to restart Docker to enable debug mode and things seem to be working as expected now. For reference, this instance is a c5.large with ENI trunking enabled, using AMI ami-02bf9e90a6e30dc74 in eu-west-1.
@fcoelho please feel free to send it to either @yumex93 or myself (yhlee at amazon dot com).
@fcoelho thank you for sending the logs, I've replied to your email!
We've also encountered this multiple times with ECS Agent 1.32.0 running on Amazon Linux AMI 2.0.20191014 x86_64 ECS HVM GP2 in us-east-1.
We've also seen it happen a few times, though we're just now getting back to the projects involving ECS, so it's becoming more of a concern.
Is the workaround to restart the agent on each EC2 instance? Is there any more public information about how or why this is happening?
I'd obviously love to do nothing rather than have to worry about restarting the agent.
Hi, sorry you are facing this issue. Our speculation: when a CFN/Terraform stack is used to delete and recreate resources, the task execution role is deleted and recreated with the same role ARN. When the agent caches the credentials for a role, it uses the combination of region, roleARN, registryID, and endpointOverride as the cache key, which means that in this case the agent will keep using the credentials from the cache rather than the credentials of the newly created role. This is a known issue and we will work on it in the future.
For now, as a workaround, if such an error occurs we suggest you manually restart the agent on the instance after deleting the stack, which will clear the cache, and then recreate the CFN/Terraform stack. Another workaround is to not specify an explicit execution role name when creating the role, so that the recreated role gets a different name.
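For reference, a sketch of the restart workaround (the commands assume the Amazon Linux ECS-optimized AMI, or the containerized agent as some commenters above run; adjust to your setup):

```shell
# Clear the agent's cached execution-role credentials by restarting it.
# On the Amazon Linux ECS-optimized AMI the agent is a systemd service:
sudo systemctl restart ecs

# If you run the containerized agent instead, restart its container:
docker restart ecs-agent
```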
Thanks
@cyastella Thank you so much!
@cyastella Thank you!! You saved my life
Note: if you face this issue and are willing to help out here with more logs, please use the ECS logs collector and send the results to ecs-agent-external at amazon dot com.
We are actively looking into clarifying the root cause and repro of this scenario.
I am just going to add to the workaround described by @cyastella: if you have a named exec role and prefer to keep it that way, then even a slight change in the name will also work; for example, you can append v2 to the role name.
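As a sketch of that role-renaming workaround, assuming a stack template that parameterizes the execution role name (ExecRoleName is a hypothetical parameter here; adapt it to your own template):

```shell
# Hypothetical example: pass a versioned role name so the recreated role
# gets a new ARN, and the agent's cached credentials for the old ARN
# can no longer be confused with it.
# "ExecRoleName" is an assumed template parameter -- adapt to your stack.
aws cloudformation create-stack \
  --stack-name my-ecs-service \
  --template-body file://./my-ecs-service.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters 'ParameterKey=ExecRoleName,ParameterValue=my-task-exec-role-v2'
```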
I'm not sure if you still have problems reproducing this. However, I think the way is to simply
The second and third steps are what I always did to make sure that the stack configuration was correct (I'm a beginner with CFN), and that is what then caused the issues when ECS tried to deploy to EC2 instances that already had the old role's credentials cached. The pattern that only some instances were affected was actually what I noticed first.
@fierlion Sent you some logs to that specified email address.
Conditions/workflow for reproducing:
Something I noticed in the ECS UI in my current working example is that the task actually does pull the image occasionally, e.g. state == "RUNNING"; however, the error still shows up in the Status Reason:
CannotPullContainerError: Error response from daemon: pull access denied for $REDACTED_REGISTRY_ID, repository does not exist or may require 'docker login': denied: The security token included in the request is in
It is unclear to me why it occasionally manages to pull the image and reach the RUNNING state while still displaying the CannotPullContainerError, but the majority of the time I see this error we are not able to pull the image at all (perhaps the occasional success happens while the token from the original role is still actually valid? In the majority of cases I have seen, the token for the old, deleted role has likely expired; I am attempting to replicate that case currently and will send it separately as a follow-up).
Our workaround for now will be to separate out the IAM roles into a separate stack to prevent them from being destroyed on an ecs-service stack delete/rollback.
@kyleian For us, the reason it managed to start the task was that the image already existed on the instance. So it looks as if it still uses the existing image but tries to pull the latest one anyway, and hence the error. At least that's our assumption.
@awons I imagine that if you were to test with ECS_IMAGE_PULL_BEHAVIOR=always it would force the image to be pulled, and likely force the behavior (while default will try to pull first and fall back to the cache, and prefer-cached would do the opposite): https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
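For reference, a minimal sketch of applying that setting on an instance (the default config path on the ECS-optimized AMI is /etc/ecs/ecs.config; the agent must be restarted to pick it up):

```shell
# Append the pull behavior to the agent config:
echo "ECS_IMAGE_PULL_BEHAVIOR=always" | sudo tee -a /etc/ecs/ecs.config

# Restart the agent so the new setting takes effect:
sudo systemctl restart ecs
```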
Note, I'm working on a repro now based on the above steps (thanks all).
I am able to reproduce this -- I've attached my template below as cloudformation.txt (.yml upload not allowed)
repro steps:
Create a generic VPC using the console's VPC wizard. This will have a public and a private subnet; we'll use the public subnet for this test. Create an ECS cluster and launch an EC2 instance using the latest ECS-optimized AMI in the above VPC's public subnet.
Update the ECS config on the EC2 instance: add ECS_IMAGE_PULL_BEHAVIOR=always.
Also, you'll need to create an ECR busybox repo and push the latest busybox image to it. Replace the image in the supplied taskdef with your busybox ECR image.
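The ECR setup step might look something like this (account ID and region are placeholders; `aws ecr get-login-password` requires a recent AWS CLI):

```shell
# Placeholder values -- substitute your own account and region.
ACCOUNT=123456789012
REGION=us-east-1

# Create the busybox repo and authenticate Docker to ECR:
aws ecr create-repository --repository-name busybox --region "$REGION"
aws ecr get-login-password --region "$REGION" |
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# Pull the public busybox image, tag it for ECR, and push it:
docker pull busybox:latest
docker tag busybox:latest "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/busybox:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/busybox:latest"
```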
The supplied cloud formation stack template will create the task/service/log group but more importantly will create the task IAM role. When you tear down and re-create the stack, if you don't update the task IAM role name, then you'll hit the error.
Run the following command to create your service/task/etc:
$ aws cloudformation create-stack --stack-name ecs-shell --template-body file://./ecs-shell.yml --capabilities CAPABILITY_NAMED_IAM --parameters 'ParameterKey=SubnetID,ParameterValue=<subnet-id>' 'ParameterKey=Cluster,ParameterValue=<cluster-id>' 'ParameterKey=SecurityGroup,ParameterValue=<subnet-security-group>'
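To then hit the error, tear the stack down and recreate it without changing the role name; a sketch of that cycle:

```shell
# Delete the stack and wait for deletion to finish:
aws cloudformation delete-stack --stack-name ecs-shell
aws cloudformation wait stack-delete-complete --stack-name ecs-shell

# Recreate it with the SAME parameters (and therefore the same IAM role
# name); tasks placed on the still-running instance should now fail to
# pull from ECR until the agent is restarted:
aws cloudformation create-stack --stack-name ecs-shell \
  --template-body file://./ecs-shell.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters 'ParameterKey=SubnetID,ParameterValue=<subnet-id>' \
    'ParameterKey=Cluster,ParameterValue=<cluster-id>' \
    'ParameterKey=SecurityGroup,ParameterValue=<subnet-security-group>'
```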
I'm now looking into what's happening with the task IAM credentials expiration between stack delete and create.
Upon further investigation, we have come to know that when an IAM role is deleted and recreated with the same name, the EC2 instance associated with the role will no longer be able to use the permissions granted through that role; this is the expected behavior (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_delete.html). As mentioned in the comments above, it is recommended that each time you launch a CloudFormation template, you give your IAM role(s) a unique name.
Upon further investigation, we have come to know that when an IAM role is deleted and recreated with the same name, the EC2 instance associated with the role will no longer be able to use the permissions granted through that role; this is the expected behavior
@shubham2892 how is this accurate when we can simply restart the ecs-agent to fix the issue?
Since currently when agent caches the credentials of the role, it uses the combination of region, roleARN, registryID, endpointOverride as the key. Which means, as such case, agent will use the credentials from the cache rather than the credentials of the newly created role. This is a known issue and we will work on it in the future.
The above explanation was given to us quite a while ago; is it no longer accurate?
Everything points to this being an issue with the ecs-agent caching IAM permissions, not with EC2/IAM. I'm happy to be corrected if I'm wrong here.
For anyone else finding this: if one alters the iam role on an ec2 instance such that the old role no longer exists, one MUST refresh the ec2 instances or restart the ecs-agent. The ecs-agent does not appear to automatically discover that a new role is being used on an existing ec2 instance.
If one runs into this using CloudFormation, note that the AWS::IAM::InstanceProfile resource type's Roles parameter updates without replacement. So if there is some other resource receiving a property from that resource type that "on update requires replacement" (Launch Configurations) or "on update performs an instance refresh" (Launch Templates), one will need to update a property on the AWS::IAM::InstanceProfile resource that on update requires replacement at the same time to get the desired rotation effect. Note that all the other properties of AWS::IAM::InstanceProfile trigger a physical resource replacement; just not Roles.
Hi team, is there any update on the fix for this issue? I'm using the latest ECS agent version 1.69.0 and still facing this issue.
Still facing the same problem, ECS Agent v1.86.3
Please do something!
Summary
After pushing a new ECR image and attempting to deploy, instances started failing to pull the image from ECR, even with no permissions changes.
Description
Status reason | CannotPullContainerError: Error response from daemon: pull access denied for FOO.dkr.ecr.us-east-1.amazonaws.com/foo/app, repository does not exist or may require 'docker login'
3 different ECS services, across 2 different hosts, all started showing this error upon attempting to launch new tasks during a deployment.
I confirmed I could pull the image locally on my own workstation when performing a docker login to ECR. No permissions changes were made to the instance.
The ecs tasks all have task execution roles that have the proper ECS Role Policy attached.
The ecs hosts themselves have the proper ECS permissions in their instance roles.
After re-creating the hosts, they pulled fine.
I'm trying to figure out what state the instance got into and how to resolve it in the future. It should not be failing to pull ECR images.
I'm opening this to start it as a point for tracking these issues in the future, as extensive searching did not surface anyone else having this same issue.
Environment Details
Ubuntu Bionic 18.04, ECS agent 1.29.0