Open dsalamancaMS opened 3 years ago
Adding a note to say that in some contexts, this truncates literally the most important parts of an error. For example:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 1 time(s): failed to fetch secret arn:aws:secretsmanager:us-east-1:xxxxxxxxxxx...
Something similar happened to me recently:
CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...
These truncated messages are very annoying since the most useful part of the reason is not there. AWS support was also not able to figure out the reason. I am still looking for why my tasks keep getting STOPPED.
Any update on this ticket? We are also running into this. It makes it very hard to debug operational issues in Fargate.
This is very annoying, I have like 20 secrets and I don't know which one failed to be pulled.
+1. Running across this very same issue, and it's frustrating not being able to tell what just caused my task to fail.
Any update on this ticket? We are running into the same issue where the stopped reason is truncated, which makes it very difficult to investigate exactly what caused the task to stop...
Running into this same issue... It was temporarily solved after I deleted all images in ECR and recreated one yesterday. However, it happened again today. This is so frustrating...
In case this helps anyone else: in my case I knew my container had outbound internet access (so the issue wasn't subnet/IGW related), which meant it could only be something wrong with the IAM policy I'd used for my task.
I double-checked my secret ARN and realised that AWS had helpfully appended a '-aBcD' string onto the end of my secret name (I'd assumed the ARN would just end with the secret name I'd specified...), so I updated my policy and it's working fine.
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): InvalidParameterException: Invalid parameter at 'registryIds' fail...
What does it mean? Can somebody please advise? I have been facing this issue for the last 3 days. Before, it was working fine in all 13 clusters with the same configuration; then I tried creating one more cluster on Fargate 1.4.0 and this issue came up. Now 8 clusters are showing this error. I have tried everything on the internet but the issue remains. Any lead please...
Hi Kunal, were you able to resolve this error? I am also facing the same, and am clueless on how to resolve this.
Hi @carthic1 @kunalsawhney, I faced the same issue:
The main reason in my case was the size of the Docker image my Fargate task was trying to pull.
My image is really big (21+ GB) and the default storage limit of a Fargate task is 20 GB. Looking through the AWS docs I found the EphemeralStorage parameter of an ECS task definition; adding a considerably larger size solved the issue:
EphemeralStorage:
  SizeInGiB: 30
I hope this helps you
References: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-ephemeralstorage.html https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-task-storage.html
@carthic1 the issue is due to lack of storage in the ECS tasks. AWS has launched a new feature that lets you attach ephemeral storage of up to 200 GB to your tasks. You can use this capability to increase your task storage.
@carthic1 it was a storage problem for me too, and I solved it as @marcoaal and @kunalsawhney mentioned. If you are using aws-copilot, just add these two lines to the task's manifest.yml:
storage:
  ephemeral: 35
The 35 can be raised up to 200 GB of ephemeral storage.
Is there anywhere in all of AWS that doesn't truncate this message? Can I run a CLI command, for instance, that will pull the full message? It's really an absurdly short character limit for the purpose.
@dezren39 no, it's truncated everywhere. it's pretty absurd
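For anyone scripting around this: the relevant (possibly still truncated) strings live in both the task-level stoppedReason and the per-container reason fields of the DescribeTasks response. A small sketch that collects them from an already-parsed response and flags values that look clipped; the function name and the length heuristic are my own, not an AWS API:

```python
# Field-length caps mentioned in this thread: 255 historically, 1024 after the fix.
SUSPECT_LENGTHS = {255, 1024}

def collect_task_errors(describe_tasks_response):
    """Gather stoppedReason and containers[].reason strings from a parsed
    ecs:DescribeTasks response, flagging values that look truncated."""
    findings = []
    for task in describe_tasks_response.get("tasks", []):
        fields = [("stoppedReason", task.get("stoppedReason", ""))]
        for container in task.get("containers", []):
            fields.append(("containers[].reason", container.get("reason", "")))
        for field, text in fields:
            if text:
                findings.append({
                    "field": field,
                    "reason": text,
                    # A reason exactly at a cap, or ending in "...", was likely clipped.
                    "maybe_truncated": len(text) in SUSPECT_LENGTHS or text.endswith("..."),
                })
    return findings
```

Feed it the JSON from `aws ecs describe-tasks --cluster <cluster> --tasks <task>` parsed with json.loads; anything flagged is a candidate for asking support for the untruncated text.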
Hi,
I am also getting this error: CannotPullContainerError: containerd: pull command failed: time="2021-08-20T13:58:49Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:e246d4b4c5f108af0f72da900f45ae9a37e1d184d8d605ab4117293b6990b7b8: write /var... Network
I am trying to add ephemeral storage using console in task definition by clicking on "Configure via json", and adding the below lines: "ephemeralStorage": { "sizeInGiB": "25" }
but now getting the error: Should only contain "family", "containerDefinitions", "volumes", "taskRoleArn", "networkMode", "requiresCompatibilities", "cpu", "memory", "inferenceAccelerators", "executionRoleArn", "pidMode", "ipcMode", "proxyConfiguration", "tags", "placementConstraints"
I can't add extra storage now. Can anyone help me here ?
Hi @marcoaal @kunalsawhney @mohamedFaris47, how were you able to resolve the issue? Can you please help me resolve the issue above?
Hi @adesgautam,
The extra ephemeral storage for Fargate tasks cannot be configured through the console.
You can refer to this doc for details on what all options are supported: https://aws.amazon.com/about-aws/whats-new/2021/04/amazon-ecs-aws-fargate-configure-size-ephemeral-storage-tasks/
It clearly states that you can use any of the AWS Copilot CLI, CloudFormation, AWS SDKs, and the AWS CLI.
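A common CLI path is: pull the current definition with describe-task-definition, strip it down to the keys register-task-definition accepts (the exact list from the console error quoted above), add ephemeralStorage, and re-register with `aws ecs register-task-definition --cli-input-json`. A sketch of the cleanup step, assuming the task definition is already parsed JSON; the helper name is mine, and note that sizeInGiB must be a number, not a quoted string:

```python
# Keys the RegisterTaskDefinition input accepts, per the console error quoted above.
ALLOWED_KEYS = {
    "family", "containerDefinitions", "volumes", "taskRoleArn", "networkMode",
    "requiresCompatibilities", "cpu", "memory", "inferenceAccelerators",
    "executionRoleArn", "pidMode", "ipcMode", "proxyConfiguration", "tags",
    "placementConstraints",
}

def with_ephemeral_storage(task_definition, size_in_gib):
    """Return a copy of a describe-task-definition result reduced to the keys
    register-task-definition accepts, plus the ephemeralStorage setting."""
    cleaned = {k: v for k, v in task_definition.items() if k in ALLOWED_KEYS}
    # sizeInGiB is an integer (21-200 on Fargate); a quoted "25" is rejected.
    cleaned["ephemeralStorage"] = {"sizeInGiB": int(size_in_gib)}
    return cleaned
```

This also drops read-only fields like taskDefinitionArn and revision that describe-task-definition returns but register-task-definition refuses.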
This gave me an "amazing" user experience today, what a shame :)
Any update on this guys?
If the error message were shown in full, we could save a lot of time building things with AWS instead of going round and round hunting for the exact error.
This issue was registered on my 60th birthday. It is now 1 and I am 61.
This is a very frustrating issue.
sashoalm on StackOverflow suggested that the full error message might be found in CloudTrail, but that didn't work for me: https://stackoverflow.com/questions/66919512/stoppedreason-in-ecs-fargate-is-truncated
A fix or a workaround would be great
I've also come across this issue several times. The last time (today), it resulted in significant troubleshooting of IAM roles and SSM secrets, because the first half of the error was about retrieving a 'secretsmanager' ARN. However, after a few hours of troubleshooting and eventually going to AWS Support, the actual issue turned out to be networking: the IGW was offline.
Once the engineer found the last portion of the error message, I saw the context timeout error and knew exactly what it was. Please fix this, it is very frustrating.
This issue still exists. It would be nice to get some traction on such a small issue but it would really help in troubleshooting.
Thank you. This is working for me :)
Then I found the API call made by ECS to SSM.
"errorMessage": "User: arn:aws:sts::<accountId>:assumed-role/ecsTaskExecutionRole/0ba9a209db2848ejafhh17567haj16 is not authorized to perform: ssm:GetParameters on resource: arn:aws:ssm:eu-central-1:<accountId>:parameter//sorry-cypress/minio_pw because no identity-based policy allows the ssm:GetParameters action",
My problem was that I used /${aws_ssm_parameter.sorry_cypress_mongo.name} in Terraform; aws_ssm_parameter.sorry_cypress_mongo.name already starts with "/", so I ended up with "//" :)
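If you build parameter ARNs yourself (in Terraform locals, a deploy script, etc.), normalising the leading slash avoids this whole class of bug. A hypothetical helper; the function name is mine:

```python
def ssm_parameter_arn(region, account_id, name):
    """Build an SSM parameter ARN, collapsing the accidental double slash that
    appears when "/" is prepended to a name that already starts with one."""
    path = "/" + name.lstrip("/")
    return f"arn:aws:ssm:{region}:{account_id}:parameter{path}"
```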
In my case, this was due to a couple of missing permissions for the ECR pull-through cache. I ended up with a policy like this on my ECS task's execution role:
{
"Statement": [
{
"Action": [
"ecr:CreateRepository",
"ecr:BatchImportUpstreamImage"
],
"Effect": "Allow",
"Resource": [
"arn:aws:ecr:us-east-1:xxx:repository/ecr-public/xray/*",
"arn:aws:ecr:us-east-1:xxx:repository/ecr-public/cloudwatch-agent/*"
],
"Sid": ""
}
],
"Version": "2012-10-17"
}
@accdha this is off-topic for the issue. The error can literally be dozens of different things at least, no need to post them here. The issue is about ECS truncating the error message regardless of the error or its reason.
Keep in mind that every time you post here, you're sending an email to 25 people.
If you look at the history, note that other people have been sharing non-obvious causes. The fix will be when the error messages aren't truncated, which is why I also raised it with our TAM, but in the meantime people often benefit from suggestions for additional points to review after they've exhausted the most obvious options.
Right, but all of these fixes are irrelevant to the containers roadmap issue here, which is about truncated errors.
17 months later, and still no progress? I came to this issue by way of an AWS support ticket, asking how to get the full error message. They pointed me at this github issue, so seems like the team is well aware of it. Funny thing is, AWS support was able to give me the full, non-truncated error message, so it is available somewhere.
My recommendation is to open a support ticket if you want the full error. Maybe that will help put some priority on this issue.
good idea
I have come across this problem again, 11 months after posting here the last time. I wasted most of my day today chasing phantoms due to a clipped error message. Fortunately this particular cluster is EC2/ECS, so I was ultimately able to ssh in and rerun the docker commands to see the full, exceedingly helpful error message.
These error messages need to be exposed in full.
Here I am again... chasing ghosts and phantoms...
+1
+1
Keep in mind that every time you post here, you're sending an email to 29 people. You can just leave a 👍 on the issue instead of a new comment.
Fargate now supports longer error messages, increasing the length of the "stoppedReason" field in the ECS DescribeTasks API response from 255 to 1024 characters. This should make debugging task failures easier, providing you with appropriate information around a failure and reducing the occurrence of error message truncation due to the limited length of the "stoppedReason" field.
Still keeping this issue open for now to track if anyone continues to face the truncation issue. Please report any such occurrences here.
Looks like 1024 is not enough for some cases. We just hit this where the token as part of the url exceeds the limit:
CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://prod-eu-central-1-starport-layer-bucket.s3.eu-central-1.amazonaws.com/cedb17-026714298493-3ac1d3c9-4bbf-51f8-1fa2-2ec2d19d9a14/3ecdbcc6-70f2-48bc-b48f-40441f4dbcf0?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEB0aDGV1LWNlbnRyYWwtMSJHMEUCIQDI33gO5Z6NzDamMOuri8r1rUoiYJ%2B5IAeLwBWGc7IiRAIgYZX%2FME2%2BWL3kN8GvKqlBtdZks3vxdGwvcW4cH%2B6V%2FIwqzgMI1v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwxNTY5NjA3MTUzNzUiDMskpZ9B%2FM6%2FN%2F231SqiA0RekF%2FcPij5BR3JjJhmxJR4jjvlNlaGBLDIqIExaYB0q4IKAkeqplm8FErC8CT%2Bfm%2FNqiVzJgBAW1erEivpS0EpV0QBmq4yXFNBRnPS%2BSNG%2FI8OV23YsDep66aUDairHaKOsQEcGfT2BK4nuSGzbTxl%2FNhBh6uielnZQNB3p9AVGL%2BC5OT1unY67TFV%2FPmZcdk9cjQdMr07JYbZm7fox0Rs%2BBablGYIh%2FNg3qSo3gpbuBKNqx18T1zmkKQQLt6q2wk1UZE7O5BeN%2FaF%2FLmNOLpZHpVgZEuiuGZGB%2F6qNIKJs3SCuVArZVv801km2c3THhiLvPPm3siIRttAeQX9NpQTA67mpQU1utmTyi24zO2xo%2FoeG6yqayPhN24fvewd0JJnJsw%2BEg0LJao4c5C%2BezFuQaoOQwpDR5U8FKAbfbXoV...
@tobiashenkel Most likely your task can't talk to that bucket. The part of the error message that is getting truncated after the URL is likely "timeout". (Possible fixes: create an S3 gateway in your VPC; create a NAT gateway in your VPC.)
AWS: This is a minor security issue, since the signed URL in the error message (which folks are copying and pasting) contains credentials sufficient for others to also potentially pull the container. If possible, *-starport-* bucket URLs should be scrubbed from error messages before doing any work to expand the error message visibility.
+1
please just leave a thumbs up in the original post above instead of leaving a separate comment
Thanks, we already figured out the root cause (typo in security group). I just wanted to mention here that not all error messages are covered by the previous fix.
When running
aws ecs describe-tasks --cluster <cluster-id> --tasks <task-id>
my problem is not the stoppedReason field but the containers[].reason field (simplified JSON):
{
"tasks":[
{
"containers":[
{
"lastStatus":"STOPPED",
"reason":"CannotStartContainerError: Error response from daemon: driver failed programming external connectivity on endpoint ecs-BazelClusterStackBazelRemoteCacheTaskDefD45D67B6-8-bazel-remote-cache-server-aecccec899b9909b0400 (8274b80ccd5d9d806417b5e2b60f8b553e517",
"healthStatus":"UNKNOWN"
}
],
"stoppedReason":"Task failed to start"
}
]
}
So the error message is also cut off here:
CannotStartContainerError: Error response from daemon: driver failed programming external connectivity on endpoint ecs-BazelClusterStackBazelRemoteCacheTaskDefD45D67B6-8-bazel-remote-cache-server-aecccec899b9909b0400 (8274b80ccd5d9d806417b5e2b60f8b553e517
It's also cut off in the AWS Console.
This is super annoying because the important part is missing.
Same thing is happening to me. Also unable to pull any relevant information from CloudTrail, with the Task GUID
Same happened to me. In my case, instead of CloudTrail, I could find my logs as a stream in CloudWatch, even though my task didn't seem to throw any errors at all in the ECS screen before failing.
@ysfaran a bit late poking on this. Do you still use the container.reason field? We are looking at extending the container.reason field to 1024 characters.
@weijuans thanks for looking into this.
Yes, I still use it, although in most cases I no longer hit this problem, as the message was short enough to be fully visible. But I would still highly recommend increasing the limit.
While 1024 is already much better than 255, I still don't see the point in cutting this message off at all. I would assume that for success messages the limit is irrelevant, as the message is short enough every time. But in the case of an error, the reason field might be longer than expected, and for some devs even 1024 characters might not be enough to analyse the error.
That being said, I personally would be fine with the new 1024 limit as of now.
Linking the question-issue from AWS-team here: https://github.com/aws/containers-roadmap/issues/2366 (Stopped task error message enhancements input requested)
It seems that the container.reason field is still truncated to 255 characters as of writing this message despite #2366 being closed.
Community Note
Tell us about your request What do you want us to build?
Which service(s) is this request for? This could be Fargate, ECS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Currently, tasks that fail with long reasons get their output truncated, which limits debugging of failures.
Example:
The output is truncated after "Context ca", which we assume is "Context canceled".
There are more examples, but they are not at hand right now.
Are you currently working around this issue? It is not possible to work around.
Additional context no context
Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)