aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[FARGATE][ECS] [request]: Stop Truncating the output for Task Failures #1133

Open dsalamancaMS opened 3 years ago

dsalamancaMS commented 3 years ago

Community Note

Tell us about your request: What do you want us to build?

Which service(s) is this request for? This could be Fargate or ECS.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Currently, tasks that fail with long reasons get their output truncated, which limits debugging of failures.

Example:

CannotPullContainerError: containerd: pull command failed: time="2020-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: context ca...

The output is truncated after "context ca", which we assume stands for "context canceled".

There are more examples, but they are not at hand right now.

Are you currently working around this issue? It is not possible to work around.

Additional context: none.

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

maxgoldberg commented 3 years ago

Adding a note to say that in some contexts this can truncate literally the most important part of an error. For example:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 1 time(s): failed to fetch secret arn:aws:secretsmanager:us-east-1:xxxxxxxxxxx...

kunalsawhney commented 3 years ago

Something similar happened to me recently:

CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...

These truncated messages are very annoying since the most useful part of the reason is missing. AWS Support is also not able to figure out the cause. I am still looking for the reason my tasks keep getting STOPPED.

cmwilhelm commented 3 years ago

Any update on this ticket? We are also running into this. It makes it very hard to debug operational issues in Fargate.

Sytten commented 3 years ago

This is very annoying; I have like 20 secrets and I don't know which one failed to be pulled.

aitorres commented 3 years ago

+1. Running across this very same issue, and it's frustrating not being able to tell what just caused my task to fail.

singsonn commented 3 years ago

Any update on this ticket? Running into this same issue where the stopped reason is truncated, which makes it very difficult to investigate exactly what caused the task to stop...

yasunaga-shuto commented 3 years ago

Running into this same issue... It was temporarily solved after I deleted all the images in ECR and recreated one yesterday. However, it happened again today. This is so frustrating...

a-foster commented 3 years ago

In case this helps anyone else: in my case I knew my container had outbound internet access (so the issue wasn't subnet/IGW related), which meant it could only be something wrong with the IAM policy I'd used for my task.

I double-checked my secret ARN and realised that AWS had helpfully appended a '-aBcD' string onto the end of my secret name (I'd assumed that the ARN would just end with the secret name I'd specified...), so I updated my policy and it's working fine.
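
For anyone hitting the same thing: Secrets Manager appends a random suffix to the secret name in the ARN, so an execution-role policy that matches on the ARN needs either the exact suffixed ARN or a trailing wildcard. A minimal sketch only (region, account ID, and secret name are placeholders, not from this thread):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:my-app/db-password-*"
        }
    ]
}

The trailing "-*" is what covers the random suffix; pinning the exact suffixed ARN is stricter if you prefer that.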

Prateek-Tyagi commented 3 years ago

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): InvalidParameterException: Invalid parameter at 'registryIds' fail...

What does it mean? Can somebody please help? I have been facing this issue for the last 3 days. Before, it was working fine in all 13 clusters with the same configuration. Then I tried making one more cluster with Fargate platform version 1.4.0 and this issue came up; now 8 clusters are showing this error. I have tried everything on the internet but the issue remains. Any leads, please...

carthic1 commented 3 years ago

> Something similar happened to me recently:
>
> CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...
>
> These truncated messages are very annoying since the most useful part of the reason is missing. AWS Support is also not able to figure out the cause. I am still looking for the reason my tasks keep getting STOPPED.

Hi Kunal, were you able to resolve this error? I am facing the same issue and am clueless about how to resolve it.

marcoaal commented 3 years ago

Hi @carthic1 @kunalsawhney, I faced the same issue:

> Something similar happened to me recently:
>
> CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...
>
> These truncated messages are very annoying since the most useful part of the reason is missing. AWS Support is also not able to figure out the cause. I am still looking for the reason my tasks keep getting STOPPED.

The main reason in my case was the size of the Docker image that my Fargate task was trying to pull.

My image is really big (21+ GB) and the default storage limit for a Fargate task is 20 GB. Looking through the AWS docs I found the EphemeralStorage parameter of an ECS task definition; adding a considerably larger size solved the issue:

EphemeralStorage:
        SizeInGiB: 30

I hope this helps you

References: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-ephemeralstorage.html https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-task-storage.html

kunalsawhney commented 3 years ago

@carthic1 the issue is due to a lack of storage in the ECS task. AWS has launched a new feature that lets you attach ephemeral storage of up to 200 GB to your tasks. You can use this capability to increase your task storage.

mohamedFaris47 commented 3 years ago

@carthic1 it was a storage problem for me too, and I solved it as @marcoaal and @kunalsawhney mentioned. If you are using AWS Copilot, just add these two lines to the task's manifest.yml file:

storage:
  ephemeral: 35

The value 35 can be increased up to 200 GiB of ephemeral storage.

dezren39 commented 3 years ago

Is there anywhere in all of AWS that doesn't truncate this message? Can I run a CLI command, for instance, that will pull the full message? It's really an absurdly short character limit for the purpose.

gshpychka commented 3 years ago

@dezren39 No, it's truncated everywhere. It's pretty absurd.
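
For reference, the fields in question can be pulled with the DescribeTasks API, though (per the reply above) they come back just as truncated as in the console. A sketch only; the cluster name and task ID are placeholders:

# Shows the task-level stoppedReason and the per-container reason fields.
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks 0123456789abcdef0123456789abcdef \
  --query 'tasks[].{stoppedReason: stoppedReason, containerReasons: containers[].reason}' \
  --output json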

adesgautam commented 3 years ago

Hi,

I am also getting this error: CannotPullContainerError: containerd: pull command failed: time="2021-08-20T13:58:49Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:e246d4b4c5f108af0f72da900f45ae9a37e1d184d8d605ab4117293b6990b7b8: write /var... Network

I am trying to add ephemeral storage using the console, by clicking on "Configure via JSON" in the task definition and adding the lines below: "ephemeralStorage": { "sizeInGiB": "25" }

but now getting the error: Should only contain "family", "containerDefinitions", "volumes", "taskRoleArn", "networkMode", "requiresCompatibilities", "cpu", "memory", "inferenceAccelerators", "executionRoleArn", "pidMode", "ipcMode", "proxyConfiguration", "tags", "placementConstraints"

I can't add extra storage now. Can anyone help me here?

adesgautam commented 3 years ago

Hi @marcoaal @kunalsawhney @mohamedFaris47, how were you able to resolve the issue? Can you please help me resolve the issue below?

> Hi,
>
> I am also getting this error: CannotPullContainerError: containerd: pull command failed: time="2021-08-20T13:58:49Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:e246d4b4c5f108af0f72da900f45ae9a37e1d184d8d605ab4117293b6990b7b8: write /var... Network
>
> I am trying to add ephemeral storage using the console, by clicking on "Configure via JSON" in the task definition and adding the lines below: "ephemeralStorage": { "sizeInGiB": "25" }
>
> but now getting the error: Should only contain "family", "containerDefinitions", "volumes", "taskRoleArn", "networkMode", "requiresCompatibilities", "cpu", "memory", "inferenceAccelerators", "executionRoleArn", "pidMode", "ipcMode", "proxyConfiguration", "tags", "placementConstraints"
>
> I can't add extra storage now. Can anyone help me here?

kunalsawhney commented 3 years ago

Hi @adesgautam,

The extra ephemeral storage for Fargate tasks cannot be configured through the console.

You can refer to this doc for details on what all options are supported: https://aws.amazon.com/about-aws/whats-new/2021/04/amazon-ecs-aws-fargate-configure-size-ephemeral-storage-tasks/

It clearly states that you can use any of the AWS Copilot CLI, CloudFormation, the AWS SDKs, or the AWS CLI.
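
For illustration, a minimal sketch of the AWS CLI route (the file name is a placeholder and your task definition JSON will differ):

# taskdef.json is your existing task definition JSON with one extra top-level key:
#   "ephemeralStorage": { "sizeInGiB": 30 }
aws ecs register-task-definition --cli-input-json file://taskdef.json

If you export the JSON with describe-task-definition, read-only fields such as taskDefinitionArn, revision, status, and registeredAt need to be removed before re-registering.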

slavafomin commented 2 years ago

This gave me an "amazing" user experience today, what a shame :)

techministrator commented 2 years ago

Any update on this guys?

If the error message were shown in full, we could save a lot of time building more things with AWS instead of going round and round hunting for the exact error.

chadnash commented 2 years ago

This issue was registered on my 60th birthday. It is now 1 and I am 61.

RichardBradley commented 2 years ago

This is a very frustrating issue.

sashoalm on StackOverflow suggested that the full error message might be found in CloudTrail, but that didn't work for me: https://stackoverflow.com/questions/66919512/stoppedreason-in-ecs-fargate-is-truncated

A fix or a workaround would be great

Pettles commented 2 years ago

I've also come across this issue several times. The last of which (today) resulted in significant troubleshooting of IAM roles and SSM secrets, because the first half of the error was about retrieving a 'secretsmanager' ARN. However, after a few hours of troubleshooting and eventually going to AWS Support, the issue was actually a networking issue because the IGW was offline.

Once the last portion of the error message was found by the engineer, I saw the context timeout error message and knew exactly what it was. Please fix this, it is very frustrating.

rmontgomery2018 commented 2 years ago

This issue still exists. It would be nice to get some traction on it; it seems like a small change but it would really help in troubleshooting.

pitthecat commented 2 years ago

> This is a very frustrating issue.
>
> sashoalm on StackOverflow suggested that the full error message might be found in CloudTrail, but that didn't work for me: https://stackoverflow.com/questions/66919512/stoppedreason-in-ecs-fargate-is-truncated
>
> A fix or a workaround would be great

Thank you, this worked for me :)

Then I found the API call made by ECS to SSM:

"errorMessage": "User: arn:aws:sts::<accountId>:assumed-role/ecsTaskExecutionRole/0ba9a209db2848ejafhh17567haj16 is not authorized to perform: ssm:GetParameters on resource: arn:aws:ssm:eu-central-1:<accountId>:parameter//sorry-cypress/minio_pw because no identity-based policy allows the ssm:GetParameters action",

My problem was that I used /${aws_ssm_parameter.sorry_cypress_mongo.name} in Terraform, but aws_ssm_parameter.sorry_cypress_mongo.name already starts with "/", so I ended up with "//" :)
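
If you want to go hunting the same way, a sketch of the CloudTrail lookup (assuming, as above, that the failing call is ssm:GetParameters; the start time is a placeholder):

# The full error message shows up in the errorMessage field of the matching event.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetParameters \
  --start-time 2022-01-01T00:00:00Z \
  --max-results 20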

acdha commented 2 years ago

In my case, this was due to a couple of missing permissions for the ECR pull-through cache. I ended up with a policy like this on my ECS task's execution role:

{
    "Statement": [
        {
            "Action": [
                "ecr:CreateRepository",
                "ecr:BatchImportUpstreamImage"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ecr:us-east-1:xxx:repository/ecr-public/xray/*",
                "arn:aws:ecr:us-east-1:xxx:repository/ecr-public/cloudwatch-agent/*"
            ],
            "Sid": ""
        }
    ],
    "Version": "2012-10-17"
}

gshpychka commented 2 years ago

@acdha this is off-topic for the issue. The error can be caused by literally dozens of different things; there's no need to post them all here. The issue is about ECS truncating the error message, regardless of what the error or its reason is.

Keep in mind that every time you post here, you're sending an email to 25 people.

acdha commented 2 years ago

> @acdha this is off-topic for the issue. The error can be caused by literally dozens of different things; there's no need to post them all here. The issue is about ECS truncating the error message, regardless of what the error or its reason is.
>
> Keep in mind that every time you post here, you're sending an email to 25 people.

If you look at the history, note that other people have been sharing non-obvious causes. The fix will be when the error messages aren’t truncated, which is why I also raised it with our TAM, but in the meantime people often benefit from suggestions for additional points to review after they’ve exhausted the most obvious options.

gshpychka commented 2 years ago

> > @acdha this is off-topic for the issue. The error can be caused by literally dozens of different things; there's no need to post them all here. The issue is about ECS truncating the error message, regardless of what the error or its reason is.
> >
> > Keep in mind that every time you post here, you're sending an email to 25 people.
>
> If you look at the history, note that other people have been sharing non-obvious causes. The fix will be when the error messages aren’t truncated, which is why I also raised it with our TAM, but in the meantime people often benefit from suggestions for additional points to review after they’ve exhausted the most obvious options.

Right, but all of these fixes are irrelevant to the containers roadmap issue here, which is about truncated errors.

alewando commented 2 years ago

17 months later, and still no progress? I came to this issue by way of an AWS support ticket, asking how to get the full error message. They pointed me at this GitHub issue, so it seems the team is well aware of it. The funny thing is, AWS Support was able to give me the full, non-truncated error message, so it is available somewhere.

My recommendation is to open a support ticket if you want the full error. Maybe that will help put some priority on this issue.

chadnash commented 2 years ago

Good idea.

cmwilhelm commented 2 years ago

I have come across this problem again, 11 months after posting here the last time. I wasted most of my day today chasing phantoms due to a clipped error message. Fortunately this particular cluster is EC2/ECS, so I was ultimately able to ssh in and rerun the docker commands to see the full, exceedingly helpful error message.

These error messages need to be exposed in full.

FrancoCorleone commented 2 years ago

Here I am again... chasing ghosts and phantoms...

aitorres commented 2 years ago

+1

gshpychka commented 2 years ago

+1

Keep in mind that every time you post here, you're sending an email to 29 people. You can just use 👍 on the issue instead of a new comment.

vaibhavkhunger commented 2 years ago

Fargate now supports longer error messages, increasing the length of the ‘stoppedReason’ field in the ECS DescribeTasks API response from 255 to 1024 characters. This should make debugging task failures easier, providing you with the appropriate information around a failure and reducing the occurrence of error message truncation due to the limited length of the ‘stoppedReason’ field.

Still keeping this issue open for now to track if anyone continues to face the truncation issue. Please report any such occurrences here.

tobiashenkel commented 1 year ago

Looks like 1024 is not enough for some cases. We just hit this where the token that is part of the URL exceeds the limit:

CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://prod-eu-central-1-starport-layer-bucket.s3.eu-central-1.amazonaws.com/cedb17-026714298493-3ac1d3c9-4bbf-51f8-1fa2-2ec2d19d9a14/3ecdbcc6-70f2-48bc-b48f-40441f4dbcf0?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEB0aDGV1LWNlbnRyYWwtMSJHMEUCIQDI33gO5Z6NzDamMOuri8r1rUoiYJ%2B5IAeLwBWGc7IiRAIgYZX%2FME2%2BWL3kN8GvKqlBtdZks3vxdGwvcW4cH%2B6V%2FIwqzgMI1v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwxNTY5NjA3MTUzNzUiDMskpZ9B%2FM6%2FN%2F231SqiA0RekF%2FcPij5BR3JjJhmxJR4jjvlNlaGBLDIqIExaYB0q4IKAkeqplm8FErC8CT%2Bfm%2FNqiVzJgBAW1erEivpS0EpV0QBmq4yXFNBRnPS%2BSNG%2FI8OV23YsDep66aUDairHaKOsQEcGfT2BK4nuSGzbTxl%2FNhBh6uielnZQNB3p9AVGL%2BC5OT1unY67TFV%2FPmZcdk9cjQdMr07JYbZm7fox0Rs%2BBablGYIh%2FNg3qSo3gpbuBKNqx18T1zmkKQQLt6q2wk1UZE7O5BeN%2FaF%2FLmNOLpZHpVgZEuiuGZGB%2F6qNIKJs3SCuVArZVv801km2c3THhiLvPPm3siIRttAeQX9NpQTA67mpQU1utmTyi24zO2xo%2FoeG6yqayPhN24fvewd0JJnJsw%2BEg0LJao4c5C%2BezFuQaoOQwpDR5U8FKAbfbXoV...

dacut commented 1 year ago

@tobiashenkel Most likely your task can't talk to that bucket. The part of the error message that is getting truncated after the URL is likely "timeout". (Possible fixes: create an S3 gateway endpoint in your VPC, or create a NAT gateway in your VPC.)

AWS: This is a minor security issue, since the signed URL in the error message (which folks are copying and pasting) contains enough credentials for others to potentially pull the container as well. If possible, *-starport-* bucket URLs should be scrubbed from error messages before doing any work to expand error message visibility.
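
If the root cause does turn out to be missing connectivity to that S3 bucket, a gateway endpoint is the usual fix. A sketch only; the VPC ID, route table ID, and region are placeholders:

# Adds an S3 gateway endpoint so image layers can be fetched without internet egress.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.eu-central-1.s3 \
  --route-table-ids rtb-0123456789abcdef0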

singsonn commented 1 year ago

+1

gshpychka commented 1 year ago

> +1

Please just leave a thumbs up on the original post above instead of leaving a separate comment.

tobiashenkel commented 1 year ago

> @tobiashenkel Most likely your task can't talk to that bucket. The part of the error message that is getting truncated after the URL is likely "timeout". (Possible fixes: create an S3 gateway endpoint in your VPC, or create a NAT gateway in your VPC.)
>
> AWS: This is a minor security issue, since the signed URL in the error message (which folks are copying and pasting) contains enough credentials for others to potentially pull the container as well. If possible, *-starport-* bucket URLs should be scrubbed from error messages before doing any work to expand error message visibility.

Thanks, we already figured out the root cause (typo in security group). I just wanted to mention here that not all error messages are covered by the previous fix.

ysfaran commented 1 year ago

When running

aws ecs describe-tasks --cluster <cluster-id> --tasks <task-id>

my problem is not the stoppedReason field but the containers[].reason field (simplified JSON):

{
  "tasks":[
    {
      "containers":[
        {
          "lastStatus":"STOPPED",
          "reason":"CannotStartContainerError: Error response from daemon: driver failed programming external connectivity on endpoint ecs-BazelClusterStackBazelRemoteCacheTaskDefD45D67B6-8-bazel-remote-cache-server-aecccec899b9909b0400 (8274b80ccd5d9d806417b5e2b60f8b553e517",
          "healthStatus":"UNKNOWN"
        }
      ],
      "stoppedReason":"Task failed to start"
    }
  ]
}

So the error message is also cut off here:

CannotStartContainerError: Error response from daemon: driver failed programming external connectivity on endpoint ecs-BazelClusterStackBazelRemoteCacheTaskDefD45D67B6-8-bazel-remote-cache-server-aecccec899b9909b0400 (8274b80ccd5d9d806417b5e2b60f8b553e517

It's also cut off in the AWS Console.

This is super annoying because the important part is missing...

eric-chao-synaptechealth commented 1 year ago

The same thing is happening to me. I'm also unable to pull any relevant information from CloudTrail using the task GUID.

javiermrz commented 1 year ago

The same happened to me. In my case, instead of CloudTrail, I could find my logs as a stream in CloudWatch, even though my task didn't seem to throw any errors at all on the ECS screen before failing.
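
If the task uses the awslogs driver, those container-side logs can also be tailed directly. A sketch only; the log group name is a placeholder and aws logs tail requires AWS CLI v2:

# Streams recent log events for the task's log group.
aws logs tail /ecs/my-task-family --follow --since 1h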

weijuans commented 3 months ago

@ysfaran a bit late poking on this. Do you still use the container.reason field? We are looking at extending the container.reason field to 1024 characters.

ysfaran commented 3 months ago

@weijuans thanks for looking into this.

Yes, I still use it, although in most cases I don't have this problem anymore, as the message is usually short enough to be fully visible. But I would still highly recommend increasing the limit.

While 1024 is already much better than 255, I still don't see the point in cutting off this message at all. I would assume that for success messages the limit is irrelevant, as the message is short enough every time; but in the case of an error, the reason field might be longer than expected, and for some devs even 1024 characters might not be enough to analyse the error.

That being said, I personally would be fine with the new 1024 limit for now.

rgoltz commented 2 months ago

Linking the related question issue from the AWS team here: https://github.com/aws/containers-roadmap/issues/2366 (Stopped task error message enhancements input requested).

skairunner commented 2 weeks ago

It seems that the container.reason field is still truncated to 255 characters as of writing this message despite #2366 being closed.