aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS/Fargate] [request]: ECS Exec : support readonlyRootFilesystem containers #1359

Open sd65 opened 3 years ago

sd65 commented 3 years ago

Tell us about your request

I would like to use the ECS Exec feature with readonlyRootFilesystem enabled containers.

Which service(s) is this request for?

ECS/Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Currently, containers with readonlyRootFilesystem enabled are not supported: the AWS-managed agent crashes soon after launch.

Are you currently working around this issue?

Yes. I've managed to get it working with readonlyRootFilesystem: true by mounting /managed-agents, /var/lib/amazon/ssm, and /var/log/amazon/ssm as writable volumes inside the container.
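
For illustration, here's a minimal task-definition sketch of what I mean (volume and container names are placeholders; the mountPoints go on the container you want to exec into):

    "volumes": [
        {"name": "managed-agents", "host": {}},
        {"name": "var-lib-amazon-ssm", "host": {}},
        {"name": "var-log-amazon-ssm", "host": {}}
    ],
    "containerDefinitions": [
        {
            "name": "app",
            "readonlyRootFilesystem": true,
            "mountPoints": [
                {"sourceVolume": "managed-agents", "containerPath": "/managed-agents", "readOnly": false},
                {"sourceVolume": "var-lib-amazon-ssm", "containerPath": "/var/lib/amazon/ssm", "readOnly": false},
                {"sourceVolume": "var-log-amazon-ssm", "containerPath": "/var/log/amazon/ssm", "readOnly": false}
            ]
        }
    ]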

Additional context

https://github.com/aws-containers/amazon-ecs-exec-checker/issues/21

toricls commented 3 years ago

Wrote a small article about working around this limitation - https://toris.io/2021/06/using-ecs-exec-with-readonlyrootfilesystem-enabled-containers/

naomine-biz commented 2 years ago

If you use the EC2-backed ECS agent version 1.57.0, you should not specify the bind mount /var/log/amazon/ssm, as it will overlap with the mount set by the agent and prevent the container from starting.

dariusz22p commented 1 year ago

Is it the same on EKS?

bmfs commented 1 year ago

> Wrote a small article about working around this limitation - https://toris.io/2021/06/using-ecs-exec-with-readonlyrootfilesystem-enabled-containers/

I was unable to replicate this workaround, either by declaring the volumes in the Dockerfile or in the Task Definition. Maybe something changed in the SSM Agent that now prevents it.
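
(By declaring the volumes in the Dockerfile, I mean anonymous volumes along these lines:)

    VOLUME ["/managed-agents", "/var/lib/amazon/ssm", "/var/log/amazon/ssm"]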

jdoylei commented 8 months ago

Hi @bmfs - I just wanted to note that the workaround works for me in ECS Fargate 1.4.0, using the Task Definition approach. So it might be due to an environment difference rather than changes in SSM Agent.

Our task-definition has the 3 volumes:

    "volumes": [
        {
            "name": "managed-agents",
            "host": {}
        },
        {
            "name": "var-lib-amazon-ssm",
            "host": {}
        },
        {
            "name": "var-log-amazon-ssm",
            "host": {}
        },

And the 3 mount points in one of the containers:

            "mountPoints": [
                {
                    "sourceVolume": "managed-agents",
                    "containerPath": "/managed-agents",
                    "readOnly": false
                },
                {
                    "sourceVolume": "var-lib-amazon-ssm",
                    "containerPath": "/var/lib/amazon/ssm",
                    "readOnly": false
                },
                {
                    "sourceVolume": "var-log-amazon-ssm",
                    "containerPath": "/var/log/amazon/ssm",
                    "readOnly": false
                },

This container has the agent running:

                    "managedAgents": [
                        {
                            "lastStartedAt": "2024-03-25T12:22:04.019000-04:00",
                            "name": "ExecuteCommandAgent",
                            "lastStatus": "RUNNING"
                        }
                    ],

(Other containers in the same task-definition without the mount points have the agent stopped:)

                    "managedAgents": [
                        {
                            "name": "ExecuteCommandAgent",
                            "lastStatus": "STOPPED"
                        }
                    ],

With this configuration, we're able to use "aws ecs execute-command" on the container with the agent running:

PS C:\Users\u123> aws ecs execute-command --profile xyz --cluster xyz --container xyz --interactive --command "/bin/sh" --task arnxyz

sh-4.4# df -a | grep agents\\\|ssm
/dev/nvme1n1    30787492 13423340  15774904  46% /managed-agents
/dev/nvme1n1    30787492 13423340  15774904  46% /var/lib/amazon/ssm
/dev/nvme1n1    30787492 13423340  15774904  46% /var/log/amazon/ssm
/dev/nvme0n1p1   5082764  2126208   2887764  43% /managed-agents/execute-command

sh-4.4# ps wwax --forest
  PID TTY      STAT   TIME COMMAND
  101 ?        Ssl    0:00 /managed-agents/execute-command/amazon-ssm-agent
  157 ?        Sl     0:00  \_ /managed-agents/execute-command/ssm-agent-worker
25146 ?        Sl     0:00      \_ /managed-agents/execute-command/ssm-session-worker ecs-execute-command-c9d0acd90ca90
25165 pts/0    Ss     0:00          \_ /bin/sh
25942 pts/0    R+     0:00              \_ ps wwax --forest

@sd65 and @toricls - thanks so much for documenting this workaround for other ECS users. AWS ought to at least note this workaround in its documentation, if only with the caveat that the user is taking responsibility for it continuing to work.

gmuslia commented 2 months ago

Below is the error displayed in the CLI when this issue occurs (attaching it here for easier searching):

    An error occurred (InvalidParameterException) when calling the ExecuteCommand operation: The execute command failed because execute command was not enabled when the task was run or the execute command agent isn’t running. Wait and try again or run a new task with execute command enabled and try again.

mselcik commented 1 month ago

> Hi @bmfs - I just wanted to note that the workaround works for me in ECS Fargate 1.4.0, using the Task Definition approach. [...]

Thanks for this workaround. What is the purpose of including the /managed-agents volume? I successfully implemented this workaround on ECS Fargate with only /var/lib/amazon/ssm and /var/log/amazon/ssm. Note that I used 'aws ssm start-session' rather than 'aws ecs execute-command'.

It seems the /managed-agents directory contains the agent binaries, and I'm not sure any data is written there while the agent is running.
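
(For reference, the start-session invocation looks roughly like the sketch below; the ecs: target format for ECS Exec sessions is my best recollection, and the cluster name, task ID, and container runtime ID are placeholders:)

    aws ssm start-session \
        --target "ecs:<cluster-name>_<task-id>_<container-runtime-id>"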

jdoylei commented 1 month ago

> What is the purpose of including the /managed-agents volume?

@mselcik - I checked my notes from earlier this year, and I think I included /managed-agents from the start based on the tips from @sd65 at the beginning of this issue. I don't think it was that I ran into an issue and was required to add it. I notice now, there was a filesystem/volume created automatically for /managed-agents/execute-command where the binaries are, which must be distinct from the /managed-agents volume I specified in the task definition. Maybe the automatically-created /managed-agents/execute-command is all that's necessary, rather than /managed-agents.

mselcik commented 1 month ago

> [...] Maybe the automatically-created /managed-agents/execute-command is all that's necessary, rather than /managed-agents.

Thanks for your response. I had a look and also observed that a filesystem at /managed-agents/execute-command was automatically created. Below is partial output from the "mount" command:

    /dev/nvme1n1 on /tmp type ext4 (rw,relatime)
    /dev/nvme1n1 on /var/lib/amazon/ssm type ext4 (rw,relatime)
    /dev/nvme1n1 on /var/log/amazon/ssm type ext4 (rw,relatime)
    /dev/nvme1n1 on /etc/hosts type ext4 (rw,relatime)
    /dev/nvme1n1 on /etc/resolv.conf type ext4 (rw,relatime)
    /dev/nvme1n1 on /etc/hostname type ext4 (rw,relatime)
    /dev/nvme0n1p1 on /managed-agents/execute-command type ext4 (ro,noatime)

The first three filesystems are created because three ECS volumes are specified in the task definition. However, the /managed-agents/execute-command filesystem is automatically created and read-only, so my conclusion is that this volume does not need to be created as part of the ECS task definition in order to enable "execute-command".
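
(Assuming that conclusion holds, the reduced volume list would be just:)

    "volumes": [
        {"name": "var-lib-amazon-ssm", "host": {}},
        {"name": "var-log-amazon-ssm", "host": {}}
    ]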

h0rv commented 2 weeks ago

The workaround failed for me, both with and without the /managed-agents mount:

    An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

jdoylei commented 2 weeks ago

@h0rv - If you describe your task, can you see what the managedAgents block says? I think that gives an indication, even before you try ecs exec, whether the agent has been started OK:

                    "managedAgents": [
                        {
                            "lastStartedAt": "2024-03-25T12:22:04.019000-04:00",
                            "name": "ExecuteCommandAgent",
                            "lastStatus": "RUNNING"
                        }
                    ],
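
(One way to check, with cluster and task ARN as placeholders:)

    aws ecs describe-tasks \
        --cluster <cluster> \
        --tasks <task-arn> \
        --query "tasks[].containers[].{name: name, managedAgents: managedAgents}"
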
h0rv commented 2 weeks ago

The amazon-ecs-exec-checker gives me the output below, indicating the agent is running fine in all containers:

...
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "datadog-agent"
         2. RUNNING for "logging-router"
         3. RUNNING for "my-service"
...

aws ecs execute-command \
    --cluster <cluster_arn> \
    --task <task_arn> \
    --container my-service \
    --command "/bin/bash" \
    --interactive

Results in the error:

    An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

Also, each container has its own volumes for each bind mount, as suggested in https://repost.aws/knowledge-center/ecs-error-execute-command#COBQ6pGrzfSSaDtECWcDVqBw.
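
(I.e., as I read it, distinct volumes per container rather than shared ones - volume names here are placeholders:)

    "volumes": [
        {"name": "ssm-lib-my-service", "host": {}},
        {"name": "ssm-log-my-service", "host": {}},
        {"name": "ssm-lib-logging-router", "host": {}},
        {"name": "ssm-log-logging-router", "host": {}}
    ]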

jdoylei commented 2 weeks ago

@h0rv - Wish I had a better idea, but, dumb question: is /bin/bash really available and executable in that my-service container? I went back through my notes and didn't see much different from what you pasted; I only noticed that in my container I was using /bin/sh, which made me wonder.

h0rv commented 2 weeks ago

@jdoylei - Yes it is available in my container and I was able to run this command before turning on readonly.

jdoylei commented 2 weeks ago

@h0rv - I see, sorry I couldn't be more help. It's tricky when the main tool you have for debugging - ECS Exec - doesn't work itself. When I was troubleshooting ECS Exec, I had to resort to embedding commands in my containers at startup to dump the filesystem list, the process tree, etc., to see what was going on - something like the sketch below.
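
(E.g., a throwaway container command override along these lines, where /my-app is a placeholder for the real entrypoint:)

    "command": [
        "sh", "-c",
        "mount; df -a; ps wwax --forest; exec /my-app"
    ]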