aws ecs execute-command fails with TargetNotConnectedException

leejayhsu commented 2 weeks ago

Describe the bug

I am unable to use ecs execute-command to connect to my ecs fargate task

Regression Issue

[X] Select this option if this issue appears to be a regression.

Expected Behavior

I should be able to connect to my ecs fargate task

Current Behavior

It fails to connect to ecs fargate task

command

aws ecs execute-command  \
--region us-west-2 \
--cluster core-services \
--task d179d101efa94c98aa62340b5705d726 \
--container app \
--command "/bin/bash" \
--interactive

Error

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

amazon-ecs-exec-checker output

Prerequisites for check-ecs-exec.sh v0.7
-------------------------------------------------------------
  jq      | OK (/opt/homebrew/bin/jq)
  AWS CLI | OK (/opt/homebrew/bin/aws)

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (aws-cli/2.19.4 Python/3.12.7 Darwin/24.0.0 source/arm64)
  Session Manager Plugin | OK (1.2.688.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : us-west-2
Cluster: core-services
Task   : d179d101efa94c98aa62340b5705d726
-------------------------------------------------------------
  Cluster Configuration  |
     KMS Key       : Not Configured
     Audit Logging : OVERRIDE
     S3 Bucket Name: Not Configured
     CW Log Group  : /ecs/dev/core-services, Encryption Enabled: true
  Can I ExecuteCommand?  | arn:aws:iam::xxxxx:user/xxxxx
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "log-router"
         2. RUNNING for "datadog-agent"
         3. RUNNING for "app"
    ----------
      Init Process Enabled (dev-app-task-def:555)
    ----------
         1. Enabled - "app"
         2. Disabled - "datadog-agent"
         3. Disabled - "log-router"
    ----------
      Read-Only Root Filesystem (dev-app-task-def:555)
    ----------
         1. Disabled - "app"
         2. Disabled - "datadog-agent"
         3. Disabled - "log-router"
  Task Role Permissions  | arn:aws:iam::xxxxx:role/ecsTaskExecutionRole
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
     -----
     logs:DescribeLogGroups: allowed
     logs:CreateLogStream: allowed
     logs:DescribeLogStreams: allowed
     logs:PutLogEvents: allowed
  VPC Endpoints          |
    Found existing endpoints for vpc-xxxxx:
      - com.amazonaws.us-west-2.s3
      - com.amazonaws.us-west-2.secretsmanager
      - com.amazonaws.us-west-2.ecr.api
      - com.amazonaws.us-west-2.ecr.dkr
      - com.amazonaws.us-west-2.ssmmessages
  Environment Variables  | (dev-app-task-def:555)
       1. container "app"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
       2. container "datadog-agent"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
       3. container "log-router"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
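
The "Managed Agent Status" rows in the checker output above correspond to the managedAgents entries that aws ecs describe-tasks returns for each container. The following is a minimal sketch of reading that field from an already-parsed response. The function name and the sample dictionary are illustrative assumptions, not output captured from this task.

```python
from typing import Any


def exec_agent_status(describe_tasks_response: dict[str, Any]) -> dict[str, str]:
    """Map container name -> lastStatus of its ExecuteCommandAgent, if present."""
    statuses: dict[str, str] = {}
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            for agent in container.get("managedAgents", []):
                if agent.get("name") == "ExecuteCommandAgent":
                    statuses[container["name"]] = agent.get("lastStatus", "UNKNOWN")
    return statuses


# Hypothetical, trimmed response shaped like `aws ecs describe-tasks` output.
sample_response = {
    "tasks": [
        {
            "containers": [
                {
                    "name": "app",
                    "managedAgents": [
                        {"name": "ExecuteCommandAgent", "lastStatus": "RUNNING"}
                    ],
                },
                {
                    "name": "log-router",
                    "managedAgents": [
                        {"name": "ExecuteCommandAgent", "lastStatus": "STOPPED"}
                    ],
                },
                # A container with no managed agent entry is simply skipped.
                {"name": "datadog-agent"},
            ]
        }
    ]
}

print(exec_agent_status(sample_response))
```

As this thread shows, a RUNNING agent does not guarantee that a session can connect, so this check is only a first filter.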

Reproduction Steps

run this command:

aws ecs execute-command  \
--region us-west-2 \
--cluster core-services \
--task d179d101efa94c98aa62340b5705d726 \
--container app \
--command "/bin/bash" \
--interactive

Possible Solution

No response

Additional Information/Context

No response

CLI version used

2.19.4

Environment details (OS name and version, etc.)

Python/3.12.7 Darwin/24.0.0 source/arm64

tim-finnigan commented 2 weeks ago

Thanks for reaching out. The TargetNotConnectedException has been reported in several past issues. Have you tried looking through those?

This troubleshooting post says you might get that error for the following reasons:

The Amazon ECS task role doesn't have the required permissions to run the execute-command command.

The AWS Identity and Access Management (IAM) role or user that's running the command doesn't have the required permissions.

Others have suggested that the issue could be fixed by changing your environment variables or updating your AMI.

Also could you explain why you marked this as potential-regression? Was this working for you in a previous version of the AWS CLI?

leejayhsu commented 1 week ago

Hi @tim-finnigan 👋

Yeah I have looked at most of those past issues, but I will look again to make sure I didn't miss any potential solutions.

For context, I'm using ecs fargate, platform version 1.4

Things I've tried to fix this:

verified that ecs task taskRoleArn and executionRoleArn both have the following permissions

{
    "Statement": [
        {
            "Action": [
                "ssmmessages:OpenDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:CreateControlChannel"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}

verified that my aws role that is trying to exec has the permission ecs:ExecuteCommand
ran https://github.com/aws-containers/amazon-ecs-exec-checker, no errors
ecs task has outbound internet connectivity (but I also created a vpc endpoint for ssmmessages just in case: com.amazonaws.us-west-2.ssmmessages)
do NOT have AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY as env vars in my tasks
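
The task-role permission check in the list above can be sketched as a small script that reports which of the four ssmmessages actions are missing from a policy document. This is a simplified illustration: it only looks at Allow statements, matches action names exactly (plus the wildcards "*" and "ssmmessages:*"), and ignores Deny statements, conditions, and resources. The function name is an assumption made for this example.

```python
import json

# The four actions listed in the policy above.
REQUIRED_ACTIONS = {
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
}


def missing_exec_actions(policy_json: str) -> set[str]:
    """Return the required ssmmessages actions not granted by any Allow statement."""
    # json.loads needs strict JSON, so a trailing comma raises an error here.
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    allowed: set[str] = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        allowed.update(actions)

    if "*" in allowed or "ssmmessages:*" in allowed:
        return set()
    return REQUIRED_ACTIONS - allowed


example_policy = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:OpenDataChannel",
                "ssmmessages:OpenControlChannel"
            ],
            "Resource": "*"
        }
    ]
}
"""

print(sorted(missing_exec_actions(example_policy)))
```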

ecs exec used to work for me, so I thought it would be ok to mark this as a regression. But this is only conjecture on my part, so please remove the tag if you feel it is appropriate!

rnathuji commented 1 week ago

Just to chime in on a potential regression: We are also experiencing this issue with Fargate where things were working fine, and then seemingly stopped working suddenly for no apparent reason. amazon-ecs-exec-checker is clear.

tim-finnigan commented 1 week ago

Thanks for following up - we may need to loop in ECS/Fargate here as well. Did this issue start occurring after updating to a specific version? Could you share your debug logs (with any sensitive info redacted) to help with further investigation?
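
One way to collect the debug logs requested above is to rerun the same command with the AWS CLI's global --debug option, which writes verbose logging to stderr. The sketch below only builds and prints the argument list; the helper name is an assumption, and the values are the ones from the original report.

```python
import shlex


def build_exec_command(
    region: str,
    cluster: str,
    task: str,
    container: str,
    command: str = "/bin/bash",
    debug: bool = True,
) -> list[str]:
    """Build the argv for `aws ecs execute-command`, optionally with --debug."""
    argv = [
        "aws", "ecs", "execute-command",
        "--region", region,
        "--cluster", cluster,
        "--task", task,
        "--container", container,
        "--command", command,
        "--interactive",
    ]
    if debug:
        # --debug is a global AWS CLI option; its output goes to stderr.
        argv.append("--debug")
    return argv


argv = build_exec_command(
    region="us-west-2",
    cluster="core-services",
    task="d179d101efa94c98aa62340b5705d726",
    container="app",
)
# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(argv))
```

Appending 2> debug.log to the printed command captures the log in a file. Redact account IDs, ARNs, and tokens before sharing it.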

rnathuji commented 1 week ago

@tim-finnigan - I...spoke too soon when chiming in above 😅. I believe the issue was a bug in our infrastructure as code which caused some non-determinism related to the subnet associated with tasks. A container cycle caused some to land in an isolated subnet inadvertently, and that was the root issue for the "suddenly for no apparent reason". Fixing the IaC issue solved our problem.
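
A subnet mix-up like the one described above can be spotted by reading the subnetId detail from each task's attachments in aws ecs describe-tasks output. The following is a rough sketch under that assumption. The function names, the sample response, and the subnet IDs are hypothetical placeholders.

```python
from typing import Any


def task_subnets(describe_tasks_response: dict[str, Any]) -> dict[str, list[str]]:
    """Map each task ARN to the subnet IDs found in its attachment details."""
    result: dict[str, list[str]] = {}
    for task in describe_tasks_response.get("tasks", []):
        result[task["taskArn"]] = [
            detail["value"]
            for attachment in task.get("attachments", [])
            for detail in attachment.get("details", [])
            if detail.get("name") == "subnetId"
        ]
    return result


def tasks_outside(
    describe_tasks_response: dict[str, Any], expected_subnets: set[str]
) -> list[str]:
    """Return ARNs of tasks attached to a subnet not in expected_subnets."""
    return [
        arn
        for arn, subnets in task_subnets(describe_tasks_response).items()
        if not set(subnets) <= expected_subnets
    ]


# Hypothetical response; the ARNs and subnet IDs are placeholders.
sample_tasks = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-west-2:xxxxx:task/example/aaa",
            "attachments": [
                {"details": [{"name": "subnetId", "value": "subnet-private-1"}]}
            ],
        },
        {
            "taskArn": "arn:aws:ecs:us-west-2:xxxxx:task/example/bbb",
            "attachments": [
                {"details": [{"name": "subnetId", "value": "subnet-isolated-1"}]}
            ],
        },
    ]
}

print(tasks_outside(sample_tasks, expected_subnets={"subnet-private-1"}))
```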

tim-finnigan commented 1 week ago

No worries, thanks for following up and glad that issue is resolved. For the original issue author — I'll mention this troubleshooting guide again for reference: https://repost.aws/knowledge-center/fargate-ecs-exec-errors. If you're still seeing the issue, please share your debug logs for further investigation.

leejayhsu commented 1 week ago

hi @tim-finnigan

I've narrowed the problem down to a sidecar container, `aws-fluent-bit`, which I was using to stream logs to datadog. I'm not exactly sure why it's a problem, but I can exec into the fargate task once I remove the aws-fluent-bit container from the task definition.

Do you happen to know if there are any known issues that would cause fluent bit to interfere with ecs exec? This is the relevant part of the task def

{
    "name": "log-router",
    "image": "amazon/aws-for-fluent-bit:stable",
    "cpu": 0,
    "portMappings": [],
    "essential": false,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "user": "0",
    "dockerLabels": {
        "com.datadoghq.tags.service": "log-router",
        "com.datadoghq.tags.env": "dev"
    },
    "systemControls": [],
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "config-file-type": "file",
            "config-file-value": "/fluent-bit/configs/parse-json.conf",
            "enable-ecs-log-metadata": "true"
        }
    }
}
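
Before removing a log router, it may help to see which containers in a task definition are involved in FireLens routing. The sketch below lists containers that declare firelensConfiguration and containers whose logConfiguration uses the awsfirelens log driver. It does not explain why the router interferes with ECS Exec, which this thread does not establish. The function name and the sample task definition are assumptions for illustration.

```python
from typing import Any


def firelens_usage(task_definition: dict[str, Any]) -> dict[str, list[str]]:
    """List FireLens router containers and containers that log through FireLens."""
    containers = task_definition.get("containerDefinitions", [])
    routers = [c["name"] for c in containers if "firelensConfiguration" in c]
    senders = [
        c["name"]
        for c in containers
        if c.get("logConfiguration", {}).get("logDriver") == "awsfirelens"
    ]
    return {"routers": routers, "senders": senders}


# Hypothetical, trimmed task definition using the container names from this thread.
sample_task_definition = {
    "containerDefinitions": [
        {
            "name": "log-router",
            "image": "amazon/aws-for-fluent-bit:stable",
            "firelensConfiguration": {"type": "fluentbit"},
        },
        {
            "name": "app",
            "logConfiguration": {"logDriver": "awsfirelens", "options": {}},
        },
        {
            "name": "datadog-agent",
            "logConfiguration": {"logDriver": "awslogs", "options": {}},
        },
    ]
}

print(firelens_usage(sample_task_definition))
```

Any container listed under "senders" needs a different log driver, such as awslogs, before the router is removed.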

leejayhsu commented 1 week ago

confirmed that removing aws-fluent-bit container from the task definition fixed the issue. now ecs exec is working properly.

github-actions[bot] commented 1 week ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

lkashef commented 1 week ago

Hey @leejayhsu, we are facing the same problem. I assume removing the log-router can't be a permanent solution, so I'm curious what you ended up doing?

leejayhsu commented 1 week ago

hi @lkashef 👋 Actually removing log-router was my permanent solution 😄 It only existed in the task definition because the logging aggregator I used recommended streaming logs to it. I'm now just logging to cloudwatch, and no longer using fluent-bit for logging.

sorry this probably isn't the answer you were hoping for!

leejayhsu commented 6 days ago

@lkashef I also had another task which I couldn't exec into, and disabling logging in the datadog-agent container fixed it (this was quite unexpected).

aws / aws-cli