aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Recent updates possibly broke CLI `execute-command` #435

Closed · ssyberg closed this issue 2 years ago

ssyberg commented 2 years ago

There are a number of GitHub issues floating around on related repos that might be tied to recent SSM Agent updates, though this is incredibly difficult to verify from our end. If someone could do a little investigating, that would be great.

The general issue is an inability to run `execute-command` via the CLI, with a TargetNotConnectedException thrown. Existing troubleshooting guides have thus far not yielded success.
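
For reference, a minimal invocation that reproduces this for us (cluster, task, and container names below are placeholders; assumes ECS Exec is already enabled on the service):

```sh
# Placeholder identifiers; ECS Exec is assumed to be enabled on the service.
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container my-container \
  --command "/bin/sh" \
  --interactive
# Fails with:
# An error occurred (TargetNotConnectedException) when calling the
# ExecuteCommand operation: The execute command failed due to an internal
# error. Try again later.
```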

Related tickets:

- https://github.com/aws/aws-cli/issues/6834
- https://github.com/aws/aws-cli/issues/6562
- https://github.com/aws-containers/amazon-ecs-exec-checker/issues/47

GeorgeNagel commented 2 years ago

Example output from `aws ecs execute-command ...`:

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.
ssyberg commented 2 years ago

Exact output for everyone with this problem as far as I can tell ☝🏼

tim-finnigan commented 2 years ago

This looks related: https://github.com/aws-containers/amazon-ecs-exec-checker/issues/49

Do you also have AWS_ACCESS_KEY / AWS_SECRET_ACCESS_KEY set? That may be causing the issue.

ssyberg commented 2 years ago

> Do you also have AWS_SECRET_ACCESS_KEY set? That may be causing the issue.

If my parsing of the Terraform config can be trusted, we are not setting that in `environment_variables`, but it is available in the `secrets`.

I'll try removing this now and see if that makes a difference.

ssyberg commented 2 years ago

> > Do you also have AWS_SECRET_ACCESS_KEY set? That may be causing the issue.
>
> If my parsing of the Terraform config can be trusted, we are not setting that in `environment_variables`, but it is available in the `secrets`.
>
> I'll try removing this now and see if that makes a difference.

Holy moly, that worked! That said, we do actively use those credentials in our task, so we'll need a workaround for exposing them. It still seems like setting these env vars shouldn't have this effect, right?
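
In case it helps anyone else, this is roughly how I checked which AWS_* variables the task definition exposes (the task family name is a placeholder; assumes jq is installed):

```sh
# "my-app" is a placeholder task definition family; requires jq.
# Lists any AWS_*-prefixed environment variables each container sets.
aws ecs describe-task-definition --task-definition my-app \
  --query 'taskDefinition.containerDefinitions[].environment' \
  --output json \
  | jq '.[] | map(select(.name | startswith("AWS_")))'
```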

tim-finnigan commented 2 years ago

Glad that worked! I'm waiting on more info regarding this and will post an update here.

farkmarnum commented 2 years ago

Renaming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY has also fixed the problem for me! Bizarre that this just started happening at ~5pm EST on March 30 out of nowhere.

nathando commented 2 years ago

Can we revert to a previous version of the AWS CLI to fix this? Changing the environment variables will break other things in our tasks.

raptorcZn commented 2 years ago

Facing this issue as well. As @nathando mentioned, it would be great if this reverted to the previous behaviour so that we don't have to change the environment variables.

nicolasbuch commented 2 years ago

Encountered this error out of nowhere 4 days ago: "An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later."

In my case I also had AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as env variables (in my task definition) since my application needs to interact with the AWS API. It was working fine until now, so something must have changed in recent updates.

There is no need to change the environment variables, though; all you need to do is give the user (AWS_ACCESS_KEY_ID) permission to allow the ECS exec command:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"
        }
    ]
}
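
If you go this route, a rough sketch of attaching the statement above to the IAM user behind that access key (user name, policy name, and file name are placeholders):

```sh
# Placeholder names; assumes the statement above is saved as ssmmessages-policy.json.
aws iam put-user-policy \
  --user-name my-app-user \
  --policy-name allow-ecs-exec-ssmmessages \
  --policy-document file://ssmmessages-policy.json
```
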
tim-finnigan commented 2 years ago

Thanks @nicolasbuch, those requirements are also documented here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html#ecs-exec-prerequisites as well as this troubleshooting article for the TargetNotConnectedException error: https://aws.amazon.com/premiumsupport/knowledge-center/ecs-error-execute-command/

Those requirements aren’t new so I’m not sure why recent updates would be a factor here. Has anyone tried rolling back to a previous SSM Agent version to see if they still see this issue? It would help the team to have agent logs from a container that is experiencing the issue. You could provide those here or contact AWS Support.
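
If you can still get a shell in an affected container by some other route (for example from a task revision with the variables temporarily removed), the agent writes its logs inside the container; the paths below are the SSM Agent defaults and may differ in your setup:

```sh
# SSM Agent default log locations inside the container (may differ per image).
cat /var/log/amazon/ssm/amazon-ssm-agent.log
cat /var/log/amazon/ssm/errors.log 2>/dev/null
```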

Thor-Bjorgvinsson commented 2 years ago

The agent version in ECS Exec is controlled by ECS during AMI build and they say they haven't changed the version recently. Can anyone here that encountered the issue and has removed AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from their environment start a session and get the agent version?

# Assuming your session starts in the ECS Exec bin folder
./amazon-ssm-agent -version

Also, are you seeing this issue on ECS on EC2 or Fargate?
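
For Fargate, one hedged way to run that check through ECS Exec itself once the mitigation is in place (the agent binary path is an assumption about where Fargate mounts the ECS Exec agent; cluster/task/container names are placeholders):

```sh
# Path to the agent binary is an assumption about the Fargate ECS Exec mount;
# adjust if your session lands elsewhere.
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container my-container \
  --interactive \
  --command "/managed-agents/execute-command/amazon-ssm-agent -version"
```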

pauricthelodger commented 2 years ago

@Thor-Bjorgvinsson after making the change and removing the env vars, I can access the containers and see the following versions according to the log output on Fargate tasks:

amazon-ssm-agent - v3.1.715.0
ssm-agent-worker - v3.1.715.0
GeorgeNagel commented 2 years ago

@Thor-Bjorgvinsson Seeing the issue on Fargate.

yufio commented 2 years ago

We have also been experiencing the same issue since last Friday (01 April 2022). We didn't change anything and the command execution stopped working. We also have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the env. Funny enough, on one of our environments it still works, but on two others it stopped. We are now investigating the permission differences. The user on that env has admin access rights (dev env).

Thor-Bjorgvinsson commented 2 years ago

We've confirmed that this is an SSM Agent issue in a recent Fargate deployment where the agent version was updated. Any new tasks started in Fargate will use an SSM Agent build with this issue. We are working with the Fargate team to deploy a fix for this. Mitigation, as mentioned above: remove AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the task definition environment variables.

akhiljalagam commented 2 years ago

> Encountered this error out of nowhere 4 days ago: "An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later."
>
> In my case I also had AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as env variables (in my task definition) since my application needs to interact with the AWS API. It was working fine until now, so something must have changed in recent updates.
>
> There is no need to change the environment variables, though; all you need to do is give the user (AWS_ACCESS_KEY_ID) permission to allow the ECS exec command:
>
> {
>     "Version": "2012-10-17",
>     "Statement": [
>         {
>             "Effect": "Allow",
>             "Action": [
>                 "ssmmessages:CreateControlChannel",
>                 "ssmmessages:CreateDataChannel",
>                 "ssmmessages:OpenControlChannel",
>                 "ssmmessages:OpenDataChannel"
>             ],
>             "Resource": "*"
>         }
>     ]
> }

This worked for me.

Thor-Bjorgvinsson commented 2 years ago

@akhiljalagam I can confirm this can be used as a mitigation today, but it is not recommended; it will no longer be possible in the near future, sometime after the fix has been released. The agent will then only be able to connect using ECS task metadata service credentials.

The recommended mitigation is to unset the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
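
A rough sketch of doing that with the CLI, for anyone who manages task definitions by hand (task family, cluster, and service names are placeholders; assumes jq):

```sh
# Placeholder names throughout; requires jq.
# 1) Fetch the current task definition.
aws ecs describe-task-definition --task-definition my-app \
  --query 'taskDefinition' > taskdef.json

# 2) Drop the two variables from every container and strip the read-only
#    fields that register-task-definition will not accept back.
jq '(.containerDefinitions |= map(.environment |= ((. // []) | map(select(.name != "AWS_ACCESS_KEY_ID" and .name != "AWS_SECRET_ACCESS_KEY")))))
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)' \
    taskdef.json > taskdef-cleaned.json

# 3) Register the cleaned revision and point the service at it.
aws ecs register-task-definition --cli-input-json file://taskdef-cleaned.json
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-app
```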

Sohett commented 2 years ago

@Thor-Bjorgvinsson, how can we follow the status of this? I don't want to be too pushy, but we're really blocked 😬. Is there some kind of prioritisation as this is a regression?

Anyway, thanks for the work 💪 .

Thor-Bjorgvinsson commented 2 years ago

We've pushed out a fix in agent release 3.1.1260.0 for this issue. We're currently working with related AWS services to integrate this fix; we'll add further updates as those integrations are completed.

jmagoon commented 2 years ago

For other people who come across this issue, this error happens for us when we have AWS_SHARED_CREDENTIALS_FILE set as an environment variable as well. When it is removed, ecs execute-command works correctly.

ZacBridge commented 2 years ago

Hopefully this doesn't put a spanner in the works, but I've been having this issue across all of my services. Only one of the services actually had AWS env vars in it; after renaming those, that service was fine.

The others, however, still respond with the same "Internal server error", with no AWS env vars to note on the tasks.

GeorgeNagel commented 2 years ago

I'm seeing this again since the 3.1.1260.0 release. Is it possible other env variable names are now disallowed? In particular, I had changed my AWS_SECRET_ACCESS_KEY env variable to AWS_SECRET_ACCESS_KEY_ECS, which was working until the 3.1.1260.0 release. After changing that key to AWS_SECRET_ACCESS_KEY_<something>_ECS, I am able to connect again.

I'm wondering if the fix in 3.1.1260.0 was to switch from using AWS_SECRET_ACCESS_KEY to AWS_SECRET_ACCESS_KEY_ECS in some internal API. If so, perhaps more of a root-cause fix is needed, or documentation specifying which env variable names cause these conflicts.

djGrill commented 2 years ago

maybe it's partially matching AWS_SECRET_ACCESS_KEY* instead of just AWS_SECRET_ACCESS_KEY? 🤔

justinko commented 2 years ago

No error for me with AWS_SECRET_ACCESS_KEY_2

bigbluechicken commented 2 years ago

Is there any update on when the fix will be rolled out?

serhiibeznisko commented 2 years ago

Renaming AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID variables did the job!

Thor-Bjorgvinsson commented 2 years ago

ECS has released a new AMI with the updated SSM Agent (ECS-optimized AMI version 20220421); the Fargate release is still pending.

SSM Agent commit to resolve this issue
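
For the EC2 launch type, one way to check whether the recommended ECS-optimized AMI in your region has reached that release is the public SSM parameter for the Amazon Linux 2 ECS-optimized AMI (jq is only used here for pretty-printing):

```sh
# Prints the recommended ECS-optimized AL2 AMI metadata (image name/id, agent versions).
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended \
  --query 'Parameters[0].Value' --output text | jq .
```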

Sohett commented 2 years ago

Any news concerning the Fargate release?

benoawfu commented 2 years ago

Without changing anything regarding the env variables, I redeployed my ECS Fargate tasks, and with the latest AWS CLI this works fine now.

Thor-Bjorgvinsson commented 2 years ago

Fargate has completed the release of the new agent.

andarocks commented 1 year ago

Hi guys,

I have run the ECS exec checker and below is the result:

-------------------------------------------------------------
Prerequisites for check-ecs-exec.sh v0.7
-------------------------------------------------------------
  jq      | OK (/opt/homebrew/bin/jq)
  AWS CLI | OK (/opt/homebrew/bin/aws)

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (aws-cli/2.11.9 Python/3.11.2 Darwin/22.4.0 source/arm64 prompt/off)
  Session Manager Plugin | OK (1.2.463.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : eu-west-1
Cluster: app-service-cluster-test
Task   : b460c8c1bb334429a39ff7a4b1bad180
-------------------------------------------------------------
  Cluster Configuration  |
     KMS Key       : Not Configured
     Audit Logging : DEFAULT
     S3 Bucket Name: Not Configured
     CW Log Group  : Not Configured
  Can I ExecuteCommand?  | arn:aws:iam::117038214493:user/cli-admin
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "app-service-test-container"
    ----------
      Init Process Enabled (app-service-task-definition-test:18)
    ----------
         1. Disabled - "app-service-test-container"
    ----------
      Read-Only Root Filesystem (app-service-task-definition-test:18)
    ----------
         1. Disabled - "app-service-test-container"
  Task Role Permissions  | arn:aws:iam::117038214493:role/TuskProdECSTaskRole
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
  VPC Endpoints          |
    Found existing endpoints for vpc-00bfcd992d7f50681:
      - com.amazonaws.eu-west-1.ssmmessages
      - com.amazonaws.eu-west-1.s3
      - com.amazonaws.vpce.eu-west-1.vpce-svc-0e7975f61ffb9d0f7
  Environment Variables  | (app-service-task-definition-test:18)
       1. container "app-service-test-container"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined

All the configuration seems to be okay... AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are not defined, but I am still getting TargetNotConnectedException. Am I missing something?

AWS CLI version: 2.11.9

andarocks commented 1 year ago

But the AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY variables are defined in the .env file inside the container. I hope that's not the issue.
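
A .env file that only the application loads should not matter here, since the exec agent inherits the container environment rather than anything your app reads at runtime. One way to double-check what is actually exported in the container environment (run from a shell inside the container):

```sh
# Shows only variables exported in the container environment, not values
# an application loads itself from a .env file.
env | grep '^AWS_' || echo "no AWS_* variables exported"
```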

istvanfedak-nbcu commented 11 months ago

I'm experiencing this exact same issue. The aws ecs execute-command was working for me last week, and now it has stopped working.

obaqueiro commented 6 months ago

Did anyone else stumble into this problem again?

We started getting this issue again. There are no AWS_ACCESS_KEY / SECRET variables defined, and check-ecs-exec.sh shows everything OK (green and yellow):

-------------------------------------------------------------
Prerequisites for check-ecs-exec.sh v0.7
-------------------------------------------------------------
  jq      | OK (/usr/bin/jq)
  AWS CLI | OK (/usr/local/bin/aws)

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (…t.21 prompt/off)
  Session Manager Plugin | OK (1.2.497.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : us-east-1
Cluster: cluster-name
Task   : 949fd5e48ebf4ba4b895176cb0c36d50
  Cluster Configuration  |
     KMS Key       : Not Configured
     Audit Logging : DEFAULT
     S3 Bucket Name: Not Configured
     CW Log Group  : Not Configured
  Can I ExecuteCommand?  | arn:aws:iam::xxx:user/deployment
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks | 
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "metabase_app_dev"
    ----------
      Init Process Enabled (metabase_dev:3)
    ----------
         1. Disabled - "metabase_app_dev"
    ----------
      Read-Only Root Filesystem (metabase_dev:3)
    ----------
         1. Disabled - "metabase_app_dev"
  Task Role Permissions  | arn:aws:iam::xxx:role/metabase_ecsTaskExecutionRole_dev
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
  VPC Endpoints          | 
    Found existing endpoints for vpc-081adc23fcb697c58:
      - com.amazonaws.us-east-1.execute-api
      - com.amazonaws.us-east-1.secretsmanager
      - com.amazonaws.vpce.us-east-1.vpce-svc-0256367e65088edb5
      - com.amazonaws.us-east-1.ssmmessages
  Environment Variables  | (metabase_dev:3)
       1. container "metabase_app_dev"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined

$ aws ecs execute-command  --cluster cluster-name --task 949fd5e48ebf4ba4b895176cb0c36d50 --container  metabase_app_dev --command 'sh' --interactive

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.