aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

SSM agent under Fargate using the new ECS Exec feature is crashing #361

Open youssefNM opened 3 years ago

youssefNM commented 3 years ago

Hey there,

We are trying to set up the new AWS ECS Exec feature with our Fargate services so we can run commands on tasks. We followed the setup article, but the SSM agent goes into "STOPPED" status when the task starts. I was able to check the agent logs inside the container, and this is the error:

user@ip-xx-xxx-xxx-xx:/opt/app$ sudo cat /var/log/amazon/ssm/errors.log
2021-03-24 20:14:23 ERROR [run @ agent.go.104] error occurred when starting amazon-ssm-agent: failed to start message bus, failed to start health channel: failed to listen on the channel: ipc:///var/lib/amazon/ssm/ipc/health, address in use

Checking the running processes inside the container, I can see the two SSM processes (amazon-ssm-agent and ssm-agent-worker), and there is no duplicate SSM agent process that might explain the "address in use" error:

root        21  0.0  0.3 1398768 14552 ?       Ssl  20:24   0:00 /managed-agents/execute-command/amazon-ssm-agent
root        40  0.0  0.8 1336320 32196 ?       Sl   20:24   0:00 /managed-agents/execute-command/ssm-agent-worker

amazon-ssm-agent version: v3.1.36.0; OS: Debian GNU/Linux 10 (buster)

Any idea why this is happening?

youssefNM commented 3 years ago

I did some further investigation to see which process is using the file /var/lib/amazon/ssm/ipc/health, and I can see that only one process (PID 21 from above), the amazon-ssm-agent, is accessing that file:

root@ip-xx-xxx-xxx-xxx:/opt/app# lsof /var/lib/amazon/ssm/ipc/health
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF  NODE NAME
amazon-ss  21 root   10u  unix 0xffff8880bd03b000      0t0 23964 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss  21 root   11u  unix 0xffff8880c0dd7c00      0t0 21918 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss  21 root   15u  unix 0xffff8880bbff2000      0t0 22317 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss  21 root   16u  unix 0xffff8880bd03dc00      0t0 24945 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss  21 root   17u  unix 0xffff8880bbe52c00      0t0 25002 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss  21 root   18u  unix 0xffff8880e7969000      0t0 28820 /var/lib/amazon/ssm/ipc/health type=STREAM

So the error remains odd and inexplicable at this time, given that only one ssm-agent process is accessing the /var/lib/amazon/ssm/ipc/health file.

dsouzajude commented 3 years ago

Hey there! Isn't this expected behaviour, as per the article?

This is made possible by bind-mounting the necessary SSM agent binaries into the container.

So the SSM Agent will be mounted into the container and run. IMO, you don't need both the ECS Exec feature and the AWS SSM agent running at the same time; both give you access inside the container, so you can choose either option depending on your use case.

If you simply want to debug an app inside the container, you can use the ECS Exec option and it will auto-mount the SSM agent for you. You don't need to install amazon-ssm-agent again inside the container.

If you just want to run an SSM managed session inside the container or run commands, you can manually install amazon-ssm-agent into the image and not enable the ECS Exec feature for the task.

Hope this helps! If not, maybe you can explain a bit more what your use case is.

nscott commented 3 years ago

@dsouzajude as you noted, the ECS Exec command uses the SSM agent. I hit this bug multiple times a week; probably half my containers have this issue. I do not install the SSM agent manually. All I want to do is use ECS Exec.

baonguyen84 commented 3 years ago

We hit the same issue here, using Fargate and without installing the SSM agent ourselves. The "not always works" workaround is killing the task so the ECS service launches a new one. Sometimes it works, sometimes it doesn't.

kartikrao commented 3 years ago

Same issue here; https://github.com/aws-containers/amazon-ecs-exec-checker reports that everything is set up correctly.

But attempts to connect crash the agent.

youssefNM commented 3 years ago

@dsouzajude We are not installing any additional amazon-ssm-agent inside the container; we followed the official AWS article step by step to set up ECS Exec in our Fargate environment. The amazon-ssm-agent shown in the results I attached above comes from enabling ECS Exec on the Fargate task.

youssefNM commented 3 years ago

Sorry, I closed the issue by mistake! This is still happening and is preventing us from widely adopting the ECS Exec feature with Fargate.

shakscode commented 2 years ago

I'm facing this issue too. Any resolution on this?

shakscode commented 2 years ago

It works for me now! It was my image's behaviour.

edmundcraske-bjss commented 2 years ago

On my project we are seeing the same issue, where sometimes containers cannot be 'ECS Exec'ed into, and aws ecs describe-tasks shows ExecuteCommandAgent as STOPPED rather than RUNNING. We can stop those containers and they are replaced with new ones, and ECS Exec then generally works, but it's not clear why the agent is stopping. Is there some way to at least have a container whose agent has stopped fail health checks and get replaced?
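
A minimal sketch of such a check, assuming the aws CLI v2 is available where the check runs and the caller has ecs:DescribeTasks permission; the helper name and the JMESPath query are illustrative, not an official mechanism:

```shell
# Decide health from the ExecuteCommandAgent statuses of a task.
# In practice, obtain the statuses string with something like:
#   aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK" \
#     --query 'tasks[].containers[].managedAgents[?name==`ExecuteCommandAgent`].lastStatus' \
#     --output text
exec_agent_healthy() {
  statuses="$1"
  case "$statuses" in
    *STOPPED*|"") return 1 ;;   # unhealthy: an agent stopped, or none found
    *)            return 0 ;;   # healthy: all agents report RUNNING
  esac
}
```

A watchdog could run this periodically and stop the task (letting the service replace it) when the check fails.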

youssefNM commented 2 years ago

Any update on this? The issue is still biting us: most of our Fargate containers fail to accept remote execute commands from ECS Exec, the SSM agent shows as STOPPED in all of the affected Fargate tasks, and as many people have mentioned, the issue happens randomly.

I even opened a support ticket with our AWS support tier in the past; they confirmed the issue and said the ECS+SSM technical team rolled out a fix, but it seems that didn't resolve it. Their explanation of the issue:

The issue is that the Fargate agent lost the SSM agent status because the process somehow lost the UUID set by containerd. The SSM agent is actually running inside the container. The Fargate agent then retries to start the SSM agent process, but since the SSM agent is already running, it can't be started again, and it throws the "address in use" error.

The confirmation of the fix they rolled out :

I see that the ECS+SSM team rolled out an update for the fix, and the date of completion for these deployments was between 2nd and 5th August 2021.

This issue should be prioritized, as I believe many Fargate customers who use the ECS Exec feature are impacted by it!

A similar ticket was also reported on the AWS Forum: https://forums.aws.amazon.com/message.jspa?messageID=980336

kylemacfarlane commented 2 years ago

I had this issue a lot when ECS Exec first launched, and it did seem to get fixed, but it now seems to have completely regressed. About a week ago I couldn't log in to any container, as the agent had stopped on all of them. I kept launching extra containers, and it didn't work until the fourth.

mtommila commented 2 years ago

According to AWS support, a workaround is to use "ssm start-session" instead of ECS Exec. It essentially seems to do the same thing but it should work even when ECS Exec is failing.

The trick is in using a target parameter that is in the format

ecs:clustername_taskid_containerruntimeid

Then you can run something like this to run a command on the container

aws ssm start-session --target ecs:clustername_taskid_containerruntimeid --document-name AWS-StartInteractiveCommand --parameters '{"command":["whatever command"]}'

Or to get an interactive session to the container, just

aws ssm start-session --target ecs:clustername_taskid_containerruntimeid
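
For anyone scripting this workaround, the target string can be composed from those three pieces. A minimal sketch (the helper name and all values are illustrative; the container runtime ID comes from the runtimeId field of the container in aws ecs describe-tasks output):

```shell
# Compose the SSM start-session target for an ECS container.
# Arguments: cluster name, task ID, container runtime ID (all placeholders here).
make_ecs_ssm_target() {
  printf 'ecs:%s_%s_%s' "$1" "$2" "$3"
}

# Illustrative values only:
TARGET=$(make_ecs_ssm_target mycluster 0123456789abcdef 0123456789abcdef-1234567890)
# then: aws ssm start-session --target "$TARGET"
```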

nscott commented 2 years ago

This has gotten significantly worse for me in the past few weeks. Not sure what changed, but feels like a regression.

daniel-0906 commented 2 years ago

Having the same problem as above (ECS Fargate). The odd thing is that it works on some tasks and not on others. We didn't have this issue before; I believe it started around early November for us.

youssefNM commented 2 years ago

@mtommila The workaround is working, but it has limitations compared to ECS Exec. One of them is that you lose the default S3 bucket integration for logs and audit; it also seems that aws ssm start-session doesn't support granular IAM permissions to limit access, whereas ECS Exec supports more policy condition keys!


GrubLord commented 2 years ago

Having this issue myself, the agent is just STOPPED, no way to get in. =/

Would love to see a fix.

bhsp commented 2 years ago

Having the same issue. We use CloudFormation to deploy multiple ECS Fargate microservices and used https://github.com/aws-containers/amazon-ecs-exec-checker to verify our set-up.

Sometimes a task will have "running" for "Managed Agent Status" across all containers in the task; sometimes a task will have one or two containers as "stopped" and the remaining container as "running". "Managed Agent Status" will sometimes be "stopped" on init (fresh deployment) and other times after some period of time. Since we're using CFN for deployments, the inconsistency in "Managed Agent Status" is confusing.

We'll see "Managed Agent Status" crash and sometimes run on a Corretto container, and the same for DataDog and Fluent Bit containers; i.e., sometimes the agent works for days, sometimes it borks nearly right away.

bhsp commented 2 years ago

Follow-up: aws ssm start-session works in every case, even when "Managed Agent Status" crashes on a container for whatever reason.

g3kr commented 2 years ago

Having the same issue. Used https://github.com/aws-containers/amazon-ecs-exec-checker to verify our set-up. Most of the time, a task will have "stopped" for "Managed Agent Status" across a few containers in the task. Can this be prioritized?

jtsinnott commented 2 years ago

I am having the same issue. ECS Exec was previously working reliably, as I had set up my task execution role with these permissions:

{
  "Effect": "Allow",
  "Action": [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel"
  ],
  "Resource": "*"
}

Then it stopped working after the SSM upgrade (which we do not control). After some experimentation, I was able to get exec to work again by granting my application's AWS user account these same permissions. Why? Well, my container has AWS credentials injected via an environment file (as supported by task definitions). SSM seems to have changed to authenticate using whatever AWS credentials exist in the container's environment. This is wrong; it should be using the exec role, I believe. Not sure why this changed.

I believe the SSM security behaviour should be reverted to use the security context of the task execution role, so that we do not have to grant our application user these permissions and can continue to use environment variables for our application credentials.

nscott commented 2 years ago

This frustrated me so much, and happened so often, that I wrote a script to fall back to ssm start-session, which works every single time.

I prefix my resources with the environment type (e.g. prod, beta). Feel free to remove the $env variable if you don't use a prefix for your resources; it may take a little massaging to work.

This also allows you to pass in an offset, so if a task is having a problem you can just increment the offset variable and get another task.

https://gist.github.com/nscott/169bbf6a10f4c4fbd6194b3cdc5707b7

andymac4182 commented 2 years ago

For anyone having issues: have you tried https://github.com/tedsmitt/ecsgo? I'm curious whether it's something in the AWS CLI or in the actual API.

If it doesn't work in ecsgo, I am sure it wouldn't be hard to add that fallback feature.

nscott commented 2 years ago

I haven't tried, but it's not something in the API either: it's the service on the container crashing. In my container start script I have a fix_ssm function.

For a while I tried to explicitly kill it and restart it. I have it set just to log at this point, and I'll probably turn off the logging since it's so intermittent.

function fix_ssm() {
  echo "Trying to fix SSM"
  lsof /var/lib/amazon/ssm/ipc/health
  ps aux
  PID_TO_KILL=$(pidof /managed-agents/execute-command/amazon-ssm-agent)
  echo "Killing SSM agent ID " $PID_TO_KILL
  kill -9 $PID_TO_KILL
  rm -rf /var/lib/amazon/ssm/ipc/health
  echo "Relaunching SSM agent"
  /managed-agents/execute-command/amazon-ssm-agent &
}

# https://stackoverflow.com/questions/65218749/unable-to-start-the-amazon-ssm-agent-failed-to-start-message-bus
# https://forums.aws.amazon.com/message.jspa?messageID=981199#981199
# https://github.com/aws/amazon-ssm-agent/issues/361
# Use || true to always allow the command to succeed, even on development containers
# The SSM agent will be installed automatically on AWS ECS Fargate
(sleep 30 && (fix_ssm || true)) &

echo "Tailing SSM agent logs"
tail -f /var/log/amazon/ssm/amazon-ssm-agent.log &

lsof /var/lib/amazon/ssm/ipc/health || true

It also makes me furious that my forum post was archived with no way to view it, even as an archive.

The agent often dies or is dead. There's a bunch of logging output I've captured in the past, but at this point I've just given up and accept that it's not going to work 100% of the time. The ssm start-session fallback isn't as good, since it doesn't drop you into the expected path, I don't know if the audit trail is the same, etc.

Another piece of feature creep in AWS that's very helpful but won't be supported correctly.

andymac4182 commented 2 years ago

Interestingly, looking at the SSM agent releases, the version pushed by Fargate at the moment is Release 3.1.1260.0 (2022-04-12); the next version has what seem to be a few bug fixes for initialization: https://github.com/aws/amazon-ssm-agent/releases

VishnuKarthikRavindran commented 2 years ago


Thanks @jtsinnott for reaching out to us. The issue you mentioned is resolved; further information about it can be found here: https://github.com/aws/amazon-ssm-agent/issues/435

andymac4182 commented 2 years ago

@VishnuKarthikRavindran How often does the ECS team upgrade the SSM agent in Fargate?

matthewhembree commented 2 years ago

Add a 👍 to https://github.com/aws/containers-roadmap/issues/1756 if you want the SSM version bumped.

AnatolyBuga commented 1 year ago

aws ssm start-session --target ecs:clustername_taskid_containerruntimeid

Error when calling StartSession operation: is not connected.

My command is aws ssm start-session --target ecs:UltimaF_838d773b17954bcfbbacf343fb4fea70_838d773b17954bcfbbacf343fb4fea70-2587323273, which follows the ecs:clustername_taskid_containerruntimeid format.

Any help/hints would be appreciated!

tordaale commented 1 year ago

aws ssm start-session --target ecs:clustername_taskid_containerruntimeid --document-name AWS-StartInteractiveCommand --parameters '{"command":["whatever command"]}'

Hi! I'm facing the same issue too; this command throws a "Target is not connected" error for me. Did you do something to register the instance? My cluster is running in a private subnet, and the exec checker reports all green checks.

The logic of building up the target makes sense to me: on every execute-command run I got an entry in Session History with this structure for the instance ID, but it always got Terminated within 3 seconds.

leonsodhi-lf commented 1 year ago

@tordaale I can't guarantee this is the issue, but do the tasks running in the private subnet have a way to access the AWS Systems Manager endpoints? That might be an Internet gateway, a NAT gateway, or maybe some kind of proxy. If not, and you don't want to open up access, you may need to set up a VPC endpoint.

xx745 commented 1 year ago

A few months have passed, and I'm facing the same problem. Has anyone found a solution?

ganeshgk commented 11 months ago


@mtommila is this workaround supposed to work only with the EC2 launch type, or also with Fargate? Any idea? I always get a TargetNotConnected error for ECS with the Fargate launch type:

aws ssm start-session --target ecs:clusternameredacted_37d7319a58d4420f90b063e365a8464d_37d7319a58d4420f90b063e365a8464d-3839356491

An error occurred (TargetNotConnected) when calling the StartSession operation: ecs::clusternameredacted_37d7319a58d4420f90b063e365a8464d_37d7319a58d4420f90b063e365a8464d-3839356491 is not connected

mtommila commented 11 months ago

I have only used it with Fargate so I can only say that it works with Fargate.

ganeshgk commented 11 months ago

@mtommila any idea what could be causing the issue here? I exported the keys and got the runtime and task IDs by describing the task; when running the command, it throws a "target not connected" error.

mtommila commented 11 months ago

At least one thing I have noticed: if you set PidMode to task, then on some Docker images (but not all) you just won't be able to connect to the container with SSM. No idea why, though.

nojeffrey commented 10 months ago

Finally managed to get logs off the container. For us, the agent was crashing because NO_PROXY wasn't set, so it couldn't connect to http://169.254.170.2 to pull metadata, failing with: ERROR error fetching the instanceID, v3 container metadata: incorrect status code 404. Once that was set, it was still crashing because it couldn't connect out to https://ssmmessages.region.amazonaws.com/v1/control-channel... Once we set HTTP_PROXY and HTTPS_PROXY as well, the agent finally stayed in the RUNNING state.
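
For reference, a sketch of a proxy environment matching this finding; the proxy URL is a hypothetical placeholder, and the exact NO_PROXY list may need adjusting for your setup:

```shell
# Hypothetical proxy URL; replace with your own.
PROXY_URL="http://proxy.internal:3128"
export HTTP_PROXY="$PROXY_URL"
export HTTPS_PROXY="$PROXY_URL"
# 169.254.170.2 is the ECS task metadata endpoint and 169.254.169.254 is the
# EC2 instance metadata endpoint; both must bypass the proxy, otherwise the
# agent fails to fetch container metadata.
export NO_PROXY="169.254.170.2,169.254.169.254,localhost,127.0.0.1"
```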

gdm commented 3 months ago

For me, it started working only after I changed ReadonlyRootFilesystem to false.

Additional info:

"launchType": "FARGATE"
LinuxParameters:
  InitProcessEnabled: True
Docker image derived from amazoncorretto:17-al2023-jdk
RAM: 1GB
/managed-agents/execute-command/amazon-ssm-agent --version
SSM Agent version: 3.2.2303.0