aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Session Manager start-session hangs when root partition full #170

Open iancward opened 5 years ago

iancward commented 5 years ago

I don't know if this is the appropriate place for this, but when I attempt to start a session (either via the AWS Console or via the CLI's session-manager-plugin) on an EC2 instance with a full root partition, it just hangs. In the console I get a blank screen for a long time and then eventually a blinking cursor that doesn't respond.

In the CLI (via session-manager-plugin), I get a message that it's starting a session but then it just hangs. The CLI/plugin doesn't respond to Ctrl+C or Ctrl+D; in fact, I have to start a new terminal on my workstation and kill the CLI command and the plugin command.

nitikagoyal87 commented 5 years ago

Thanks for reaching out to us. We will investigate this.

Millerborn commented 4 years ago

Was this ever resolved? I'm having a similar problem where I just see a black screen in the console, and if I click, I get a cursor that doesn't do anything.

TajMahPaul commented 4 years ago

I am also having this issue

IrinaTerlizhenko commented 4 years ago

Hi, I'm experiencing this issue as well. Are there any updates on whether it is going to be fixed? It would be great to at least get a descriptive error message if the connection to instances with no space is impossible. Right now the AWS Console / CLI just hangs without any visible reason.

xacaxulu commented 4 years ago

+1. Using the latest Ubuntu 18 AMI with the SSM agent installed and running, and the necessary SSM/CloudWatch policies attached to the instance role. Weirdest thing: it happens on some instances and not on others. Seems like a bug.

iniinikoski commented 4 years ago

Unfortunately this is still happening. AWS, you should really do something here.

cholletjo commented 4 years ago

Hello,

We experience the same issue on Red Hat 7.7. We couldn't reach the instance through Session Manager once the /var partition was full.

On the other hand, we observed different behavior when only /var/log was full: the machine was still reachable. In any case, when we rely on Session Manager for remote access to a server, we would expect to retain access to the EC2 instance even with a full filesystem.

rossmckelvie commented 3 years ago

In all cases where Session Manager has not been able to successfully open a connection due to disk full, I'm able to use SSH to get access. We would like to remove SSH and switch solely to Session Manager, but that doesn't seem possible with longstanding issues like this and we are leaving SSH as a backup, where we can open the ports and distribute the pem key as needed during emergencies.

crawforde commented 3 years ago

This is a bit of a circular problem given that the way you're supposed to check the available disk space on a volume is by opening a terminal session in the instance. 🤦‍♀️ https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-describing-volumes.html
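One partial escape from that loop: if the CloudWatch agent happens to be configured to publish disk metrics (not the default), you can check usage from outside the instance. A sketch, assuming the agent's default `CWAgent` namespace and `disk_used_percent` metric name; the instance ID below is a placeholder:

```shell
# Query root-volume usage reported by the CloudWatch agent, without a
# shell session. Assumes the agent publishes disk metrics (it does NOT
# by default); i-0123456789abcdef0 is a placeholder instance ID.
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 Name=path,Value=/ \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Maximum
```

Note that CloudWatch dimension matching is exact, so you may need to pass the full dimension set your agent actually publishes (often also `device` and `fstype`).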

jackdcasey commented 3 years ago

Not sure if it's totally related, but I'm running into an issue where the start-session command hangs when the target instance is offline. As @iancward mentioned, Ctrl+C etc. do not exit; closing and re-opening the terminal is necessary.

I haven't had the opportunity to really dig into it, but I took a quick look at the code for the CLI, and found this:

https://github.com/aws/aws-cli/blob/master/awscli/customizations/sessionmanager.py

        try:
            # ignore_user_entered_signals ignores these signals
            # because if the signals that kill the process were not
            # captured, they would kill the foreground process but not
            # the background one. Capturing them prevents the process
            # from being killed; the signals are passed as input to the
            # plugin and handled there.
            with ignore_user_entered_signals():
                # call executable with necessary input
                check_call(["session-manager-plugin",
                            json.dumps(response),
                            region_name,
                            "StartSession",
                            profile_name,
                            json.dumps(parameters),
                            endpoint_url])
            return 0

Looks like the terminate signals are being swallowed intentionally? I'm not totally sure, but I reckon this ties into things 😄
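A practical consequence of that signal handling: the hung session can't be interrupted from its own terminal, but it can be killed from another one. A minimal sketch (the process name matches what I see on my machine; yours may differ):

```shell
# The CLI wrapper swallows Ctrl+C / Ctrl+\ on purpose (see the snippet
# above), so terminate the hung session from a second terminal instead.
pgrep -af session-manager-plugin || echo "no plugin process running"
pkill -f session-manager-plugin  || echo "nothing to kill"
# pkill sends SIGTERM to the plugin; the wrapping aws cli should exit
# once the plugin dies. If the terminal is left garbled: stty sane
```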

twhetzel commented 2 years ago

Any updates on resolving this issue? I've run into the same problem.

gaalandr commented 2 years ago

Any updates please?

twhetzel commented 2 years ago

I've heard I should try to increase the volume size, but it's not clear if this will delete all data on the disk.

IdrisAbdul-Hussein commented 2 years ago

> I've heard I should try to increase the volume size, but it's not clear if this will delete all data on the disk.

Hi @twhetzel, increasing the volume size should not delete the data. However, if you increase the EBS volume size, you will need to access the instance and run a few commands to extend the file system to use the extra capacity. See the links below for Linux and Windows instances:

Windows: https://aws.amazon.com/premiumsupport/knowledge-center/expand-ebs-root-volume-windows/ or https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/recognize-expanded-volume-windows.html

Linux: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
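To make the Linux steps concrete, here is a rough sketch. The device and partition names are examples (on Nitro instances the root device is usually /dev/nvme0n1), and `growpart` comes from the cloud-utils package:

```shell
# Grow the partition and filesystem after enlarging the EBS volume.
# /dev/xvda + partition 1 are examples -- confirm with lsblk first.
lsblk                        # see which device/partition backs the root fs
sudo growpart /dev/xvda 1    # grow partition 1 to fill the enlarged volume
findmnt -n -o FSTYPE /       # typically xfs or ext4
sudo xfs_growfs /            # XFS: grow the mounted root filesystem
# sudo resize2fs /dev/xvda1  # ext4 equivalent
```

The catch for this issue: the filesystem-extension step itself needs shell access, so on an instance that is already unreachable, the detach-and-mount approach described later in this thread may be the only option.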

asyschikov commented 2 years ago

This is a very important issue. I had an EC2 instance that I used SSM to connect to. It had no SSH keys and was located in a private subnet. It ran out of space and got essentially "bricked" because SSM stopped working. SSM is a critical connectivity tool, and an instance becoming inaccessible for no good reason is a huge risk.

pplu commented 1 year ago

It is an important operational requirement to log in to an instance with its root partition full. If SSM Session Manager cannot handle this, SSH, which has no problem whatsoever under these conditions, is still needed as a backup method (with all that it implies).

@nitikagoyal87 Is there at least any type of workaround that we can apply to make SSM session manager work when a root volume is full? Given the time this has been open, is this being worked on?

wstewartlyra commented 1 year ago

I'm surprised to see this issue unaddressed. In the EKS best practice docs it is suggested as a best practice to disable SSH and use SSM instead.

Losing all access to a host in a case like this can be extremely painful.

mhare-bokf commented 1 year ago

This is an issue that needs to be addressed.

jethrocarr commented 1 year ago

Was pretty stunned to find out about this bug and how long it's been open for. SSM is a great tool and can replace SSH for us almost completely... except for this one critical issue blocking it.

If some pet has failed and run out of space in a weird way, the last thing I want to spend time doing is mounting the disk on another machine and expanding it just to get enough working space that I can boot and SSM into the host to figure out what is actually going wrong.

dpwrussell commented 9 months ago

@nitikagoyal87 Was there ever an output of your initial investigation of this?

Thanks

gopher55 commented 4 months ago

The best solution I found for Linux boxes was to:

  1. Stop the "full root" instance.
  2. Copy the disk information (disk ID) for its root volume from the AWS console.
  3. Let's get real careful now.
  4. Detach the root volume from the "full root" instance.
  5. Create a temporary t2.micro (free tier eligible) instance of the same OS.
  6. Once it's up, attach to it the root volume you detached in step 4.
  7. Log in and become root on the new t2.micro instance.
  8. mkdir /tmp/mnt (suffer through the "it's already there" if presented).
  9. Run fdisk -l to determine the device name of the attached "full root" disk.
  10. Here's the trick!! Mount the "full root" disk onto the t2.micro with nosuid: mount -o nosuid,rw (the disk id) /tmp/mnt
  11. Clean up the likely space hogs in /tmp/mnt/var/log/ and /tmp/mnt/tmp (core dumps and other unexpected things).
  12. When you have (hopefully) found all the wads: umount /tmp/mnt
  13. In the AWS console, detach the "full root" disk from the t2.micro and re-attach it to the original instance.
  14. Start the original "full root" instance (it should come up and allow you to log in again).
  15. Dump the t2.micro.

HTH.
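The mount-and-clean steps above, condensed into commands as run on the rescue instance. The device name /dev/xvdf1 is an assumption; take the real one from the fdisk -l / lsblk output, and treat the cleanup command as an example only:

```shell
# On the temporary rescue instance, with the "full root" volume attached
# as a secondary disk (/dev/xvdf1 is an example device name).
sudo mkdir -p /tmp/mnt
sudo fdisk -l                                    # locate the attached disk
sudo mount -o nosuid,rw /dev/xvdf1 /tmp/mnt      # nosuid: don't honor suid bits
sudo du -xsh /tmp/mnt/var/log /tmp/mnt/tmp       # size up the usual suspects
sudo find /tmp/mnt/var/log -name '*.gz' -delete  # e.g. drop rotated logs
sudo umount /tmp/mnt
```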

dmitry-livchak-qco commented 3 months ago

This is an absolutely critical bug. If SSM absolutely must use a disk, we should be able to set up a separate partition to keep it working even if the rest of the system isn't. If SSM can't be relied on as a critical investigation tool, server admins will have to rely on SSH. This increases complexity, security risks and goes against AWS best practices for EKS, to say the least. @VishnuKarthikRavindran as you have been contributing the most recently, is there a chance you could raise this issue with the product team to give it a priority?
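The separate-partition idea could look something like this on Linux. This is an untested sketch, not an AWS-supported configuration: the device names are placeholders, /var/log/amazon and /var/lib/amazon are where the agent writes its logs and state on Linux, and whether this actually keeps sessions alive with a full root filesystem is exactly what this issue would need AWS to confirm.

```
# /etc/fstab -- dedicate small volumes to the agent's writable paths so
# a full root filesystem cannot starve them (untested sketch; the
# /dev/xvdh* device names are placeholders).
/dev/xvdh1  /var/log/amazon  ext4  defaults,nofail  0  2
/dev/xvdh2  /var/lib/amazon  ext4  defaults,nofail  0  2
```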

lobeck commented 1 month ago

As the previous comments already stated, it's critical that SSM keeps working even if the disk is full. sshd has always had this capability, and it's especially in those situations that you need to rely on access.

Anything else, like SSH keys, is nowadays outdated and insecure, but seemingly SSM does not yet have the maturity to replace it properly.