aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

AWS-StartSSHSession broken after NAT instance change #266

Open mskrajnowski opened 4 years ago

mskrajnowski commented 4 years ago

When a NAT instance is replaced, other instances behind that NAT instance become unreachable via SSH over aws ssm start-session, and via Session Manager in general.

The SSM agent is still working and responds to aws ssm send-command, but any open sessions are interrupted and new ones either freeze while trying to connect or throw:

An error occurred (TargetNotConnected) when calling the StartSession operation: i-***** is not connected.

Restarting the SSM agent using aws ssm send-command seems to resolve the issue.
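For reference, a minimal sketch of that workaround. It assumes a systemd-based Linux instance where the service is named amazon-ssm-agent; the function name and the instance ID passed to it are illustrative, not taken from the gist.

```shell
# Sketch: restart the SSM agent on one instance through Run Command.
# Assumption: the instance runs systemd and the unit is amazon-ssm-agent.
restart_ssm_agent() {
  local instance_id="$1"
  # Prints the command ID so the invocation can be inspected afterwards.
  aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --comment "Restart SSM agent after NAT replacement" \
    --parameters 'commands=["sudo systemctl restart amazon-ssm-agent"]' \
    --query "Command.CommandId" \
    --output text
}
```

Usage would look like `restart_ssm_agent i-0123456789abcdef0`, where the instance ID is a placeholder.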

Expected behavior

SSM agent should be able to recover from brief network outages like the one that happens when a NAT instance is replaced.

If it's impossible to maintain current sessions, it should at least be possible to create a new ssh session without having to restart the SSM agent.

Steps to reproduce

  1. Install necessary tools

    • terraform (tested on version 0.12.24)
    • aws cli
    • aws cli session manager plugin
  2. Download example gist with a terraform configuration that creates:

    • VPC with private and public subnets
    • NAT EC2 instance in the public subnet
    • Route table for private subnet routing outgoing traffic through the NAT instance
    • EC2 instance in the private subnet
  3. Create the infrastructure

    terraform init
    terraform apply
  4. SSH into the instance to test that SSM agent works properly

    bash ssh-to-instance.sh
  5. Replace the NAT instance with a new one

    terraform taint aws_instance.nat
    terraform apply
  6. Try to SSH into the instance (should fail with an error or freeze)

    bash ssh-to-instance.sh
  7. Restart the SSM agent

    bash restart-ssh-agent.sh
  8. SSH into the instance (should work fine after agent restart)

    bash ssh-to-instance.sh
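
The contents of ssh-to-instance.sh are not shown here. As a rough sketch of what SSH over Session Manager typically looks like, assuming the session manager plugin is installed and the login user is ec2-user (both assumptions; the function name is hypothetical):

```shell
# Sketch: open SSH through Session Manager using the AWS-StartSSHSession
# document. ssh expands %h to the host (here the instance ID) and %p to the port.
ssh_via_ssm() {
  local instance_id="$1"
  local user="${2:-ec2-user}"   # assumed login user; depends on the AMI
  ssh \
    -o ProxyCommand="aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p" \
    "${user}@${instance_id}"
}
```

Usage would look like `ssh_via_ssm i-0123456789abcdef0`, with a placeholder instance ID. After the NAT instance is replaced, this is the call that freezes or fails with TargetNotConnected.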

nitikaaws commented 3 years ago

Thanks for reaching out to us! Could you please provide the SSM Agent logs covering the period when the NAT instance was replaced and a new start-session request was initiated? The documentation below explains how to retrieve SSM Agent logs: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-agent-logs.html
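
One possible way to gather those logs on a Linux instance is sketched below. It relies on /var/log/amazon/ssm being the agent's default log directory on Linux; the function name and archive name are illustrative assumptions.

```shell
# Sketch: bundle the SSM Agent logs on a Linux instance into one archive.
# Assumption: logs live in the default directory /var/log/amazon/ssm.
collect_ssm_logs() {
  local log_dir="${1:-/var/log/amazon/ssm}"
  local archive="${2:-ssm-agent-logs.tar.gz}"
  # -C keeps paths inside the archive relative to the log directory.
  tar -czf "$archive" -C "$log_dir" amazon-ssm-agent.log errors.log
}
```

Running `collect_ssm_logs` on the instance may require root privileges to read the log directory.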