Open mskrajnowski opened 4 years ago
Thanks for reaching out to us! Could you please provide SSM Agent logs covering the period when the NAT instance was replaced and a new start-session request was initiated? The documentation below explains how to retrieve SSM Agent logs: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-agent-logs.html
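Since the agent reportedly still answers `aws ssm send-command` while sessions are broken, one way to collect the requested logs without a working session might look like the sketch below. It assumes a Linux instance with the agent logs in their default location under `/var/log/amazon/ssm/`; the function name `fetch_ssm_agent_logs`, the `DRY_RUN` switch, and the instance ID are hypothetical and only there for illustration.

```shell
# Sketch only: fetch_ssm_agent_logs and DRY_RUN are hypothetical names.
# The log paths are the agent's default locations on Linux (assumption
# about the instance's OS and install method).
fetch_ssm_agent_logs() {
  instance_id="$1"   # instance behind the NAT, e.g. i-0123456789abcdef0
  # With DRY_RUN set to a non-empty value the command is printed, not run.
  ${DRY_RUN:+echo} aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --parameters '{"commands":["tail -n 200 /var/log/amazon/ssm/amazon-ssm-agent.log","tail -n 200 /var/log/amazon/ssm/errors.log"]}' \
    --query "Command.CommandId" \
    --output text
}
```

When run for real, the function prints a command ID; the captured log tail can then be read with `aws ssm get-command-invocation --command-id <id> --instance-id <instance-id>`.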
When a NAT instance is replaced, other instances behind that NAT instance become unreachable using SSH over `aws ssm start-session` and Session Manager in general.
The SSM agent is still working and responds to `aws ssm send-command`, but any open sessions are interrupted and new ones either freeze while trying to connect or throw an error.
Restarting the SSM agent using `aws ssm send-command` seems to resolve the issue.
Expected behavior
SSM agent should be able to recover from brief network outages like the one that happens when a NAT instance is replaced.
If it's impossible to maintain current sessions, it should at least be possible to create a new SSH session without having to restart the SSM agent.
Steps to reproduce
Install necessary tools
Download example gist with a terraform configuration that creates:
Create the infrastructure
SSH into the instance to test that SSM agent works properly
Replace the NAT instance with a new one
Try to SSH into the instance (should fail with an error or freeze)
Restart the SSM agent
SSH into the instance (should work fine after agent restart)
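The last two steps (restarting the agent, then connecting again) could be sketched roughly as follows. This is not the exact script from the gist: it assumes a systemd-based Linux instance where the agent runs as the `amazon-ssm-agent` service (snap-based installs use a different restart command), and that SSH is tunnelled through the `AWS-StartSSHSession` document. The function names, the `DRY_RUN` switch, the `ec2-user` login, and the instance ID are hypothetical.

```shell
# Sketch only: function names, DRY_RUN, and the default user are assumptions.

# Restart the SSM agent through Run Command, which still works while
# Session Manager connections are broken.
restart_ssm_agent() {
  instance_id="$1"
  # With DRY_RUN set to a non-empty value the command is printed, not run.
  ${DRY_RUN:+echo} aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --parameters '{"commands":["sudo systemctl restart amazon-ssm-agent"]}'
}

# Open an SSH connection tunnelled through Session Manager.
ssh_via_ssm() {
  instance_id="$1"
  user="${2:-ec2-user}"   # login user is an assumption; adjust for the AMI
  ${DRY_RUN:+echo} ssh \
    -o "ProxyCommand=aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p" \
    "$user@$instance_id"
}
```

For example, `restart_ssm_agent i-0123456789abcdef0` followed a few seconds later by `ssh_via_ssm i-0123456789abcdef0` mirrors the workaround described in this issue.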