aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

ssm agent uses "incorrect" network interface in SageMaker Notebook Instance #449

Closed: tbugfinder closed this issue 2 years ago

tbugfinder commented 2 years ago

Hello,

I'd like to use SSM capabilities on SageMaker Notebook Instances. Because SageMaker Notebook Instances are already service-managed instances, I configured a hybrid activation so that they are managed like hybrid (on-premises) nodes.

Registration finishes successfully, but the initial SSM startup fails. The SageMaker setup is restricted to VPC-only, and the SageMaker Notebook Instance has ~8 network interfaces, one of which is the proper ENI within the private VPC (which includes the VPC endpoints).

The initial connection seems to be made to a public SSM IP instead of the VPC endpoint, so I'm assuming the SSM agent doesn't use the proper network interface.

2022-05-26 20:44:12 INFO [ssm-agent-worker] Start to listen to Core Agent health channel
2022-05-26 20:44:12 INFO [ssm-agent-worker] [StartupProcessor] Executing startup processor tasks
2022-05-26 20:44:12 INFO [ssm-agent-worker] [StartupProcessor] Write to serial port: Amazon SSM Agent v3.0.1124.0 is running
2022-05-26 20:44:12 INFO [ssm-agent-worker] [StartupProcessor] Write to serial port: OsProductName: Amazon Linux
2022-05-26 20:44:12 INFO [ssm-agent-worker] [StartupProcessor] Write to serial port: OsVersion: 2
2022-05-26 20:46:21 INFO [ssm-agent-worker] Entering SSM Agent hibernate - error occurred in RequestManagedInstanceRoleToken:
RequestError: send request failed
caused by: Post "https://ssm.eu-west-1.amazonaws.com/": dial tcp 52.95.125.3:443: i/o timeout
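
For anyone debugging the same symptom: the log above shows the agent dialing 52.95.125.3, which is a public address. A minimal sketch, assuming Python 3 is available on the instance, for checking what the SSM hostname resolves to from a given context and whether each address is private (as a VPC endpoint ENI address would be). The helper names `resolve_ipv4` and `classify` and the sample address 10.0.1.25 are made up for illustration and are not part of the agent.

```python
import ipaddress
import socket


def resolve_ipv4(host, port=443):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def classify(addresses):
    """Label each address 'private' (not publicly routable, per ipaddress) or 'public'."""
    return {
        addr: ("private" if ipaddress.ip_address(addr).is_private else "public")
        for addr in addresses
    }
```

Calling `classify(resolve_ipv4("ssm.eu-west-1.amazonaws.com"))` once from the shell where the foreground agent works and once from the context the service runs in should show whether the two see different DNS answers. Offline, `classify(["52.95.125.3", "10.0.1.25"])` labels the address from the log as public and the hypothetical endpoint address as private.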

After stopping the SSM service and running amazon-ssm-agent in the foreground, it connects properly to the SSM endpoint and a session can be opened.

tbugfinder commented 2 years ago

I figured out that the systemd service configuration sets a specific IP (network) namespace on those instances. After removing this configuration, it works as expected.
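
One way to confirm this kind of mismatch on Linux is to compare the network namespace of the running agent process with that of PID 1 or of your own shell. The following is only a sketch: it relies on the standard `/proc/<pid>/ns/net` symlink, the helper names are made up, and reading another process's namespace link usually requires root.

```python
import os


def netns_id(pid, proc_root="/proc"):
    """Return the network namespace identifier of a process, e.g. 'net:[4026531840]'."""
    return os.readlink(os.path.join(proc_root, str(pid), "ns", "net"))


def same_netns(pid_a, pid_b, proc_root="/proc"):
    """True if both processes are in the same network namespace."""
    return netns_id(pid_a, proc_root) == netns_id(pid_b, proc_root)
```

For example, `same_netns(1, agent_pid)` returning `False`, where `agent_pid` is the PID of the amazon-ssm-agent process started by the service, would indicate that the service runs in a different namespace from the host, consistent with the behavior described above.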

dominikarnoldi commented 1 year ago

Hi @tbugfinder,

How did you fix it?