aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

SageMaker Hyperpod "Target not connected" #280

Open sean-smith opened 7 months ago

sean-smith commented 7 months ago

If you're trying to connect to your SageMaker Hyperpod cluster and you see the error "An error occurred (TargetNotConnected)", there are a couple of common causes. The full error looks like this:

An error occurred (TargetNotConnected) when calling the StartSession operation: sagemaker-cluster:..._controller-machine-i-... is not connected.
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

To troubleshoot, check a few things:

  1. Check that your AWS credentials are configured for the right account:
    aws sts get-caller-identity --query Account --output text
  2. Check that the region is correct:
    aws configure get region
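
The two checks above can be wrapped in a small script that compares the active account and region with the ones you expect. This is only a sketch: EXPECTED_ACCOUNT and EXPECTED_REGION are hypothetical variables you would set yourself, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the active AWS account and region with the expected ones.
# EXPECTED_ACCOUNT and EXPECTED_REGION are hypothetical variables set by the user.

# Print OK or MISMATCH for one value; return non-zero on a mismatch.
check_match() {
    local label="$1" expected="$2" actual="$3"
    if [ "$expected" = "$actual" ]; then
        echo "OK: $label is $actual"
    else
        echo "MISMATCH: $label is '$actual', expected '$expected'"
        return 1
    fi
}

# Run both checks with the same CLI calls used in the steps above.
check_aws_context() {
    local account region
    account=$(aws sts get-caller-identity --query Account --output text) || return 1
    region=$(aws configure get region)
    check_match "account" "$EXPECTED_ACCOUNT" "$account" || return 1
    check_match "region" "$EXPECTED_REGION" "$region"
}
```

After sourcing the file, export EXPECTED_ACCOUNT and EXPECTED_REGION and run check_aws_context. A mismatch on either value would explain why the cluster target can't be found.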

If those checks don't resolve it, try to SSM into a compute node instead. You'll need the cluster-id, worker-group name and instance-id, which you can get from the aws sagemaker list-cluster-nodes --cluster-name <cluster-name> CLI call.

aws ssm start-session \
    --target sagemaker-cluster:<cluster-id>_worker-group-1-<instance-id>
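
If you'd rather not assemble the target by hand, something like the following might help. It is a sketch under assumptions not stated above: that the cluster-id is the last path segment of the ClusterArn returned by aws sagemaker describe-cluster, and that picking the first node of the group is acceptable. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: assemble the SSM target for a Hyperpod node from its three parts.

# Pure string helper: sagemaker-cluster:<cluster-id>_<group-name>-<instance-id>
build_ssm_target() {
    local cluster_id="$1" group="$2" instance_id="$3"
    echo "sagemaker-cluster:${cluster_id}_${group}-${instance_id}"
}

# Assumption: the cluster-id is the last path segment of the cluster ARN.
cluster_id_from_arn() {
    echo "${1##*/}"
}

# Look up the first node of an instance group and open an SSM session to it.
ssm_to_first_node() {
    local cluster_name="$1" group="$2" arn instance_id
    arn=$(aws sagemaker describe-cluster --cluster-name "$cluster_name" \
        --query ClusterArn --output text) || return 1
    instance_id=$(aws sagemaker list-cluster-nodes --cluster-name "$cluster_name" \
        --query "ClusterNodeSummaries[?InstanceGroupName=='${group}'] | [0].InstanceId" \
        --output text) || return 1
    aws ssm start-session \
        --target "$(build_ssm_target "$(cluster_id_from_arn "$arn")" "$group" "$instance_id")"
}
```

For example, ssm_to_first_node <cluster-name> worker-group-1 would open a session on the first node listed in worker-group-1.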

Once you're there, you can get the IP address of the controller node by running:

sudo cat /opt/ml/config/resource_config.json | jq . | grep -C 5 controller-machine

That'll show:

      "Name": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "Instances": [
        {
          "InstanceName": "controller-machine-1",
          "AgentIpAddress": "172.16.90.220",
          "CustomerIpAddress": "10.1.39.83",
          "InstanceId": "i-0defeb24a1f5dfe85"
        }
      ]
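
Rather than reading the grep output by eye, the address can also be pulled out with a jq filter. The sketch below searches the whole document for an object whose "Name" matches, since the excerpt above doesn't show the file's top-level layout. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the CustomerIpAddress of every instance in a named instance group.
# Reads the resource config JSON from stdin and searches recursively for the group,
# so it makes no assumption about the file's top-level keys.
group_customer_ips() {
    local group="$1"
    jq -r --arg group "$group" \
        '.. | objects | select(.Name? == $group) | .Instances[]?.CustomerIpAddress'
}
```

On the compute node this could be used as: sudo cat /opt/ml/config/resource_config.json | group_customer_ips controller-machine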

Use the CustomerIpAddress 10.1.39.83 to SSH into the headnode from that compute node:

ssh 10.1.39.83

m-ali4721 commented 4 months ago

Hi Sean,

Thank you for the detailed message on this. I am in a similar situation, but when I access my compute node, I get Permission denied (publickey). Can we replace the controller machine?

sean-smith commented 4 months ago

@m-ali4721 you can't replace the headnode but you can add a login node that'll act as a jump box to the headnode. See instructions on how to do that here: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/05-advanced/07-login-node

Also for the command to access the compute node:

aws ssm start-session \
    --target sagemaker-cluster:<cluster-id>_worker-group-1-<instance-id>

You shouldn't need an SSH keypair; this uses SSM in lieu of SSH. Are you getting the issue here, or on the compute node when trying to connect to the headnode?

m-ali4721 commented 4 months ago

Yes, I can successfully access the compute node, but when I run "ssh privateIP" for the controller machine from the compute node, I receive the error: Permission denied (publickey)

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.