Open sean-smith opened 7 months ago
Hi Sean,
Thank you for the detailed message on this. I am in a similar situation, but when I access my compute node, I get Permission denied (publickey). Can we replace the controller machine?
@m-ali4721 you can't replace the headnode but you can add a login node that'll act as a jump box to the headnode. See instructions on how to do that here: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/05-advanced/07-login-node
Also for the command to access the compute node:
aws ssm start-session \
--target sagemaker-cluster:<cluster-id>_worker-group-1-<instance-id>
You shouldn't need an SSH key pair; this uses SSM in lieu of SSH. Are you getting the issue here, or on the compute node when trying to connect to the headnode?
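If it helps, here is a minimal sketch of a helper that assembles the `--target` string from its three parts. The function name `hyperpod_ssm_target` and the example values are made up for illustration; the format itself is just the one shown in the command above, with the worker group name generalized to an argument.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the SSM target for a HyperPod node.
# Format (from the command above):
#   sagemaker-cluster:<cluster-id>_<worker-group-name>-<instance-id>
hyperpod_ssm_target() {
  # $1 = cluster-id, $2 = worker group name, $3 = instance-id
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}

# Example usage (placeholder values, not real resources):
# aws ssm start-session \
#   --target "$(hyperpod_ssm_target abc123 worker-group-1 i-0123456789abcdef0)"
```

Keeping the string assembly in one place makes it easier to spot a missing `_` or `-` separator in the target.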
Yes, I can successfully access the compute node, but when I run "ssh privateIP" against the controller machine from the compute node, I receive the error: Permission denied (publickey)
If you're trying to connect to your SageMaker HyperPod cluster and you see the error "An error occurred (TargetNotConnected)", there are a couple of common causes:
To troubleshoot, do a few things:
If those don't work, try to SSM into a compute node. You'll need the `cluster-id`, `worker-group` name, and `instance-id`, which you can get from the `aws sagemaker list-cluster-nodes --cluster-name <cluster-name>` CLI call. Once you're there, you can get the IP address of the controller node by running:
That'll show:
Use the CustomerIpAddress `10.1.39.83` to SSH into the headnode from that compute node:
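For reference, the lookup of the `cluster-id`, `worker-group` name, and `instance-id` described above could be sketched roughly as follows. This is an illustration under assumptions, not the exact commands from this issue. The helper names `cluster_id_from_arn` and `list_cluster_nodes` are hypothetical, `my-cluster` is a placeholder, and the sketch assumes the cluster-id is the last path segment of the `ClusterArn` returned by `aws sagemaker describe-cluster`.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. The commented-out calls at
# the bottom need the AWS CLI and the Session Manager plugin.

# Extract the cluster-id, assumed to be the last path segment of the ARN,
# e.g. arn:aws:sagemaker:<region>:<account>:cluster/<cluster-id>
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Print one "<instance-group-name> <instance-id>" pair per node.
list_cluster_nodes() {
  aws sagemaker list-cluster-nodes --cluster-name "$1" \
    --query 'ClusterNodeSummaries[].[InstanceGroupName,InstanceId]' \
    --output text
}

# Example usage (replace my-cluster with your cluster name):
# arn=$(aws sagemaker describe-cluster --cluster-name my-cluster \
#         --query ClusterArn --output text)
# cluster_id=$(cluster_id_from_arn "$arn")
# list_cluster_nodes my-cluster
# aws ssm start-session \
#   --target "sagemaker-cluster:${cluster_id}_<worker-group>-<instance-id>"
```

Once the session to the compute node is open, the controller IP from the step above is what you would pass to `ssh`.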