Closed by cfregly 6 months ago
Run the following to check the status of the cluster:
aws sagemaker describe-cluster --cluster-name <cluster-name>
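The full describe-cluster response is fairly long, so it can help to print only the fields that matter here. The following is a minimal sketch, assuming the response carries top-level `ClusterStatus` and `FailureMessage` fields (the latter matches the error shown in this issue); the helper name `cluster_status` is hypothetical.

```shell
# Print only the status and failure message of a HyperPod cluster.
# Usage: cluster_status <cluster-name>
# Note: `cluster_status` is an illustrative helper, not part of the AWS CLI.
cluster_status() {
  aws sagemaker describe-cluster \
    --cluster-name "$1" \
    --query '{Status: ClusterStatus, Failure: FailureMessage}' \
    --output json
}
```

Call it as `cluster_status <cluster-name>`. If the cluster has no recorded failure, the `Failure` value should come back as null.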
If you see an error like "FailureMessage": "We currently do not have sufficient capacity for the requested instance type(s). Please try again.",
double-check that your instance type is offered in the Availability Zone (AZ) you are deploying to:
aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=p5.48xlarge --region us-west-2 --output table
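To compare the result against the AZ your cluster actually targets, something like the sketch below may help. It assumes you know the ID of the subnet your cluster uses; the helper names `azs_offering` and `subnet_az` are hypothetical. Keep in mind that an offering only means the instance type is sold in that AZ, not that capacity is free at the moment.

```shell
# List the AZs in a region that offer a given instance type.
# Usage: azs_offering <instance-type> <region>
azs_offering() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --region "$2" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Print the AZ of a subnet, to compare against the list above.
# Usage: subnet_az <subnet-id> <region>
subnet_az() {
  aws ec2 describe-subnets \
    --subnet-ids "$1" \
    --region "$2" \
    --query 'Subnets[].AvailabilityZone' \
    --output text
}
```

For example, run `azs_offering p5.48xlarge us-west-2` and then `subnet_az <subnet-id> us-west-2`. If the subnet's AZ is not in the first list, the cluster cannot get that instance type in that subnet.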
Then check the CloudWatch logs for any lifecycle script failures during cluster creation.
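The logs can also be pulled from the CLI. This is a rough sketch that rests on two assumptions you should verify in your account: that HyperPod writes to a log group whose name starts with `/aws/sagemaker/Clusters/<cluster-name>`, and that lifecycle script output goes to log streams prefixed with `LifecycleConfig/`. The helper names are hypothetical.

```shell
# Find log groups for a cluster.
# Assumption: group names start with /aws/sagemaker/Clusters/<cluster-name>.
# Usage: find_cluster_log_groups <cluster-name>
find_cluster_log_groups() {
  aws logs describe-log-groups \
    --log-group-name-prefix "/aws/sagemaker/Clusters/$1" \
    --query 'logGroups[].logGroupName' \
    --output text
}

# Print lifecycle script log messages from one log group.
# Assumption: lifecycle streams are prefixed with "LifecycleConfig/".
# Usage: lifecycle_logs <log-group-name>
lifecycle_logs() {
  aws logs filter-log-events \
    --log-group-name "$1" \
    --log-stream-name-prefix "LifecycleConfig/" \
    --query 'events[].message' \
    --output text
}
```

Run `find_cluster_log_groups <cluster-name>` first, then pass one of the returned names to `lifecycle_logs`. If nothing comes back, the naming assumptions above may not hold for your setup, and the CloudWatch console is the safer place to look.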
HyperPod cluster fails to create - ROLLBACK