aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

HyperPod cluster fails to create #204

Closed cfregly closed 6 months ago

cfregly commented 6 months ago

HyperPod cluster fails to create - ROLLBACK

cfregly commented 6 months ago

Run the following to check the status of the cluster:

aws sagemaker describe-cluster --cluster-name <cluster-name>
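
To see only the status and the failure reason, the same call can be narrowed with `--query`. This is a minimal sketch: it assumes the `describe-cluster` response has `ClusterStatus` and `FailureMessage` fields, and the function name `hyperpod_status` is hypothetical.

```shell
# hyperpod_status is a hypothetical helper name; $1 is the cluster name.
# Prints the cluster status and the failure message (or "None") on one
# tab-separated line.
hyperpod_status() {
  aws sagemaker describe-cluster \
    --cluster-name "$1" \
    --query '[ClusterStatus, FailureMessage]' \
    --output text
}
```

Call it as `hyperpod_status <cluster-name>`. A status of `RollingBack` or `Failed` together with a non-empty message tells you why creation stopped.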

If the output contains an error such as "FailureMessage": "We currently do not have sufficient capacity for the requested instance type(s). Please try again.", confirm that the requested instance type is offered in the Availability Zone (AZ) you are deploying to:

aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=p5.48xlarge --region us-west-2 --output table
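
The AZ check can also be scripted against the subnet the cluster uses. The sketch below looks up the subnet's AZ and then asks whether the instance type is offered there. The function name `instance_type_in_subnet_az` and its arguments are hypothetical, and it assumes you know the subnet ID the cluster is configured with.

```shell
# instance_type_in_subnet_az is a hypothetical helper.
# Usage: instance_type_in_subnet_az <instance-type> <subnet-id> <region>
# Returns 0 if the type is offered in the subnet's AZ, 1 if not,
# and 2 if an AWS CLI call fails.
instance_type_in_subnet_az() {
  itype="$1"; subnet="$2"; region="$3"

  # Look up the Availability Zone the subnet lives in.
  az="$(aws ec2 describe-subnets --subnet-ids "$subnet" --region "$region" \
        --query 'Subnets[0].AvailabilityZone' --output text)" || return 2

  # Ask whether the instance type is offered in that specific AZ.
  offered="$(aws ec2 describe-instance-type-offerings \
        --location-type availability-zone \
        --filters "Name=instance-type,Values=$itype" "Name=location,Values=$az" \
        --region "$region" \
        --query 'InstanceTypeOfferings[].Location' --output text)" || return 2

  if [ "$offered" = "$az" ]; then
    echo "$itype is offered in $az"
  else
    echo "$itype is NOT offered in $az"
    return 1
  fi
}
```

For example: `instance_type_in_subnet_az p5.48xlarge <subnet-id> us-west-2`. Note that an offering only means the instance type exists in that AZ. It does not guarantee that capacity is available at the moment, so the capacity error can still occur in an AZ that passes this check.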

Next, check the cluster's CloudWatch logs for lifecycle script failures during cluster creation.
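
A rough sketch of pulling those logs from the CLI follows. It assumes HyperPod writes to a log group named `/aws/sagemaker/Clusters/<cluster-name>/<cluster-id>`, that the cluster ID is the last segment of the cluster ARN, and that lifecycle script streams start with `LifecycleConfig`. Verify these names in the CloudWatch console for your account. Both function names are hypothetical.

```shell
# hyperpod_log_group is a hypothetical helper.
# $1 = cluster name
# $2 = cluster ARN, e.g. arn:aws:sagemaker:<region>:<account>:cluster/<cluster-id>
# Assumption: the log group is /aws/sagemaker/Clusters/<name>/<cluster-id>.
hyperpod_log_group() {
  # ${2##*/} strips everything up to the last "/" and leaves the cluster ID.
  printf '/aws/sagemaker/Clusters/%s/%s\n' "$1" "${2##*/}"
}

# hyperpod_lifecycle_logs is a hypothetical helper; $1 is the cluster name.
# Prints log messages from streams whose names start with "LifecycleConfig"
# (assumed stream prefix for lifecycle scripts).
hyperpod_lifecycle_logs() {
  arn="$(aws sagemaker describe-cluster --cluster-name "$1" \
         --query 'ClusterArn' --output text)" || return 1
  aws logs filter-log-events \
    --log-group-name "$(hyperpod_log_group "$1" "$arn")" \
    --log-stream-name-prefix LifecycleConfig \
    --query 'events[].message' --output text
}
```

Run `hyperpod_lifecycle_logs <cluster-name>` and look for non-zero exit codes or stack traces from your lifecycle scripts. If the log group does not exist, the cluster may have failed before any instance was provisioned, which points back to the capacity check.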