aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

Cluster creation fails, but CloudWatch logs are empty for HyperPod #189

Closed: cfregly closed this issue 4 months ago

cfregly commented 4 months ago

HyperPod cluster creation fails, and the failure message tells us to look in CloudWatch for error details, but CloudWatch does not display any logs.

cfregly commented 4 months ago

In rare cases, logs are not pushed to CloudWatch when cluster creation fails. This can happen when a bad GPU or another health-check error occurs during cluster creation.

Retry the cluster creation and verify that the logs appear in CloudWatch, as described here: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/03-troubleshooting#logs
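
To check from a script whether any logs reached CloudWatch after a retry, a minimal sketch along these lines may help. It assumes the log group is named `/aws/sagemaker/Clusters/<cluster-name>/<cluster-id>` and that the cluster ID is the last segment of the cluster ARN; please verify both against your own account. The helper names (`cluster_id_from_arn`, `hyperpod_log_group`, `list_log_streams`, `diagnose_cluster`) are hypothetical, and the clients are passed in as arguments so the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def cluster_id_from_arn(cluster_arn: str) -> str:
    # Assumption: the ARN ends in "cluster/<cluster-id>".
    return cluster_arn.rsplit("/", 1)[-1]


def hyperpod_log_group(cluster_name: str, cluster_id: str) -> str:
    # Assumption: HyperPod log groups follow this naming convention.
    return f"/aws/sagemaker/Clusters/{cluster_name}/{cluster_id}"


def list_log_streams(logs_client: Any, log_group: str) -> List[str]:
    """Return all stream names in log_group, or [] if the group does not exist."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {"logGroupName": log_group}
    try:
        while True:
            page = logs_client.describe_log_streams(**kwargs)
            names.extend(s["logStreamName"] for s in page.get("logStreams", []))
            token = page.get("nextToken")
            if not token:
                return names
            kwargs["nextToken"] = token  # fetch the next page
    except logs_client.exceptions.ResourceNotFoundException:
        # The log group was never created, so nothing was pushed.
        return []


def diagnose_cluster(sagemaker_client: Any, logs_client: Any, cluster_name: str) -> Dict[str, Any]:
    """Collect the cluster status, failure message, and any log streams found."""
    desc = sagemaker_client.describe_cluster(ClusterName=cluster_name)
    group = hyperpod_log_group(desc["ClusterName"], cluster_id_from_arn(desc["ClusterArn"]))
    return {
        "status": desc.get("ClusterStatus"),
        "failure_message": desc.get("FailureMessage"),
        "log_group": group,
        "log_streams": list_log_streams(logs_client, group),
    }
```

With boto3 installed, this could be called as `diagnose_cluster(boto3.client("sagemaker"), boto3.client("logs"), "my-cluster")`, where `my-cluster` is a placeholder name. An empty `log_streams` list alongside a non-empty `failure_message` would match the situation described in this issue.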

Contact SageMaker HyperPod support if cluster creation still fails.