aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

"FailureMessage": "Instance i-XXX failed to provision with the following error: \"Lifecycle scripts did not run successfully." #188

Closed cfregly closed 4 months ago

cfregly commented 4 months ago

"FailureMessage": "Instance i-XXX failed to provision with the following error: \"Lifecycle scripts did not run successfully. Ensure the scripts exist in provided S3 path, are accessible, and run without errors. Please see CloudWatch logs for lifecycle script execution details.\" Note that multiple instances may be impacted."

cfregly commented 4 months ago

Check the CloudWatch logs first. This might be an issue with the lifecycle configs, but it could also be a capacity issue, a quota issue, the wrong AZ, the wrong subnet, or something similar.
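For anyone who prefers to pull those logs from a script rather than the console, here is a minimal sketch. It is not an official tool: the helper names `iter_lifecycle_log_events` and `format_event` are made up for illustration, and the `LifecycleConfig/` stream prefix is an assumption about how HyperPod names its lifecycle script log streams, so verify it against your own log group. The only AWS call used is the CloudWatch Logs `filter_log_events` operation, invoked on whatever client object you pass in.

```python
from typing import Any, Dict, Iterator, Optional


def iter_lifecycle_log_events(
    logs_client: Any,
    log_group: str,
    stream_prefix: str = "LifecycleConfig/",  # assumed prefix, check your log group
    max_events: int = 200,
) -> Iterator[Dict[str, Any]]:
    """Yield up to max_events log events from streams matching stream_prefix."""
    yielded = 0
    next_token: Optional[str] = None
    while yielded < max_events:
        kwargs: Dict[str, Any] = {
            "logGroupName": log_group,
            "logStreamNamePrefix": stream_prefix,
        }
        if next_token:
            kwargs["nextToken"] = next_token
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            yield event
            yielded += 1
            if yielded >= max_events:
                return
        # Follow pagination until CloudWatch stops returning a token.
        next_token = page.get("nextToken")
        if not next_token:
            return


def format_event(event: Dict[str, Any]) -> str:
    """Render one event as '<stream name>: <message>' for quick scanning."""
    stream = event.get("logStreamName", "?")
    message = event.get("message", "").rstrip()
    return f"{stream}: {message}"
```

To use it, create the client with `boto3.client("logs")` and pass the log group of your cluster. The group name is assumed here to look like `/aws/sagemaker/Clusters/<cluster-name>/<cluster-id>`; confirm the exact name in the CloudWatch console. Then print `format_event(e)` for each yielded event and look for the first script error.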

If the CloudWatch logs do not provide enough information, please notify SageMaker HyperPod support and we'll help you debug the issue further.