aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

How do I diagnose a bad node in HyperPod? #206

Closed by cfregly 4 months ago

cfregly commented 4 months ago

HyperPod is designed for resiliency and will automatically replace a bad node in your cluster.

If you want to manually diagnose a bad node in your HyperPod cluster, follow these steps: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/09-diagnose-bad-node
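Before working through the linked steps, it can help to see which nodes the cluster itself reports as unhealthy. The following is a minimal sketch, not taken from the workshop: it assumes boto3 is installed with credentials configured, and that the SageMaker `list_cluster_nodes` response contains `ClusterNodeSummaries` entries with `InstanceId`, `InstanceGroupName`, and `InstanceStatus` (`Status`, `Message`) fields. The helper names `unhealthy_nodes` and `fetch_node_summaries` are hypothetical.

```python
from typing import Any, Dict, List


def unhealthy_nodes(summaries: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return the nodes whose reported status is anything other than 'Running'.

    `summaries` is assumed to have the shape of the 'ClusterNodeSummaries'
    list returned by the SageMaker list_cluster_nodes call.
    """
    flagged = []
    for node in summaries:
        status = node.get("InstanceStatus", {})
        if status.get("Status") != "Running":
            flagged.append({
                "InstanceId": node.get("InstanceId", "unknown"),
                "InstanceGroupName": node.get("InstanceGroupName", "unknown"),
                "Status": status.get("Status", "unknown"),
                "Message": status.get("Message", ""),
            })
    return flagged


def fetch_node_summaries(cluster_name: str) -> List[Dict[str, Any]]:
    """Collect every node summary for a cluster, following NextToken pagination."""
    import boto3  # imported lazily so the filter above works without AWS access

    client = boto3.client("sagemaker")
    summaries: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"ClusterName": cluster_name}
    while True:
        page = client.list_cluster_nodes(**kwargs)
        summaries.extend(page.get("ClusterNodeSummaries", []))
        token = page.get("NextToken")
        if not token:
            return summaries
        kwargs["NextToken"] = token
```

Calling `unhealthy_nodes(fetch_node_summaries("my-cluster"))`, where `my-cluster` is a placeholder cluster name, would list candidate nodes to inspect further. A node can still report `Running` while misbehaving, so this only narrows the search and does not replace the manual diagnosis described in the workshop.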