aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

One of the cluster GPUs is failing, it seems. #201

Closed: cfregly closed this 4 months ago

cfregly commented 4 months ago

One of the cluster GPUs is failing, it seems.

cfregly commented 4 months ago

HyperPod was designed for resiliency against GPU failures, which are not abnormal in large GPU clusters. Typically, HyperPod will replace the failed instance in a matter of minutes. In the meantime, you can check whether any of the GPUs are reporting ECC errors as follows:

1/ Run the NVIDIA bug report script (/usr/bin/nvidia-bug-report.sh) on the node where the error occurred; it writes a compressed report (nvidia-bug-report.log.gz) to the current directory.

2/ Search the report for "ECC Errors"; there should be 8 sections, one per GPU, listing any ECC errors.

3/ If you identify a specific GPU or GPUs with ECC errors, note the GPU UUID for record keeping and try to reset that GPU from the node with sudo nvidia-smi --gpu-reset -i <GPU index> (or sudo nvidia-smi --gpu-reset for all GPUs). If any processes are still running on the GPU, nvidia-smi may error out. See the sketch below for the full sequence.
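Putting the three steps together, here is a minimal sketch. It assumes an 8-GPU node, the default nvidia-bug-report.log.gz output name, and GPU index 3 as a hypothetical failing GPU; adjust the index to match your node.

```bash
# 1/ Generate the bug report on the affected node (writes nvidia-bug-report.log.gz
#    to the current directory).
sudo /usr/bin/nvidia-bug-report.sh

# 2/ The report is gzip-compressed; search it for the per-GPU "ECC Errors" sections.
zgrep -n "ECC Errors" nvidia-bug-report.log.gz

#    Alternatively, query ECC counters and UUIDs directly for record keeping.
nvidia-smi --query-gpu=index,uuid,ecc.errors.uncorrected.volatile.total --format=csv

# 3/ Attempt to reset the affected GPU (index 3 is hypothetical); this errors out
#    if processes are still using the GPU.
sudo nvidia-smi --gpu-reset -i 3
```

If the reset fails because processes are still attached, stop the job on that node first and retry; if the errors persist, HyperPod should replace the instance on its own.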