aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

One of the cluster GPUs is failing, it seems. #201

Closed: cfregly closed this 4 months ago

cfregly commented 4 months ago

One of the cluster GPUs is failing, it seems.

cfregly commented 4 months ago

HyperPod was designed for resiliency against GPU failures, which are not abnormal in large GPU clusters. Typically, HyperPod will replace the failed instance in a matter of minutes. In the meantime, you can check whether any of the GPUs are reporting ECC errors as follows:

1/ Run the NVIDIA bug report script (/usr/bin/nvidia-bug-report.sh) on the node where the error occurred; it writes a compressed report (nvidia-bug-report.log.gz) to the current directory.

2/ Search the report for "ECC Errors"; there should be 8 sections, one per GPU, listing any ECC errors.

3/ If you identify a specific GPU or GPUs with ECC errors, note the GPU UUID for record keeping and try to reset that GPU from the node with sudo nvidia-smi --gpu-reset -i <GPU index> (or sudo nvidia-smi --gpu-reset for all GPUs). If any processes are still running on the GPU, nvidia-smi may error out. See the sketch below for the full sequence.
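Putting the three steps together, here is a minimal sketch. It assumes an 8-GPU node, the default nvidia-bug-report.log.gz output name, and GPU index 3 as a hypothetical failing GPU; adjust the index to match your node.

```bash
# 1/ Generate the bug report on the affected node (writes nvidia-bug-report.log.gz
#    to the current directory).
sudo /usr/bin/nvidia-bug-report.sh

# 2/ The report is gzip-compressed; search it for the per-GPU "ECC Errors" sections.
zgrep -n "ECC Errors" nvidia-bug-report.log.gz

#    Alternatively, query ECC counters and UUIDs directly for record keeping.
nvidia-smi --query-gpu=index,uuid,ecc.errors.uncorrected.volatile.total --format=csv

# 3/ Attempt to reset the affected GPU (index 3 is hypothetical); this errors out
#    if processes are still using the GPU.
sudo nvidia-smi --gpu-reset -i 3
```

If the reset fails because processes are still attached, stop the job on that node first and retry; if the errors persist, HyperPod should replace the instance on its own.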