cfregly closed this issue 4 months ago
HyperPod is designed for resiliency against GPU failures, which are not abnormal in large GPU clusters. Typically, HyperPod replaces the failed instance within minutes. In the meantime, you can check the GPUs for ECC errors as follows:
1/ run the NVIDIA bug report script on the node where the error occurred; the script is located at /usr/bin/nvidia-bug-report.sh and writes its output (nvidia-bug-report.log.gz) to the current directory.
2/ search the report for "ECC Errors"; there should be 8 sections, one per GPU, each reporting any ECC errors.
3/ if you identify a specific GPU (or GPUs) with ECC errors, note the GPU UUID for record keeping, then try to reset that GPU from the node with sudo nvidia-smi --gpu-reset -i <GPU index> (or sudo nvidia-smi --gpu-reset to reset all GPUs). Note that if any processes are still running on the GPU, nvidia-smi may error out.
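The steps above can be sketched as a small shell script. This is a hedged sketch, not an official HyperPod tool: the paths and flags come from standard NVIDIA driver installs, the GPU index 0 is a placeholder you must replace with the suspect GPU, and the script is guarded so it does nothing harmful on a host without the driver.

```shell
#!/bin/sh
# Sketch of the ECC triage steps above. Assumes a standard NVIDIA driver
# install (nvidia-bug-report.sh ships with the driver). Replace the GPU
# index placeholder before resetting anything.

if command -v nvidia-smi >/dev/null 2>&1; then
    # 1/ Generate the bug report; it writes nvidia-bug-report.log.gz
    #    into the current directory.
    sudo /usr/bin/nvidia-bug-report.sh

    # 2/ Search the compressed report; expect one "ECC Errors" section
    #    per GPU (8 on an 8-GPU node).
    zgrep -n "ECC Errors" nvidia-bug-report.log.gz

    # 3/ List index, UUID, and uncorrected volatile ECC error counts,
    #    then try a reset of the suspect GPU.
    nvidia-smi --query-gpu=index,uuid,ecc.errors.uncorrected.volatile.total --format=csv
    sudo nvidia-smi --gpu-reset -i 0    # placeholder: use the suspect GPU index
else
    # Guard path: nothing GPU-related runs on a driver-less host.
    echo "nvidia-smi not found; run this on the affected HyperPod node"
fi
```

The query step is optional but useful: recording the UUID alongside the error counts gives you the record-keeping data from step 3 in one place before you attempt the reset.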
It appears that one of the cluster's GPUs is failing.