This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples that build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License
Initial version of GPU ECC error checking tool #617
- Conveniently reports all relevant GPU ECC counts in a report.
- Analyses the GPU ECC errors and recommends which action to take to recover from them, e.g. no action (the node is healthy), reboot the node, or submit a support request to report an unhealthy node.
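To make the recommendation step concrete, here is a minimal Python sketch of how ECC counts could be mapped to one of the three actions listed above. It is not the tool's actual logic: the `EccCounts` fields, the `AGGREGATE_DBE_LIMIT` threshold, and the decision order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EccCounts:
    """Hypothetical container for one GPU's ECC counters."""
    volatile_dbe: int = 0   # uncorrectable errors since the last reboot
    aggregate_dbe: int = 0  # uncorrectable errors over the GPU's lifetime


# Hypothetical threshold, not taken from the tool in this PR.
AGGREGATE_DBE_LIMIT = 10


def recommend(counts: EccCounts) -> str:
    """Return one of the three actions named in the PR description."""
    if counts.aggregate_dbe >= AGGREGATE_DBE_LIMIT:
        # Too many lifetime errors: treat the node as unhealthy.
        return "submit support request"
    if counts.volatile_dbe > 0:
        # Errors since boot: a reboot may clear the condition.
        return "reboot node"
    return "no action"
```

A real policy would depend on the GPU model and on vendor guidance, so the threshold should be read as a placeholder.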
I would suggest returning unique error codes when issues are detected.
This would allow a calling script (e.g. NHC) to easily capture and identify errors.
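As a rough illustration of this suggestion, the sketch below shows how a caller might interpret distinct exit codes. The code values (0, 10, 20), the `EccExit` names, and the `run_check` helper are hypothetical and are not defined by the tool in this PR.

```python
import subprocess
from enum import IntEnum
from typing import Sequence


class EccExit(IntEnum):
    """Hypothetical exit codes, one per recommended action."""
    HEALTHY = 0
    REBOOT_REQUIRED = 10
    SUPPORT_REQUEST = 20


def describe(returncode: int) -> str:
    """Translate an exit code into a name the caller can act on."""
    try:
        return EccExit(returncode).name
    except ValueError:
        # Any code outside the agreed set is reported as unknown.
        return "UNKNOWN"


def run_check(cmd: Sequence[str]) -> str:
    """Run the checker command and classify its exit status."""
    result = subprocess.run(cmd, check=False)
    return describe(result.returncode)
```

With codes like these, a wrapper such as NHC could branch on the exit status alone instead of parsing the report text.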