Azure / azurehpc

This repository provides automation scripts for building an HPC environment in Azure. It also includes examples that build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License

Initial version of GPU ECC error checking tool #617

Closed by garvct 2 years ago

garvct commented 2 years ago

- Conveniently collects all relevant GPU ECC counts in a single report.
- Analyses the GPU ECC errors and recommends what action to take to recover from them (e.g. no action because the node is healthy, reboot the node to recover, or submit a support request to report an unhealthy node).

vanzod commented 2 years ago

I would suggest returning unique error codes when issues are detected. This would allow a calling script (e.g. NHC) to easily capture and identify errors.
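
As a rough illustration of that suggestion, here is a minimal Python sketch that maps each recommendation to its own exit code. The names `EccStatus` and `worst_status`, and the numeric values 0, 10 and 20, are assumptions made up for this example; they are not taken from the tool in this PR.

```python
from enum import IntEnum


class EccStatus(IntEnum):
    """Hypothetical exit codes, one per recommendation (values are illustrative)."""
    HEALTHY = 0            # no action needed
    REBOOT_REQUIRED = 10   # reboot the node to recover
    SUPPORT_REQUEST = 20   # report an unhealthy node


def worst_status(per_gpu_status):
    """Return the most severe status across all GPUs (higher value = worse)."""
    return max(per_gpu_status, default=EccStatus.HEALTHY)


# Placeholder input: a real tool would derive one status per GPU from its ECC counts.
statuses = [EccStatus.HEALTHY, EccStatus.REBOOT_REQUIRED]
exit_code = int(worst_status(statuses))
print(f"GPU ECC check: {worst_status(statuses).name} (exit code {exit_code})")
# The script would then end with: sys.exit(exit_code)
```

A caller such as an NHC check could then branch on the exit status instead of parsing the report text, treating any nonzero value as a failure and using the specific value to pick the recovery action.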