Closed by tatarsky 8 years ago
Confirmed manually that all other GPU nodes report the correct count of 4 GPUs. The alerts are correct. I added a step to my reboot process to check the GPU count before returning a node to service.
The reboot fixed the matter. The unit has been returned to service.
After its recent reboot, gpu-1-8 came up one GPU short, and I did not see my alert for that situation. I have logged this in case we need to contact Exacct.
There are currently only batch jobs on it, so I have offlined it again and will try a reboot or request hardware reseats. Jobs already on it will run to completion.
I have double-checked that my alerts are correct on all other GPU nodes, and I believe they are. I just didn't see the Nagios email for gpu-1-8.
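For anyone who wants to repeat the post-reboot GPU count check by hand, here is a rough sketch of what such a check could look like. It is not the actual Nagios check used on these nodes. It assumes `nvidia-smi -L` is available on the node and prints one `GPU N: ...` line per device, and it uses the standard Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The names `count_gpus`, `check`, `main`, and `EXPECTED_GPUS` are made up for illustration.

```python
import re
import subprocess

# Assumption from the thread: every GPU node should report 4 GPUs.
EXPECTED_GPUS = 4


def count_gpus(listing: str) -> int:
    """Count lines shaped like 'GPU 0: <name> (UUID: ...)'."""
    return sum(
        1 for line in listing.splitlines()
        if re.match(r"GPU \d+:", line.strip())
    )


def check(listing: str, expected: int = EXPECTED_GPUS):
    """Return (nagios_exit_code, message) for a given nvidia-smi -L listing."""
    found = count_gpus(listing)
    if found == expected:
        return 0, f"OK - {found} GPUs present"
    return 2, f"CRITICAL - {found} of {expected} GPUs present"


def main() -> int:
    """Run nvidia-smi -L on this node and report in Nagios plugin style."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # Missing binary, non-zero exit, or timeout: state is unknown.
        print(f"UNKNOWN - could not run nvidia-smi: {exc}")
        return 3
    code, message = check(result.stdout)
    print(message)
    return code
```

Calling `main()` on a node and passing its return value to `sys.exit` would make it usable as a plugin-style check. Running it once by hand before onlining a node would cover the case where the alert email is missed.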