cBio / cbio-cluster

MSKCC cBio cluster documentation
gpu-1-8 came up short a GPU #365

Closed by tatarsky 8 years ago

tatarsky commented 8 years ago

The recent reboot of gpu-1-8 came up short a GPU, and I did not see my alert for that situation. I have logged this in case we need to contact Exxact.

There are currently batch-only jobs on it, so I have offlined it again and will try a reboot or request hardware reseats. Jobs on it will run to completion.

I have double-checked that my alerts are correct on all other GPU nodes, and I believe they are. I just didn't see the Nagios email for gpu-1-8.

tatarsky commented 8 years ago

Confirmed manually that all other GPU nodes show the correct count of 4 GPUs. Alerts are correct. Added a step to my reboot process to check the GPU count before returning a node to service.
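For anyone wanting to script that pre-return check, here is a rough sketch of one way it could look. It is not the actual alert used on the cluster. It assumes `nvidia-smi` is on the node's PATH and that `nvidia-smi -L` prints one line per device beginning with `GPU `. The function names, the messages, and the use of Nagios-style exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are illustrative choices; the expected count of 4 comes from the comment above.

```python
import subprocess

EXPECTED_GPUS = 4  # per-node count stated in this thread


def count_gpus(listing: str) -> int:
    """Count device lines in `nvidia-smi -L` style output."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def check(listing: str, expected: int = EXPECTED_GPUS):
    """Return (exit_code, message) using Nagios plugin conventions."""
    found = count_gpus(listing)
    if found == expected:
        return 0, f"OK - {found} GPUs present"
    return 2, f"CRITICAL - {found} of {expected} GPUs present"


def main() -> int:
    """Run nvidia-smi on this node and report the result (hypothetical wrapper)."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        # nvidia-smi missing or failing is itself worth flagging
        print(f"UNKNOWN - could not run nvidia-smi: {exc}")
        return 3
    code, message = check(result.stdout)
    print(message)
    return code
```

To use it as a standalone check, a script could end with `raise SystemExit(main())` so the exit code reaches the caller. Running it by hand after a reboot, before onlining the node, would cover the case where the alert email is missed.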

tatarsky commented 8 years ago

Reboot fixed the matter. Unit returned to service.