Should I stop my jobs?
I am running them on the opt queue because they don't have much read/write to do.
Wow, and we lost their IPMI interfaces as well. So whatever happened, it's low level.
I have no way to determine how low, so @hirokomatsui a rack visit may be in order if you wish to see what happened. Several CN nodes continue to run, but those five went completely dark.
@s041629 to be honest, I'd be surprised if your jobs could knock one of these nodes so completely offline (including the internal KVM). But I literally have no way to tell without that access, so it's pretty much a trip to the racks, which I'd love to make as I'm freezing in Wisconsin, but I cannot.
Your network use seemed reasonable, and the ones that didn't vanish are doing jobs I consider "fine" for those AMD quads. But something made five of them go poof!
I will go tomorrow afternoon if needed. We're having a storm here (nothing like the one on the East Coast, but still), which could affect something there.
Sounds good. It might have been a PDU breaker or something, given the number of nodes affected. I could ask the SDSC ops to take a peek if you want to save a trip for just that sort of thing. The fact that the IPMI is off lends some credence to power problems.
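For what it's worth, a quick way to confirm the BMCs really are dark (as opposed to just the OS) is to poll them over the LAN. A minimal sketch; the `-ipmi` hostnames, the `ADMIN` user, and the `$IPMI_PASS` variable are assumptions, not our actual naming:

```bash
# Poll each downed node's BMC; if the BMC itself is unreachable,
# the whole chassis has likely lost power, not just the OS.
for n in cn2 cn5 cn9 cn11 cn12; do
  ipmitool -I lanplus -H "${n}-ipmi" -U ADMIN -P "$IPMI_PASS" chassis power status \
    || echo "${n}: BMC unreachable"
done
```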
I tossed over a low-priority ticket just to see if a PDU circuit tripped. Trying to save you the trip, @hirokomatsui!
Thank you!
The response from SDSC was that they could not tell which node was which because, I guess, the older nodes have no labels. Sadly, the board serial numbers cannot be read from the OS either.
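For the record, these are the usual OS-side ways to read a board serial; on these older boards they apparently come back empty:

```bash
# Standard in-band serial lookups; reportedly return nothing useful here.
sudo dmidecode -s baseboard-serial-number
sudo ipmitool fru print 0
```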
I tried turning on the identify lights for the nodes remaining up just to spot them, but I suspect somebody will have to go over and visually check them.
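If anyone wants to repeat that, a sketch of the sort of thing I ran on each surviving node (assuming local ipmitool access; the duration is in seconds):

```bash
# Blink the chassis identify LED for ~4 minutes so the nodes can be
# spotted in the rack; run locally on each node that is still up.
sudo ipmitool chassis identify 255
```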
I remain fairly suspicious that this is a tripped breaker on the PDU those nodes are plugged into.
Happy to help if you go over; just give me some notice so we can FaceTime it or similar!
Ah, they found it was indeed tripped!
The rack for those devices is K36. The inside-mounted PDU had a tripped Bank-2 circuit breaker, so I reset it. Mostly green lights now, so check and see if that fixed the issue.
Which means I suspect we have an unbalanced node-to-PDU-bank ratio. Perhaps some afternoon we could trace which nodes feed from which bank and rebalance.
First time this has happened in N years, though, so it's also possibly just a fluke.
I'm going to mention this in case something is going on in the SDSC racks.
We just lost cn2, cn9, cn5, cn11, cn12.
If it's because of jobs I think @s041629 is running, be aware....
Remember that the combined bandwidth from the CN (old) nodes is a single 1 Gbit/sec link.
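As a rough worked example (the node count here is hypothetical): ten CN nodes pulling data at once through that shared 1 Gbit/s link get about 100 Mbit/s each, i.e. roughly 12 MB/s per node, so I/O-heavy jobs there will crawl.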