frazer-lab / cluster

Repo for cluster issues.

Odd we just lost several "CN" nodes #267

Closed tatarsky closed 5 years ago

tatarsky commented 5 years ago

I'm going to mention this in case something is going on in the SDSC racks.

We just lost cn2, cn9, cn5, cn11, cn12.

If it's because of the jobs I think @s041629 is running, be aware:

Remember the combined bandwidth from the CN (old) nodes is a single 1Gbit/sec link.
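For a rough feel of what that shared link means per node, here is a minimal sketch of the arithmetic. It assumes an ideal even split with no protocol overhead, and the node counts in the loop are illustrative only, not a statement of how many CN nodes exist.

```python
# Sketch: ideal even share of one shared uplink among active nodes.
# Assumption: perfect fair sharing, decimal units, no protocol overhead.
LINK_GBITS = 1.0  # the single 1 Gbit/sec link mentioned above


def per_node_mbytes_per_sec(active_nodes: int, link_gbits: float = LINK_GBITS) -> float:
    """Return the ideal per-node share of the link in MB/s."""
    if active_nodes < 1:
        raise ValueError("need at least one active node")
    # Gbit/s -> Mbit/s (x1000) -> MB/s (/8), then split evenly.
    return link_gbits * 1000 / 8 / active_nodes


# Illustrative node counts only.
for n in (1, 5, 12):
    print(f"{n:2d} nodes -> {per_node_mbytes_per_sec(n):.1f} MB/s each")
```

Even a handful of I/O-heavy jobs spread over the CN nodes would leave each node with only a few tens of MB/s at best.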

s041629 commented 5 years ago

Should I stop my jobs?

I am running them on the opt queue because I don't have read/write to do

tatarsky commented 5 years ago

Wow, and we lost their IPMI interfaces as well. So whatever happened, it's low-level.

I have no way to determine how low so @hirokomatsui a rack visit may be in order if you wish to see what happened. Several CN nodes continue to run but those

tatarsky commented 5 years ago

@s041629, to be honest I'd be surprised if your jobs could knock one of these nodes so completely offline (including the internal KVM). But I literally have no way to tell without that access, so it's pretty much a trip to the racks, which I'd love to make as I'm freezing in Wisconsin, but I cannot.

Your network use seemed reasonable, and the nodes that didn't vanish are doing jobs I consider "fine" for those AMD quads. But something made five of them go poof!

hirokomatsui commented 5 years ago

I will go tomorrow afternoon if needed. We're having a storm here (nothing like the one on the East Coast, but still), which could affect something there.

tatarsky commented 5 years ago

Sounds good. It might have been a PDU breaker or something due to the number of them. I could ask the SDSC ops to take a peek if you want to save a trip for just that sort of thing. The fact the IPMI is off lends some credence to power problems.
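The reasoning here (host down plus IPMI down points at power) can be written out as a tiny decision table. This is only a sketch of the heuristic, not a diagnostic tool; the function name and the category strings are made up for illustration.

```python
# Sketch of the heuristic: what does host/BMC reachability suggest?
# The function name and category labels are hypothetical.
def likely_cause(host_up: bool, bmc_up: bool) -> str:
    if host_up and bmc_up:
        return "healthy"
    if host_up and not bmc_up:
        return "BMC or management-network fault"
    if not host_up and bmc_up:
        return "OS crash or hang (board still has power)"
    # Host and BMC both dark: the BMC normally runs on standby power,
    # so losing both at once points below the OS.
    return "power or network loss"


# What was observed for the five lost nodes: host down, IPMI down.
for node in ("cn2", "cn5", "cn9", "cn11", "cn12"):
    print(node, "->", likely_cause(host_up=False, bmc_up=False))
```

Five nodes landing in the last category at the same moment is what makes a shared component such as a PDU breaker a plausible suspect.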

tatarsky commented 5 years ago

I tossed over a low-priority ticket just to see if a PDU circuit tripped. Trying to save you the trip, @hirokomatsui!

hirokomatsui commented 5 years ago

Thank you!

tatarsky commented 5 years ago

Response from SDSC was they could not tell which node was which because I guess the older nodes have no labels. Sadly the serial numbers from the boards cannot be read from the OS either.

I tried turning on Identity lights for the nodes remaining up just to spot them but I suspect somebody will have to go over and visually check them.
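For anyone repeating the identify-light trick, here is a sketch that builds the `ipmitool chassis identify` command for each node still up. It only prints the commands rather than running them. The BMC hostnames (`cnN-ipmi`), the user name, and the list of surviving nodes are all hypothetical; `-E` tells ipmitool to read the password from the `IPMI_PASSWORD` environment variable.

```python
import shlex


def identify_cmd(bmc_host: str, user: str, seconds: int = 240) -> list[str]:
    """Build an ipmitool command that lights the chassis identify LED."""
    # The identify interval is a single byte in IPMI; 0 turns the light off.
    if not 0 <= seconds <= 255:
        raise ValueError("identify interval must be 0-255 seconds")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,   # hypothetical BMC hostname
        "-U", user,       # hypothetical BMC user
        "-E",             # password taken from $IPMI_PASSWORD
        "chassis", "identify", str(seconds),
    ]


# Hypothetical list of nodes that stayed up.
nodes_still_up = ["cn1", "cn3", "cn4"]
for node in nodes_still_up:
    print(shlex.join(identify_cmd(f"{node}-ipmi", "admin")))
```

Whoever is standing at the rack can then treat any unlit chassis as a candidate for one of the missing nodes.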

I still strongly suspect this is a tripped breaker on the PDU that those nodes are plugged into.

Happy to help if you go over; just give me some notice so we can FaceTime it or similar!

tatarsky commented 5 years ago

Ah, they found it was indeed tripped!

The rack for those devices is K36. The inside mounted PDU had a Bank-2 circuit breaker tripped so I reset it. Mostly green lights now so check and see if that fixed the issue.

Which means I suspect we have an unbalanced node-to-PDU-bank ratio. Perhaps some afternoon someone could count which nodes go to which bank and rebalance.
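Once someone has written down which node plugs into which bank, the counting and balancing could be sketched like this. The node-to-bank mapping below is made up for illustration (the real one is exactly what nobody knows yet), and the sketch balances node counts only; real balancing should look at per-node power draw against the breaker rating.

```python
from collections import Counter


def bank_counts(node_to_bank: dict[str, str]) -> Counter:
    """Count how many nodes are plugged into each PDU bank."""
    return Counter(node_to_bank.values())


def suggest_moves(node_to_bank: dict[str, str], banks: list[str]) -> list[tuple[str, str, str]]:
    """Greedily move nodes from the fullest bank to the emptiest one
    until no two banks differ by more than one node."""
    assignment = dict(node_to_bank)
    moves = []
    while True:
        counts = {b: 0 for b in banks}
        for bank in assignment.values():
            counts[bank] += 1  # KeyError if a node is on a bank not listed in `banks`
        fullest = max(banks, key=lambda b: counts[b])
        emptiest = min(banks, key=lambda b: counts[b])
        if counts[fullest] - counts[emptiest] <= 1:
            return moves
        node = sorted(n for n, b in assignment.items() if b == fullest)[0]
        assignment[node] = emptiest
        moves.append((node, fullest, emptiest))


# Hypothetical mapping, for illustration only.
mapping = {"cn1": "Bank-1", "cn2": "Bank-2", "cn5": "Bank-2",
           "cn9": "Bank-2", "cn11": "Bank-2", "cn12": "Bank-2"}
print(bank_counts(mapping))
for node, src, dst in suggest_moves(mapping, ["Bank-1", "Bank-2"]):
    print(f"move {node}: {src} -> {dst}")
```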

First time this has happened in N years though so also possibly just a fluke.