frazer-lab / cluster

Repo for cluster issues.

fl-oss-1-1 has lost a SAS path because HBA card failed #244

Closed tatarsky closed 6 years ago

tatarsky commented 6 years ago

For reasons unclear, fl-oss-1-1 is not running both SAS paths. I got an alert about it this morning. That's not good. I will ask SDSC if they can eyeball the SAS cables in case one got bumped.

tatarsky commented 6 years ago

We probably have a failed HBA card. We'll discuss the repair process on Monday.

tatarsky commented 6 years ago

Confirmed we have a failed HBA card. I will open an AHPC ticket to see if it is covered under warranty; if not, I will ask for an order to replace it.

tatarsky commented 6 years ago

The unit has 2 x LSI 9300-8e External 12Gb/s SATA/SAS Host Bus Adapters. One is not showing on the PCI-E bus, so it has probably died. A used one costs $328, as it's EOL from Avago.

Let's see if we get one for free as part of the warranty.
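
For anyone wanting to repeat the "is the card on the bus" check, here is a minimal sketch, not the exact procedure used here. It counts the lines of `lspci` output that mention a given controller string and compares that to the number of adapters expected. The match string `SAS3008` and the helper names are assumptions for illustration; check what a healthy card actually reports on your own host before relying on it.

```python
import subprocess


def count_matching_devices(lspci_text, needle):
    """Count lspci output lines that mention `needle` (case-insensitive)."""
    needle = needle.lower()
    return sum(1 for line in lspci_text.splitlines() if needle in line.lower())


def missing_adapters(lspci_text, needle, expected):
    """Return how many of the `expected` adapters are absent from the listing."""
    return max(expected - count_matching_devices(lspci_text, needle), 0)


def check_host(needle="SAS3008", expected=2):
    """Run plain `lspci` on this host and report the number of missing adapters.

    `needle` is an assumed match string, and `expected` defaults to the two
    HBAs this unit is supposed to have.
    """
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return missing_adapters(result.stdout, needle, expected)
```

Calling `check_host()` on the OSS would return 0 when both adapters are visible, and 1 in the state described above, assuming the match string is right for this card.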

tatarsky commented 6 years ago

Ticket opened with AHPC (1010374). I believe we are under 3-year support, which BTW we should add to our discussion, as I believe we are coming up on the end of it quickly. (The cluster was deployed 11/2015.)

hirokomatsui commented 6 years ago

Thanks for catching this. We'll wait for AHPC's answer.

tatarsky commented 6 years ago

They are working on timing. We might take a maintenance window at the same time, but I believe I can fail things over gracefully to the other server if we cannot.

tatarsky commented 6 years ago

AHPC has the card in stock and can replace it anytime we want. Do we want to do this ASAP? It might be safest to have the filesystem quiet or even unmounted. @hirokomatsui what is the compute load coming up over the next few days?

hirokomatsui commented 6 years ago

We want to do that ASAP, and we will hold off on running large jobs. Can you ask them how long it will take?

tatarsky commented 6 years ago

Yep. I can ask. It's a PCI-E card replacement, so the server needs to come all the way down. I can see if I can fail over to the other system starting now.

tatarsky commented 6 years ago

I have confirmed I can fail over all the LUNs to the other server for the replacement. So we do not have to unmount.

But I'm holding for confirmation that they intend to get here either today or tomorrow.

tatarsky commented 6 years ago

AHPC can do the swap tomorrow. No time yet, but I will fail over some additional targets in the morning. I have already done half of them.

hirokomatsui commented 6 years ago

Thanks. We'll keep using the system, but will hold off on submitting large jobs in the meantime.

tatarsky commented 6 years ago

We are scheduled for 11:00 AM tomorrow. Cesar@AHPC will come do the work onsite, per the terms of our support. I will migrate the rest of the mounts before that, probably first thing in the morning.

tatarsky commented 6 years ago

All OSTs are failed over to the other server. Expect reduced performance, but the filesystem should be functional while the work is done.

tatarsky commented 6 years ago

AHPC is onsite working on the unit. It is shut down and they are unracking it.

tatarsky commented 6 years ago

HBA fixed (replaced, that is). I will fail back the OSTs in a moment after some checks. There will be a pause when that happens.

tatarsky commented 6 years ago

Failback complete. Lustre running on both OSS servers again. Multipaths are green. Closing.

hirokomatsui commented 6 years ago

Thank you!

tatarsky commented 6 years ago

You are very welcome.