tatarsky closed 6 years ago
We probably have a failed HBA card. We'll discuss the repair process on Monday.
Confirmed we have a failed HBA card. Will open an AHPC ticket to see if it is covered under warranty, or will ask for an order to replace it.
Unit has 2 x LSI 9300-8e External 12Gb/s SATA/SAS Host Bus Adapters. One is not showing on the PCI-E bus, so it has probably died. A used one costs $328, as it's EOL from Avago.
Let's see if we get one for free as part of warranty.
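The "not showing on the PCI-E bus" check can be sketched as a count over captured `lspci` output. This is only a minimal illustration, not the check that was actually run: the sample text, the PCI addresses, and the assumption that the 9300-8e reports itself with the string `SAS3008` are all hypothetical.

```python
# Hypothetical captured `lspci` output for a server that should hold two HBAs.
# The addresses and the "SAS3008" device string are assumptions for illustration.
SAMPLE_LSPCI = """\
00:1f.2 SATA controller: Example onboard SATA controller
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
05:00.0 Ethernet controller: Example 10GbE adapter
"""

EXPECTED_HBAS = 2  # the unit is described as having 2 x LSI 9300-8e


def count_matching_devices(lspci_output, needle="SAS3008"):
    """Count lspci lines whose description contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return sum(1 for line in lspci_output.splitlines() if needle in line.lower())


def missing_hbas(lspci_output, expected=EXPECTED_HBAS, needle="SAS3008"):
    """Return how many expected HBAs are absent from the bus listing (never negative)."""
    return max(0, expected - count_matching_devices(lspci_output, needle))


if __name__ == "__main__":
    found = count_matching_devices(SAMPLE_LSPCI)
    print(f"HBAs visible on the bus: {found} of {EXPECTED_HBAS}")
    print(f"HBAs missing: {missing_hbas(SAMPLE_LSPCI)}")
```

On a live system the text would come from running `lspci` on the server instead of from the sample string.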
Ticket opened with AHPC (1010374). I believe we are under 3-year support, which BTW we should add to our discussion, as I believe we are coming up on the end of it quickly. (Cluster was deployed 11/2015.)
Thanks for catching this. We'll wait for AHPC's answer.
They are working on timing. We might take a maint window at the same time but I believe I can fail things over gracefully to the other server if we cannot.
AHPC has the card in stock and can replace it anytime we want. Do we want to do this ASAP? Might be safest to have the filesystem quiet or even unmounted. @hirokomatsui what is the compute load coming up these next few days?
We want to do that ASAP, and will hold off on running large jobs. Can you ask them how long it will take?
Yep. I can ask. It's a PCI-E card replacement, so the server needs to come all the way down. I can see if I can fail over to the other system starting now.
I have confirmed I can fail over all the LUNs to the other server for the replacement. So we do not have to umount.
But I'm holding for a confirm they intend to get here either today or tomorrow.
AHPC can do the swap tomorrow. No time yet but I will fail over some additional targets in the morning. I did half of them.
Thanks. We'll keep using the system, but will hold off on submitting large jobs in the meantime.
We are scheduled for 11:00AM tomorrow. Cesar@AHPC will come do the work per the terms of our support (onsite). I will migrate the rest of the mounts before that. Probably first thing in the morning.
All OSTs are failed over to the other server. Expect reduced performance but should be functional while the work is done.
AHPC is onsite working on the unit. It is shut down and they are unracking it.
HBA fixed (replaced that is). Will failback OSTs in a moment after some checks. There will be a pause when that happens.
Failback complete. Lustre running on both OSS servers again. Multipaths are green. Closing.
Thank you!
You are very welcome.
For reasons unclear, fl-oss-1-1 is not running both SAS paths. I got an alert about it this morning. That's not overly good. I will try to ask SDSC if they can eyeball the SAS cables in case one got bumped.
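A check for "not running both SAS paths" can be sketched by counting healthy paths per multipath map in captured `multipath -ll` output. This is a rough illustration, not the alert that actually fired: the map names, WWIDs, device names, and the expectation of two paths per map are assumptions.

```python
import re

# Hypothetical `multipath -ll` output; map names, WWIDs and devices are made up.
SAMPLE_MULTIPATH = """\
ost0 (360000000000000000000000000000001) dm-0 VENDOR,PRODUCT
size=10T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 1:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 2:0:0:0 sdc 8:32 active ready running
ost1 (360000000000000000000000000000002) dm-1 VENDOR,PRODUCT
size=10T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 1:0:0:1 sdd 8:48 active ready running
"""

# Matches a path line: H:C:T:L, device name, major:minor, then the state words.
PATH_RE = re.compile(r"(\d+:\d+:\d+:\d+)\s+(\S+)\s+\d+:\d+\s+(.*)$")


def healthy_paths_per_map(multipath_output):
    """Return {map_name: number of paths reported as 'active ready'}."""
    counts = {}
    current = None
    for line in multipath_output.splitlines():
        if not line.strip():
            continue
        match = PATH_RE.search(line)
        if match:
            states = match.group(3).split()
            if current is not None and states[:2] == ["active", "ready"]:
                counts[current] += 1
            continue
        # A map header starts in column 0 with the map name; tree lines
        # start with space, '|' or '`', and the attribute line with "size=".
        if line[0] not in " |`" and not line.startswith("size="):
            current = line.split()[0]
            counts[current] = 0
    return counts


def degraded_maps(counts, expected=2):
    """Return the sorted names of maps with fewer healthy paths than expected."""
    return sorted(name for name, n in counts.items() if n < expected)


if __name__ == "__main__":
    counts = healthy_paths_per_map(SAMPLE_MULTIPATH)
    print("Healthy paths per map:", counts)
    print("Degraded maps:", degraded_maps(counts))
```

A map that appears in the degraded list is still reachable, but only through one HBA or cable, which is consistent with a bumped or unseated SAS cable.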