Open hakasapl opened 1 year ago
@hakasapl is this still an issue? If so, can we get Flax or Techsquare to do it?
Yes, this is likely a techsquare ask, but it is not a priority for now. We can send it over to techsquare after the more pressing issues assigned to them are done.
note that we started on this maintenance list over the weekend
specifically, we addressed
R4-PA-C08
MOC-R4PAC08U37-S3D - Firmware update fail - removed DBCFXK2 - replaced with H9W01Q2
MOC-R4PAC08U31-S3B - Firmware update fail - removed H9P51Q2 - replaced with DR75DH2
R4-PA-C10
MOC-R4PAC10U37-S3B - iDrac unresponsive - removed 50CYKH2 (no drives) - replaced with H9K31Q2 (two 200G drives)
MOC-R4PAC10U35-S3A - iDrac unresponsive - removed 4WQZGQ2 - replaced with DR74DH2
MOC-R4PAC10U37-S3D - Firmware update fail - removed 8CSDB03 - replaced with DB9HXK2
MOC-R4PAC10U37-S3A - Firmware update fail - removed 50CZKH2 - replaced with 9DRH482
MOC-R4PAC10U17-S1 - Firmware update fail - removed 4XL81Q2 - replaced with 15W3DV2
(not in the list above) bad PSU on U33 - replaced & cleared orange error light
note that there is an orange fault indicator on the chassis at U31, but issue is unclear physically
R4-PA-C22
MOC-R4PAC22U35-S1B - Firmware update fail - removed JVBHDH2 - replaced with DR67DH2
MOC-R4PAC22U13 (Chassis) - PSU 2 Failure - PSU repaired
MOC-R4PAC22U13-S3 - Firmware update fail - didn't stay powered on - removed GVYCB03 - replaced with GTHZ4Z2
MOC-R4PAC22U11-S3 - Won't stay powered on - removed BGMRRZ2 - replaced with B6NQRZ2
we plan to return again later in the week to continue the replacements
to coordinate the chassis maintenance
R4-PA-C22
MOC-R4PAC22U37 (Chassis) - iDracs in Chassis Unreachable, CMC is
R4-PA-C21
MOC-R4PAC21U09-S1 - CMC Unreachable
MOC-R4PAC21U09-S3 - CMC Unreachable
if you are able to migrate all workload from the nodes in C22/U37 and C21/U09, then we will swap the chassis and / or CMC
the same nodes would return to service after the outages but will power down during the maintenance
@hakasapl what are the next steps?
note that for the hardware part of this, we need at least one more visit to complete the replacements
if the above replaced nodes have not yet been put into service, we can also recheck the bios clearing - as requested on Tuesday
updating based on additional work this weekend
R4-PA-C24
MOC-R4PAC24U35-S1D - CMOS battery failure - removed JGDVJH2 - replaced with H9S01Q2
MOC-R4PAC24U33-S3A - CMOS battery failure - removed JVC2KH2 - replaced with JGD1KH2
MOC-R4PAC24U33-S3D - never came back up from factory reset - false positive (?) seems to be running
MOC-R4PAC24U31-S1B - BIOS update failure - DIMM error - removed DBKGXK2 - replaced with 6R7BSZ2
R4-PA-C21
MOC-R4PAC21U37-S1B - Firmware update failed, replace node - false positive (?) - seems to be booted & running
MOC-R4PAC21U37-S1D - Won't stay powered on - removed 51G5LH2 - replaced with 4WW71Q2
MOC-R4PAC21U37-S3C - Won't stay powered on - removed 5084LH2 - replaced with 5044LH2
MOC-R4PAC21U35-S1D - Firmware update failed, replace node - removed H9L51Q2 - replaced with JVCXJH2
MOC-R4PAC21U31-S1D - Firmware update failed, replace node - removed DB7CXK2 - replaced with H9R31Q2
MOC-R4PAC21U25-S3 - Mellanox NIC not in boot options - replaced Mellanox NIC; new MAC addresses 24:8A:07:1E:85:B4 and :B5
MOC-R4PAC21U13-S1 - Mellanox NIC not in boot options - replaced Mellanox NIC; new MAC addresses EC:0D:9A:D4:94:90 and :91
MOC-R4PAC21U09-S1 - CMC Unreachable - reset DHCP configuration on CMC
MOC-R4PAC21U09-S3 - CMC Unreachable - reset DHCP configuration on CMC
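In case it helps whoever updates DHCP / inventory records for the two replaced NICs: the "full MAC and :NN" shorthand above can be expanded mechanically. This is only a sketch. The function name `expand_mac_pair` is made up for illustration, and it assumes the shorthand always means "same first five octets, different last octet".

```python
import re

# Hypothetical helper, not part of any existing tooling.
# Assumes the shorthand "<full MAC> and :<last octet>" as used in this thread.
MAC_RE = re.compile(r"^[0-9A-F]{2}(:[0-9A-F]{2}){5}$")

def expand_mac_pair(text):
    """Expand 'EC:0D:9A:D4:94:90 and :91' into two full MAC addresses."""
    first, sep, suffix = text.partition(" and :")
    first = first.strip().upper()
    suffix = suffix.strip().upper()
    if not sep or not MAC_RE.match(first) or not re.fullmatch(r"[0-9A-F]{2}", suffix):
        raise ValueError(f"unrecognized MAC shorthand: {text!r}")
    # Reuse the first five octets and swap in the new last octet.
    second = ":".join(first.split(":")[:5] + [suffix])
    return [first, second]

if __name__ == "__main__":
    print(expand_mac_pair("EC:0D:9A:D4:94:90 and :91"))
```

Anything that does not match the shorthand raises `ValueError` rather than guessing, so a mistyped address gets caught instead of silently recorded.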
@er1p I see a lot of replacements, were the replacements tested and confirmed working? The repairs are broken down into multiple comments. Can you confirm how many of the nodes that were broken are now up and running?
@hakasapl & @msdisme what are the plans for all of the replaced nodes, are we going to try to fix them or just retire them?
@joachimweyl unless we messed up managing the list, the only thing not addressed is
MOC-R4PAC22U37 (Chassis) - iDracs in Chassis Unreachable, CMC is
(didn't review that one in either visit)
all the other nodes were confirmed running at the hardware level - booting & waiting for DHCP / PXE boot
for all the "broken" systems, we took the units away with us and intend to repair / refresh into future spares (if repairable) or send along to ITAD (if hopeless)
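To make it easier to cross-check the list against inventory, here is a rough sketch of how the "NODE - issue - removed OLD - replaced with NEW" entries from the comments above could be parsed. The node-name pattern is only inferred from the names in this thread (e.g. MOC-R4PAC21U37-S1D under rack R4-PA-C21), and `parse_node` / `parse_entry` are hypothetical names, not existing tooling.

```python
import re

# Assumed pattern, inferred from names in this thread:
# MOC-R4PAC21U37-S1D -> rack R4-PA-C21, unit U37, slot S1D (slot is optional).
NODE_RE = re.compile(r"^MOC-(R\d+)(P[A-Z])(C\d+)(U\d+)(?:-(S\w+))?$")

def parse_node(name):
    """Split a node name into rack label, rack unit, and optional slot."""
    m = NODE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized node name: {name!r}")
    row, pod, cab, unit, slot = m.groups()
    return {"rack": f"{row}-{pod}-{cab}", "unit": unit, "slot": slot}

def parse_entry(line):
    """Parse one maintenance entry; service tags stay None if not present."""
    parts = [p.strip() for p in line.split(" - ")]
    # First field is the node name, possibly followed by a note like "(Chassis)".
    entry = parse_node(parts[0].split()[0])
    entry["removed"] = None
    entry["replaced_with"] = None
    for p in parts[1:]:
        if p.startswith("removed "):
            entry["removed"] = p.split()[1]
        elif p.startswith("replaced with "):
            entry["replaced_with"] = p.split()[2]
    return entry

if __name__ == "__main__":
    line = "MOC-R4PAC21U37-S1D - Won't stay powered on - removed 51G5LH2 - replaced with 4WW71Q2"
    print(parse_entry(line))
```

Entries with no swap (PSU repairs, CMC resets, NIC replacements) come back with `removed` and `replaced_with` left as `None`, which gives a quick way to separate node swaps from other repairs.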
R4-PA-C08
R4-PA-C10
R4-PA-C21
R4-PA-C22
R4-PA-C24