CCI-MOC / ops-issues

FX2 Hardware Issues #1075

Open hakasapl opened 1 year ago

hakasapl commented 1 year ago

R4-PA-C08

R4-PA-C10

R4-PA-C21

R4-PA-C22

R4-PA-C24

joachimweyl commented 11 months ago

@hakasapl is this still an issue? If so, can we get Flax or TechSquare to do it?

hakasapl commented 11 months ago

Yes, this is likely a TechSquare ask, but it is not a priority for now. We can send it over to TechSquare once the more pressing issues assigned to them are done.

er1p commented 7 months ago

note that we started on this maintenance list over the weekend

specifically, we addressed

R4-PA-C08

MOC-R4PAC08U37-S3D - Firmware update fail - removed DBCFXK2 - replaced with H9W01Q2

MOC-R4PAC08U31-S3B - Firmware update fail - removed H9P51Q2 - replaced with DR75DH2

R4-PA-C10

MOC-R4PAC10U37-S3B - iDrac unresponsive - removed 50CYKH2 (no drives) - replaced with H9K31Q2 (two 200G drives)

MOC-R4PAC10U35-S3A - iDrac unresponsive - removed 4WQZGQ2 - replaced with DR74DH2

MOC-R4PAC10U37-S3D - Firmware update fail - removed 8CSDB03 - replaced with DB9HXK2

MOC-R4PAC10U37-S3A - Firmware update fail - removed 50CZKH2 - replaced with 9DRH482

MOC-R4PAC10U17-S1 - Firmware update fail - removed 4XL81Q2 - replaced with 15W3DV2

(not in the list above) bad PSU on U33 - replaced & cleared orange error light

note that there is an orange fault indicator on the chassis at U31, but issue is unclear physically

R4-PA-C22

MOC-R4PAC22U35-S1B - Firmware update fail - removed JVBHDH2 - replaced with DR67DH2

MOC-R4PAC22U13 (Chassis) - PSU 2 Failure - PSU repaired

MOC-R4PAC22U13-S3 - Firmware update fail - didn't stay powered on - removed GVYCB03 - replaced with GTHZ4Z2

MOC-R4PAC22U11-S3 - Won't stay powered on - removed BGMRRZ2 - replaced with B6NQRZ2

we plan to return again later in the week to continue the replacements
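The node names in the lists above appear to embed the rack label from the top of this issue (for example, MOC-R4PAC08U37-S3D sits under R4-PA-C08). A minimal Python sketch for splitting a name into rack, unit, and slot, which may help when grouping these entries per rack. The regex, `parse_node`, `NodeLocation`, and the reading of `U` as the rack unit and the `S` suffix as the sled/slot are assumptions inferred from the names in this thread, not a documented naming scheme.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names seen in this thread, e.g.
# MOC-R4PAC10U37-S3B (node) or MOC-R4PAC22U13 (chassis, no slot suffix).
NODE_RE = re.compile(
    r"^MOC-R(?P<row>\d+)P(?P<pod>[A-Z])C(?P<cab>\d{2})U(?P<unit>\d{2})"
    r"(?:-(?P<slot>S\d[A-Z]?))?$"
)


class NodeLocation(NamedTuple):
    rack: str            # rack label in the style used in this issue, e.g. "R4-PA-C08"
    unit: int            # assumed to be the rack unit (U position)
    slot: Optional[str]  # assumed sled/slot suffix, None for chassis-level names


def parse_node(name: str) -> NodeLocation:
    """Split a node name into rack label, unit, and optional slot."""
    m = NODE_RE.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognized node name: {name!r}")
    rack = f"R{m['row']}-P{m['pod']}-C{m['cab']}"
    return NodeLocation(rack, int(m["unit"]), m["slot"])
```

For example, `parse_node("MOC-R4PAC08U37-S3D")` yields rack `R4-PA-C08`, unit `37`, slot `S3D`; names that do not match the assumed pattern raise `ValueError` rather than being guessed at.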

er1p commented 7 months ago

to coordinate the chassis maintenance

R4-PA-C22

MOC-R4PAC22U37 (Chassis) - iDracs in Chassis Unreachable, CMC is

R4-PA-C21

MOC-R4PAC21U09-S1 - CMC Unreachable

MOC-R4PAC21U09-S3 - CMC Unreachable

if you are able to migrate all workload from the nodes in C22/U37 and C21/U09, then we will swap the chassis and / or CMC

the same nodes would return to service after the outages, but they will be powered down during the maintenance

joachimweyl commented 6 months ago

@hakasapl what are the next steps?

er1p commented 6 months ago

note that for the hardware part of this, we need at least one more visit to complete the replacements

if the above replaced nodes have not yet been put into service, we can also recheck the bios clearing - as requested on Tuesday

er1p commented 6 months ago

updating based on additional work this weekend

R4-PA-C24

MOC-R4PAC24U35-S1D - CMOS battery failure - removed JGDVJH2 - replaced with H9S01Q2

MOC-R4PAC24U33-S3A - CMOS battery failure - removed JVC2KH2 - replaced with JGD1KH2

MOC-R4PAC24U33-S3D - never came back up from factory reset - false positive (?) seems to be running

MOC-R4PAC24U31-S1B - BIOS update failure - DIMM error - removed DBKGXK2 - replaced with 6R7BSZ2

R4-PA-C21

MOC-R4PAC21U37-S1B - Firmware update failed, replace node - false positive (?) - seems to be booted & running

MOC-R4PAC21U37-S1D - Won't stay powered on - removed 51G5LH2 - replaced with 4WW71Q2

MOC-R4PAC21U37-S3C - Won't stay powered on - removed 5084LH2 - replaced with 5044LH2

MOC-R4PAC21U35-S1D - Firmware update failed, replace node - removed H9L51Q2 - replaced with JVCXJH2

MOC-R4PAC21U31-S1D - Firmware update failed, replace node - removed DB7CXK2 - replaced with H9R31Q2

MOC-R4PAC21U25-S3 - Mellanox NIC not in boot options - replaced Mellanox NIC - new MAC addresses 24:8A:07:1E:85:B4 and :B5

MOC-R4PAC21U13-S1 - Mellanox NIC not in boot options - replaced Mellanox NIC - new MAC addresses EC:0D:9A:D4:94:90 and :91

MOC-R4PAC21U09-S1 - CMC Unreachable - reset DHCP configuration on CMC

MOC-R4PAC21U09-S3 - CMC Unreachable - reset DHCP configuration on CMC
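The replaced NICs are listed with a full MAC address for the first port and only the last octet for the second (for example "EC:0D:9A:D4:94:90 and :91"). A minimal Python sketch for expanding that shorthand into two full addresses, which may be handy when updating DHCP / PXE records. The helper name `expand_mac_pair` is hypothetical, and the sketch assumes the two ports share the first five octets, as the shorthand in this comment implies.

```python
from typing import Tuple


def expand_mac_pair(first: str, shorthand: str) -> Tuple[str, str]:
    """Expand a 'full MAC and :XX' shorthand into two full MAC addresses.

    Assumes both ports share the first five octets and that `shorthand`
    is the last octet of the second port, with or without a leading colon.
    """
    octets = first.strip().upper().split(":")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError(f"not a colon-separated MAC address: {first!r}")
    last = shorthand.strip().lstrip(":").upper()
    if len(last) != 2:
        raise ValueError(f"not a single-octet shorthand: {shorthand!r}")
    for part in octets + [last]:
        int(part, 16)  # raises ValueError if any octet is not valid hex
    return ":".join(octets), ":".join(octets[:-1] + [last])
```

For example, `expand_mac_pair("EC:0D:9A:D4:94:90", ":91")` returns the pair `EC:0D:9A:D4:94:90` and `EC:0D:9A:D4:94:91`. Malformed input raises `ValueError` instead of producing a partial address.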

joachimweyl commented 6 months ago

@er1p I see a lot of replacements. Were they tested and confirmed working? The repairs are broken down into multiple comments. Can you confirm how many of the nodes that were broken are now up and running?

@hakasapl & @msdisme what are the plans for all of the replaced nodes? Are we going to try to fix them or just retire them?

er1p commented 6 months ago

@joachimweyl unless we messed up managing the list, the only thing not addressed is

MOC-R4PAC22U37 (Chassis) - iDracs in Chassis Unreachable, CMC is

(didn't review that one in either visit)

all the other nodes were confirmed running at the hardware level - booting & waiting for DHCP / PXE boot

for all the "broken" systems, we took the units away with us and intend to repair / refresh into future spares (if repairable) or send along to ITAD (if hopeless)