BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

PRB0041010 CITZ - MCS EMERALD - Node MCS-EMERALD-APP-01.DMZ rebooted - root cause analysis #5298

Open wmhutchison opened 1 week ago

wmhutchison commented 1 week ago

Describe the issue A problem ticket was opened in response to an incident involving an unplanned reboot of a worker node in the EMERALD cluster. Investigate, and coordinate with vendor support as needed, to determine the root cause if possible.

Blocked Until EMERALD ESXi host maintenance is complete and no new issues arise from it. The node will be uncordoned at that point.


How does this benefit the users of our platform? Ensures the root cause is addressed, or otherwise confirms that no issues remain before the affected node is put back into service for user workloads.

Definition of done

wmhutchison commented 1 week ago

A case was opened with Red Hat. They responded to the effect that nothing on the OS side in the sosreport explains the failure, and that no kernel dump was generated.

Updated the case with the following link received from the VMware team, which has some suggestions about disabling soft lockups and the NMI watchdog for VMs. From what I can tell, those are already disabled.
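For anyone wanting to re-check the "already disabled" observation, here is a minimal sketch that reads the standard Linux lockup-detector sysctls (`kernel.watchdog`, `kernel.soft_watchdog`, `kernel.nmi_watchdog`) from `/proc/sys/kernel`. The function names and the `base` parameter are hypothetical, chosen for illustration. It assumes it is run directly on the node with Python available, and it is not necessarily how the settings were originally verified.

```python
from pathlib import Path

# Standard Linux sysctl files that control the lockup detectors.
WATCHDOG_KEYS = ("watchdog", "soft_watchdog", "nmi_watchdog")


def read_watchdog_settings(base="/proc/sys/kernel"):
    """Return {key: int or None} for each watchdog sysctl under base.

    `base` is a hypothetical parameter so the function can be pointed
    at a test directory instead of the live /proc tree.
    """
    settings = {}
    for key in WATCHDOG_KEYS:
        path = Path(base) / key
        try:
            settings[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            settings[key] = None  # file missing, unreadable, or not a number
    return settings


def detectors_disabled(settings):
    """True when both the soft and the NMI (hard) lockup detectors are off.

    kernel.watchdog = 0 turns off both detectors at once; otherwise both
    per-detector values must read 0. A missing value (None) is treated as
    unknown, so the result is False in that case.
    """
    if settings.get("watchdog") == 0:
        return True
    return settings.get("soft_watchdog") == 0 and settings.get("nmi_watchdog") == 0


if __name__ == "__main__":
    current = read_watchdog_settings()
    for key, value in current.items():
        print(f"kernel.{key} = {value}")
    print("lockup detectors disabled:", detectors_disabled(current))
```

A value of `None` means the file was not present or not readable, which is worth checking by hand before concluding anything.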

For now the node remains cordoned and drained. If possible, I will keep it that way until the week of November 25th, when it is EMERALD's turn for ESXi host maintenance. Otherwise, I may uncordon it next week, unless Red Hat comes back with something in response to the VMware link shared.