Open wmhutchison opened 1 week ago
A case was opened with Red Hat. Red Hat responded that nothing on the OS side in the sosreport explains the failure, and no kernel dump was made.
Updated the case with the following link received from the VMware team, which has some suggestions about disabling soft lockup detection and the NMI watchdog for VMs. From what I can tell, those are already disabled.
For now the node remains cordoned and drained. If possible, I will keep it that way until the week of November 25th, when it is EMERALD's turn for ESXi host maintenance. Otherwise I may uncordon it next week, unless Red Hat comes back with something in response to the VMware link.
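For anyone who wants to double-check the watchdog state on the node, here is a minimal sketch. It assumes a Linux guest that exposes the standard `kernel.watchdog`, `kernel.soft_watchdog`, and `kernel.nmi_watchdog` sysctls under `/proc/sys/kernel`. The function names are hypothetical, and this is not something Red Hat or VMware provided.

```python
from pathlib import Path

# Kernel watchdog sysctls as exposed under /proc/sys/kernel on Linux.
WATCHDOG_KEYS = ("watchdog", "soft_watchdog", "nmi_watchdog")


def read_watchdog_settings(base="/proc/sys/kernel"):
    """Return {key: int or None}; None means the file was missing or unreadable."""
    settings = {}
    for key in WATCHDOG_KEYS:
        try:
            settings[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings


def all_disabled(settings):
    """True only if at least one value was readable and every readable value is 0."""
    readable = [value for value in settings.values() if value is not None]
    return bool(readable) and all(value == 0 for value in readable)


if __name__ == "__main__":
    current = read_watchdog_settings()
    for key, value in current.items():
        print(f"kernel.{key} = {value}")
    print("all watchdogs disabled:", all_disabled(current))
```

A value of 0 means the corresponding detector is off. The script only reads the current runtime values, so it does not show whether a boot parameter or sysctl config file set them.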
Describe the issue A problem ticket was opened in response to an incident involving an unplanned reboot of a worker node in the EMERALD cluster. Investigate and coordinate with vendor support as needed to determine the root cause, if possible.
Blocked Until EMERALD ESXi host maintenance is complete. If no new issues arise from it, the node will be uncordoned at that point.
Additional context Add any other context, attachments or screenshots
How does this benefit the users of our platform? Ensures the root cause is addressed, or otherwise confirms that no issues remain before the affected node is put back into service for user workloads.
Definition of done