BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3).
Apache License 2.0

PRB0041013 CITZ - MCS - Problem ticket to track HPE unexpected reboots #5389

Open wmhutchison opened 4 days ago

wmhutchison commented 4 days ago

Describe the issue

This ticket will track effort spent investigating some recent reboot issues in SILVER involving HPE gear, with no discernible hardware events causing the reboots.

Additional context

Vendor cases:

Related incidents:

The affected nodes are hardware servers, but there are no hardware support tickets since this is not a hardware issue.

How does this benefit the users of our platform?

Ensuring we have stable nodes to offer a consistent experience for our users.

Definition of done

wmhutchison commented 4 days ago

https://access.redhat.com/support/cases/#/case/03990081 is the active case right now. Based on support feedback, the root cause is the kernel.

The RHEL9 kernel fix: https://access.redhat.com/errata/RHSA-2024:9497

Link showing OCP releases and their specific kernel versions: https://access.redhat.com/solutions/7077108

We are waiting for Red Hat to put out an OCP version with the required kernel. Since OCP 4.14.41 was released on November 20th without the new kernel, we would be waiting for OCP 4.14.42 to include it.
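For checking whether a given node or OCP release already carries the fixed kernel, a minimal sketch along these lines might help. It assumes kernel release strings in the usual RHEL form (as shown in the KERNEL-VERSION column of `oc get nodes -o wide`). The function names `kernel_key` and `has_fix` are hypothetical, and the version strings in the usage note are placeholders, not the actual versions from the erratum.

```python
import re


def kernel_key(version: str) -> tuple[int, ...]:
    """Turn a RHEL-style kernel release string into a tuple of ints."""
    # Drop the distribution tag and architecture (e.g. ".el9_2.x86_64").
    numeric = re.split(r"\.el\d", version, maxsplit=1)[0]
    # Split the remainder on dots and dashes and keep the numeric fields.
    return tuple(int(p) for p in re.split(r"[.-]", numeric) if p.isdigit())


def has_fix(node_kernel: str, fixed_kernel: str) -> bool:
    """Return True if node_kernel is at or above fixed_kernel.

    Assumption: both strings come from the same RHEL minor stream,
    so a plain field-by-field comparison is meaningful.
    """
    return kernel_key(node_kernel) >= kernel_key(fixed_kernel)
```

With placeholder values, `has_fix("5.14.0-284.90.1.el9_2.x86_64", "5.14.0-284.88.1.el9_2")` would report that the node is at or past the fixed build, while an older build number would not. The real fixed version should be taken from the erratum and the Red Hat solution linked above.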

At present we will likely stay on course and fix this as part of the official OCP 4.16 upgrade, but if the issue worsens, we might need to rethink this and apply the latest OCP 4.14 release in SILVER as soon as possible.