Closed wmhutchison closed 1 month ago
Crossed out the check-box involving an MS Teams call. This issue requires the Platform Ops on-call team and the SPOC to do the due diligence for making the P3 incident a P2, which is in progress and should be completed soon.
Also proactively adjusted Nagios monitoring so that EFK logging checks will not page out for the affected node.
A forced node drain was performed within 15 minutes of the original alert. A full list of affected namespaces and pods was posted in #devops-alerts. The forced node drain remediated the issue, and the original INC ticket is now resolved. Moving this to Blocked while we await DXC Situation Management's call on whether this will require an Outage report; will gather data for that if needed.
A new Problem ticket (will open a ZenHub ticket to match) is now open to track the follow-up with vendor support to investigate and resolve the hardware issue so that the node can be put back into production.
No request from SitMan for an outage report, closing this off.
Describe the issue
Checklist
Open an MS Teams call. Invite SPOC and someone from Platform Services to the call.