BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

INC0098572 CITZ - SILVER - Hardware incident involving MCS-SILVER-APP-14.DMZ #4986

Closed: wmhutchison closed this 1 month ago

wmhutchison commented 1 month ago

Describe the issue

Checklist

wmhutchison commented 1 month ago

Crossed out the check-box involving an MS Teams call. This issue requires the Platform Ops on-call team and the SPOC to do the due diligence for making the P3 incident a P2, which is in progress and should be completed soon.

Also proactively adjusted Nagios monitoring so that EFK logging checks will not page out for the affected node.

wmhutchison commented 1 month ago

A forced node drain was performed within 15 minutes of the original alert. A full list of affected namespaces and pods was posted in #devops-alerts. The forced node drain remediated the issue, and the original INC ticket is now resolved. Will now move this to Blocked while we await DXC Situation Management's call on whether this will require an Outage report; will gather data for that if needed.

wmhutchison commented 1 month ago

A new Problem ticket is now open (will open a ZenHub ticket to match) to track the follow-up with vendor support for investigating and resolving the hardware issue so that the node can be put back into production.

wmhutchison commented 1 month ago

No request from SitMan for an outage report, closing this off.