cds-snc / notification-planning-core

Project planning for GC Notify Core Team

K8 Worker Node Upgrade Timed Out In Prod #266

Closed by ben851 4 months ago

ben851 commented 6 months ago

Describe the bug

During the K8s worker node upgrade in production, Terraform failed and rolled back because pod eviction timed out. It was unclear which pods failed to evict. Manually deleting pods from the nodes that had scheduling disabled, while running the Terraform apply again, resolved the issue.

We did not see the same behaviour in staging.
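Since the failing pods were not identified, a small diagnostic could help narrow them down on the next upgrade. The following is only a sketch under stated assumptions: it assumes `kubectl` is installed and pointed at the affected cluster, and it assumes (the issue does not confirm this) that a PodDisruptionBudget allowing zero disruptions might be what blocked eviction. The function names are hypothetical and not part of any existing Notify tooling.

```python
"""Sketch: list pods still on cordoned nodes and PDBs that allow no disruptions."""
import json
import subprocess


def kubectl_json(*args):
    """Run kubectl with JSON output and return the parsed result."""
    out = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def cordoned_nodes(nodes):
    """Names of nodes with scheduling disabled (spec.unschedulable is true)."""
    return {
        n["metadata"]["name"]
        for n in nodes.get("items", [])
        if n.get("spec", {}).get("unschedulable", False)
    }


def pods_on_nodes(pods, node_names):
    """Sorted (namespace, pod, node) for non-DaemonSet pods on the given nodes."""
    found = []
    for p in pods.get("items", []):
        node = p.get("spec", {}).get("nodeName")
        if node not in node_names:
            continue
        owners = p["metadata"].get("ownerReferences", [])
        if any(o.get("kind") == "DaemonSet" for o in owners):
            continue  # assumption: the drain leaves DaemonSet pods in place
        found.append((p["metadata"]["namespace"], p["metadata"]["name"], node))
    return sorted(found)


def pdbs_blocking_eviction(pdbs):
    """Sorted (namespace, name) of PDBs whose status allows 0 disruptions.

    A PDB with no reported status is treated as blocking, to be conservative.
    """
    return sorted(
        (b["metadata"]["namespace"], b["metadata"]["name"])
        for b in pdbs.get("items", [])
        if b.get("status", {}).get("disruptionsAllowed", 0) == 0
    )


def report():
    """Query the current cluster and print eviction candidates and suspect PDBs."""
    nodes = kubectl_json("get", "nodes")
    pods = kubectl_json("get", "pods", "--all-namespaces")
    pdbs = kubectl_json("get", "poddisruptionbudgets", "--all-namespaces")
    for ns, name, node in pods_on_nodes(pods, cordoned_nodes(nodes)):
        print(f"still on cordoned node {node}: {ns}/{name}")
    for ns, name in pdbs_blocking_eviction(pdbs):
        print(f"PDB allows 0 disruptions: {ns}/{name}")
```

Calling `report()` while the Terraform apply is waiting on a drain would list the pods that have not yet left the cordoned nodes, which is the set that was deleted by hand in this incident. If none of the listed PDBs match those pods, the PDB assumption is probably wrong and the cause lies elsewhere.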

To Reproduce

Run the Terraform apply for the worker node upgrade in production.

Expected behavior

Terraform should be able to upgrade the nodes on the K8s cluster without manual intervention.

Impact

Impact on Notify team: This causes toil when doing Terraform releases.