When doing the K8 worker node upgrade in production, Terraform failed and rolled back due to a timeout on evicting pods. It was unclear which pods failed to evict. Manually deleting pods from nodes with scheduling disabled while running the TF apply again resolved the issue.
We did not see the same behaviour in staging
To Reproduce
TF Apply node upgrade in production
Expected behavior
Terraform should be able to upgrade the nodes on the K8s cluster without manual intervention
Impact
Impact on Notify team:
This causes toil when doing terraform releases
Describe the bug
When doing the K8 worker node upgrade in production, Terraform failed and rolled back due to a timeout on evicting pods. It was unclear which pods failed to evict. Manually deleting pods from nodes with scheduling disabled while running the TF apply again resolved the issue.
We did not see the same behaviour in staging
To Reproduce
TF Apply node upgrade in production
Expected behavior
Terraform should be able to upgrade the nodes on the K8s cluster without manual intervention
Impact
Impact on Notify team: This causes toil when doing terraform releases