deinstapel / eks-rolling-update

EKS Rolling Update is a utility for updating the launch configuration of worker nodes in an EKS cluster.
Apache License 2.0

Failed node update will leave autoscaler in disabled state #25

Open Jasper-Ben opened 11 months ago

Jasper-Ben commented 11 months ago

When an eks-rolling-update job fails, the previous cluster state is not automatically restored; manual intervention is required instead:

```
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,138 INFO     InstanceId i-026bce300ffa7d8d0 is node ip-10-208-33-228.eu-central-1.compute.internal in kubernetes land
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,138 INFO     Draining worker node with kubectl drain ip-10-208-33-228.eu-central-1.compute.internal --ignore-daemonsets --delete-emptydir-data --timeout=300s...
iris-devops-rolling-node-update-manual-42b-hrvhd node/ip-10-208-33-228.eu-central-1.compute.internal already cordoned
iris-devops-rolling-node-update-manual-42b-hrvhd error: unable to drain node "ip-10-208-33-228.eu-central-1.compute.internal" due to error:cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg, continuing command...
iris-devops-rolling-node-update-manual-42b-hrvhd There are pending nodes to be drained:
iris-devops-rolling-node-update-manual-42b-hrvhd  ip-10-208-33-228.eu-central-1.compute.internal
iris-devops-rolling-node-update-manual-42b-hrvhd cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 INFO     Node not drained properly. Exiting
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    ('Rolling update on ASG failed', 'ci-runner-kas-20230710121010942300000012')
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    *** Rolling update of ASG has failed. Exiting ***
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    AWS Auto Scaling Group processes will need resuming manually
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    Kubernetes Cluster Autoscaler will need resuming manually
```

Most notably, the Kubernetes Cluster Autoscaler is left scaled down to 0. This is an issue, as our workloads (especially CI) depend heavily on functioning auto-scaling.
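One possible shape for a fix (a minimal sketch, not the tool's actual API; `scaling_guard`, `suspend`, and `resume` are all illustrative names) is to run the update inside a guard that always restores the suspended scaling processes, even when a drain step raises:

```python
from contextlib import contextmanager

@contextmanager
def scaling_guard(suspend, resume):
    """Suspend auto-scaling for the duration of a rolling update and
    guarantee it is resumed afterwards, even if the update fails.

    `suspend` and `resume` are injected callables (e.g. wrappers around
    boto3 suspend_processes/resume_processes and scaling the Cluster
    Autoscaler deployment back up), which keeps the recovery logic
    testable without AWS access."""
    suspend()
    try:
        yield
    finally:
        # Runs on success *and* on failure, so a failed drain no longer
        # leaves the ASG processes and Cluster Autoscaler disabled.
        resume()
```

With this pattern, the cleanup in the `finally` block replaces the "will need resuming manually" error path shown in the log above.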

Jasper-Ben commented 6 months ago

These exceptions should trigger an uncordon of the affected nodes:

https://github.com/deinstapel/eks-rolling-update/blob/master/eksrollup/lib/k8s.py#L195-L198

@martin31821 please look into it. Thx :slightly_smiling_face:
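For the drain path specifically, a hedged sketch of what that could look like (hypothetical function names; the real logic lives around the linked lines in `eksrollup/lib/k8s.py`) is to catch the drain failure and uncordon the node before re-raising:

```python
def drain_node(node_name, run):
    """Drain `node_name` via an injected `run(cmd)` callable (e.g. a
    subprocess wrapper). If the drain fails -- for instance because a
    pod with no controller blocks eviction -- uncordon the node before
    propagating the error, so an aborted rolling update does not leave
    the node unschedulable."""
    try:
        run(["kubectl", "drain", node_name,
             "--ignore-daemonsets", "--delete-emptydir-data",
             "--timeout=300s"])
    except Exception:
        # `kubectl drain` cordons the node first, then evicts pods;
        # undo the cordon that the failed drain left behind.
        run(["kubectl", "uncordon", node_name])
        raise
```

Injecting `run` rather than calling `subprocess` directly makes the failure path easy to exercise in a unit test with a stubbed command runner.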