eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

[Bug] Logs report "failed to acquire semaphore" during deletion #7818

Open artem-nefedov opened 5 months ago

artem-nefedov commented 5 months ago

What were you trying to accomplish?

Delete the cluster (it seems to work fine).

What happened?

Logs report this message during deletion:

[ℹ]  deleting EKS cluster "redacted"
[ℹ]  will drain 0 unmanaged nodegroup(s) in cluster "redacted"
[ℹ]  starting parallel draining, max in-flight of 1
[✖]  failed to acquire semaphore while waiting for all routines to finish: %!w(*errors.errorString=&{context canceled})

Deletion still finished without errors, so it does not look like this affects anything. But the log does look like there's a problem.

The behavior is reproduced on all attempts.
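Side note: the mangled `%!w(...)` in the message is what Go's fmt package produces when the `%w` verb is passed to a plain printf-style formatter instead of fmt.Errorf. A minimal sketch (not eksctl's actual code) that reproduces the same output:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	err := context.Canceled

	// fmt.Errorf understands %w (error wrapping), so this renders cleanly:
	fmt.Println(fmt.Errorf("failed to acquire semaphore: %w", err))
	// failed to acquire semaphore: context canceled

	// A printf-style formatter like Sprintf does not, so %w falls through to
	// fmt's "bad verb" handling, which prints the type and raw value. That is
	// the %!w(*errors.errorString=&{context canceled}) seen in the log above.
	fmt.Println(fmt.Sprintf("failed to acquire semaphore: %w", err))
}
```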

How to reproduce it?

Create a cluster with 1 managed nodegroup and no unmanaged nodegroups, then delete it.

Versions

eksctl version: 0.180.0
EKS version 1.30

The message was not present on eksctl version 0.176.0 with EKS version 1.29 (there are no changes in cluster config besides EKS version).

cPu1 commented 5 months ago

@artem-nefedov, this is a bug in the logging and concurrency handling but it should not affect normal operation of the command. That part of the codebase is a bit dated and could use some refactoring. We'll look into this soon.
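
For anyone curious how a clean deletion can still log "context canceled": a plausible shape of the problem (illustrative sketch only, not the actual eksctl code) is a drain loop guarded by a weighted semaphore inside an errgroup, followed by a final Acquire of the full weight using the errgroup's context. That context is cancelled as soon as the group finishes, and recent x/sync versions of semaphore.Weighted.Acquire return the context's error when it is already cancelled, even if capacity is free.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
	"golang.org/x/sync/semaphore"
)

func main() {
	const maxInFlight = 1 // "max in-flight of 1" from the log above
	sem := semaphore.NewWeighted(maxInFlight)

	g, ctx := errgroup.WithContext(context.Background())

	nodeGroups := []string{} // the reporter's case: 0 unmanaged nodegroups to drain
	for range nodeGroups {
		g.Go(func() error {
			if err := sem.Acquire(ctx, 1); err != nil {
				return err
			}
			defer sem.Release(1)
			// ... drain one nodegroup ...
			return nil
		})
	}

	if err := g.Wait(); err != nil {
		fmt.Println("draining failed:", err)
	}

	// ctx is derived from the errgroup, so it is cancelled as soon as Wait
	// returns, even when every worker succeeded or no worker ran at all.
	// Newer x/sync versions check the context before checking capacity, so
	// this final "wait for all routines" Acquire reports context.Canceled
	// while the deletion itself carries on unaffected.
	if err := sem.Acquire(ctx, maxInFlight); err != nil {
		fmt.Println("failed to acquire semaphore while waiting for all routines to finish:", err)
	}
}
```

Whether the real code matches this shape exactly is for the refactoring to confirm; the sketch is only meant to show why the message is harmless noise rather than a failed drain.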

lgb861213 commented 4 months ago

We also encountered the same error log when deleting an EKS 1.30 cluster; our eksctl version is 0.183. The error message is the following:

2024-07-06 16:56:08 [ℹ] deleting EKS cluster "test"
2024-07-06 16:56:11 [ℹ] will drain 0 unmanaged nodegroup(s) in cluster "aloda-test"
2024-07-06 16:56:11 [ℹ] starting parallel draining, max in-flight of 1
2024-07-06 16:56:11 [✖] failed to acquire semaphore while waiting for all routines to finish: %!w(*errors.errorString=&{context canceled})
2024-07-06 16:56:14 [ℹ] deleted 0 Fargate profile(s)
2024-07-06 16:56:16 [✔] kubeconfig has been updated
2024-07-06 16:56:16 [ℹ] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2024-07-06 16:56:23 [ℹ]

AmitBenAmi commented 4 months ago

Seeing the same issue with version 0.185.0

acarey-haus commented 3 months ago

Seeing this issue with eksctl version 0.187.0 when deleting a nodegroup. The deletion succeeded.

% eksctl delete nodegroup --cluster redacted --name redacted-nodegroup
2024-07-18 13:31:37 [ℹ]  1 nodegroup (redacted-nodegroup) was included (based on the include/exclude rules)
2024-07-18 13:31:37 [ℹ]  will drain 1 nodegroup(s) in cluster "redacted"
2024-07-18 13:31:37 [ℹ]  starting parallel draining, max in-flight of 1
2024-07-18 13:31:37 [!]  no nodes found in nodegroup "redacted-nodegroup" (label selector: "alpha.eksctl.io/nodegroup-name=redacted-nodegroup")
2024-07-18 13:31:37 [✖]  failed to acquire semaphore while waiting for all routines to finish: context canceled
2024-07-18 13:31:37 [ℹ]  will delete 1 nodegroups from cluster "redacted"
2024-07-18 13:31:40 [ℹ]  1 task: { 1 task: { delete nodegroup "redacted-nodegroup" [async] } }
2024-07-18 13:31:40 [ℹ]  will delete stack "eksctl-redacted-nodegroup-redacted-nodegroup"
2024-07-18 13:31:40 [✔]  deleted 1 nodegroup(s) from cluster "redacted"

fnzwex commented 3 months ago

Test results after finding this and in an attempt to help:

0.176.0 - good until 1.30 - 1.30 is not supported and it refuses to work
0.177.0 - this issue
0.178.0 - this issue
0.179.0 - this issue
0.180.0 through 0.186.0 - untested by me but presumed bad since surrounded by bad
0.187.0 - this issue
0.188.0 - STILL this issue - 5 days old.

It'd be great if this could be addressed ASAP and released as 0.189.0 soon. Any chance of that?

Pretty bad that it got broken in the first place and even worse that it got left broken for such a long time.

Still a better way to manage clusters than Terraform/OpenTofu IMO, when it works properly (which NO version does for 1.30).

jarvisbot01 commented 1 month ago

eksctl version 0.190.0, EKS 1.30

2024-09-21 14:55:45 [✖] failed to acquire semaphore while waiting for all routines to finish: context canceled

AndrewFarley commented 1 month ago

Guys, graceful draining is my main selling point and favorite part of eksctl. Without it working, there's far less reason to use this tool. What this ends up doing, as shown below, is AGGRESSIVELY removing all the nodes when the ASG is removed, without draining them gracefully. This is fairly critical, folks, please fix it ASAP. If you can't fix it ASAP, eksctl should at least catch this condition and refuse to skip draining unless something like a --force-remove argument is passed. This is an outage-causing bug.

Versions

eksctl versions where this occurred for me: 0.183.0, then I upgraded to 0.191.0 and it happened on both. EKS 1.29 in both cases.

Debug output / proof

$ eksctl version
0.191.0
$ eksctl delete nodegroup --config-file=./dev.yaml  --include mynodes-ondemand-ue1b-v5 --approve
2024-10-11 14:58:58 [ℹ]  comparing 6 nodegroups defined in the given config ("./dev.yaml") against remote state
2024-10-11 14:58:58 [ℹ]  combined include rules: mynodes-ondemand-ue1b-v5
2024-10-11 14:58:58 [ℹ]  1 nodegroup (mynodes-ondemand-ue1b-v5) was included (based on the include/exclude rules)
2024-10-11 14:59:02 [ℹ]  will drain 1 nodegroup(s) in cluster "dev"
2024-10-11 14:59:02 [ℹ]  starting parallel draining, max in-flight of 1
2024-10-11 14:59:13 [✔]  drained all nodes: [ip-10-52-113-253.ec2.internal ip-10-52-101-145.ec2.internal ip-10-52-73-249.ec2.internal ip-10-52-125-52.ec2.internal]
2024-10-11 14:59:13 [✖]  failed to acquire semaphore while waiting for all routines to finish: context canceled
2024-10-11 14:59:13 [ℹ]  will delete 1 nodegroups from cluster "dev"
2024-10-11 14:59:17 [ℹ]  1 task: { 1 task: { delete nodegroup "mynodes-ondemand-ue1b-v5" [async] } }
2024-10-11 14:59:17 [ℹ]  will delete stack "eksctl-dev-nodegroup-mynodes-ondemand-ue1b-v5"
2024-10-11 14:59:17 [✔]  deleted 1 nodegroup(s) from cluster "dev"