banzaicloud / pipeline

Banzai Cloud Pipeline is a solution-oriented application platform which allows enterprises to develop, deploy and securely scale container-based applications in multi- and hybrid-cloud environments.
https://banzaicloud.com/products/pipeline/
Apache License 2.0

EKS cluster whose creation failed (before master node created) fails deletion at first try, succeeds on second #3316

Open pregnor opened 3 years ago

pregnor commented 3 years ago

Describe the bug

When EKS cluster creation fails before the control plane has been created, the first deletion attempt fails with "secret not found" while trying to delete the node pool label set. A subsequent deletion attempt succeeds without leaving any resources behind.

Steps to reproduce the issue:

  1. Try creating an EKS cluster with insufficient privileges for cluster role creation, envelope encryption, or possibly VPC creation.
  2. Observe that the creation fails before the control plane is created.
  3. Try deleting the failed cluster.
  4. Observe that the deletion fails with "secret not found" during node pool label set deletion.
  5. Try deleting the failed cluster again.
  6. Observe that the deletion succeeds.
  7. Check that the AWS resources for the cluster have been completely removed.

Expected behavior

The first deletion attempt should succeed as the second does.

Additional context

My guess is that the node pool label set deletion runs unconditionally early in the deletion process, even when the node pool label set operator was never installed.
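To illustrate the guess above, here is a minimal sketch of what a conditional version of that step could look like. It is not Pipeline's actual workflow code: `clusterState`, `ConfigSecretID`, `deletionSteps` and the `delete-infrastructure` step name are hypothetical, and only the `delete-node-pool-label-set` name is taken from the log below. The assumption is that a cluster which failed before the control plane came up never had a kubeconfig secret stored.

```go
package main

import "fmt"

// clusterState is a hypothetical summary of what cluster creation managed to set up.
type clusterState struct {
	// ConfigSecretID is assumed to be empty when no kubeconfig secret was ever stored,
	// e.g. because creation failed before the control plane existed.
	ConfigSecretID string
}

// shouldDeleteNodePoolLabelSets reports whether the label set cleanup has anything to work with.
func shouldDeleteNodePoolLabelSets(s clusterState) bool {
	return s.ConfigSecretID != ""
}

// deletionSteps returns the ordered step names for deleting a cluster,
// skipping the label set cleanup when the cluster was never reachable.
func deletionSteps(s clusterState) []string {
	steps := []string{}
	if shouldDeleteNodePoolLabelSets(s) {
		steps = append(steps, "delete-node-pool-label-set")
	}
	// Hypothetical placeholder for the remaining cleanup (stacks, IAM, VPC, ...).
	steps = append(steps, "delete-infrastructure")
	return steps
}

func main() {
	fmt.Println(deletionSteps(clusterState{}))
	fmt.Println(deletionSteps(clusterState{ConfigSecretID: "abc"}))
}
```

With such a guard, the first deletion attempt of a half-created cluster would go straight to infrastructure cleanup instead of trying to build a Kubernetes client from a secret that does not exist.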

Error:

ERRO[2096] Activity error.                               
ActivityType=delete-node-pool-label-set 
Domain=pipeline 
RunID=5b16698d-fb2d-4d74-934a-de1160bd68e2 
TaskList=pipeline 
WorkerID=42495@-@pipeline 
WorkflowID=2a8a5d23-ff29-4fae-93c5-0293da27dbfd_2 
application=pipeline.worker 
component=cadence-worker 
environment=production 
error="secret not found" 
errorVerbose="secret not found
github.com/banzaicloud/pipeline/internal/common/commonadapter.(*SecretStore).GetSecretValues
    /Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/common/commonadapter/secret.go:87
github.com/banzaicloud/pipeline/internal/kubernetes.DefaultConfigFactory.FromSecret
    /Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/kubernetes/config_factory.go:55
github.com/banzaicloud/pipeline/internal/kubernetes.DynamicClientFactory.FromSecret
    /Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/kubernetes/client_factory.go:64
github.com/banzaicloud/pipeline/internal/cluster.DynamicClientFactory.FromClusterID
    /Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/cluster/client_factory.go:108
github.com/banzaicloud/pipeline/internal/cluster/clusterworkflow.DeleteNodePoolLabelSetActivity.Execute
    /Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/cluster/clusterworkflow/delete_node_pool_label_set_activity.go:48
reflect.Value.call
    /usr/local/Cellar/go/1.15.5/libexec/src/reflect/value.go:476
reflect.Value.Call
    /usr/local/Cellar/go/1.15.5/libexec/src/reflect/value.go:337
go.uber.org/cadence/internal.(*activityExecutor).Execute
    /Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_worker.go:710
go.uber.org/cadence/internal.(*activityTaskHandlerImpl).Execute
    /Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_task_handlers.go:1820
go.uber.org/cadence/internal.(*activityTaskPoller).ProcessTask
    /Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_task_pollers.go:886
go.uber.org/cadence/internal.(*baseWorker).processTask
    /Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_worker_base.go:321
runtime.goexit
    /usr/local/Cellar/go/1.15.5/libexec/src/runtime/asm_amd64.s:1374"

pregnor commented 3 years ago

An easier reproduction:

diff --git a/templates/eks/amazon-eks-iam-cf.yaml b/templates/eks/amazon-eks-iam-cf.yaml
index a797c34cd..b6330f938 100644
--- a/templates/eks/amazon-eks-iam-cf.yaml
+++ b/templates/eks/amazon-eks-iam-cf.yaml
@@ -2,6 +2,10 @@ AWSTemplateFormatVersion: '2010-09-09'
 Description: 'Amazon EKS IAM'

 Parameters:
+  BreakMe:
+    Type: String
+    Description: reproduction requirement.
+
   ClusterName:
     Type: String
     Description: The name of the EKS cluster.

janosSarusiKis commented 3 years ago

DeleteNodePoolLabelSetActivity is the activity that fails on the first deletion attempt. On the second attempt this activity is not triggered, which may be why the second deletion succeeds. It is possible that cluster creation fails before the node pool label sets are created, so the deletion fails when it tries to delete a nonexistent resource.