aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

GenericClusterProvider: CDK deployment freezes during aws-auth manifest resource creation #910

Open sdpoueme opened 7 months ago

sdpoueme commented 7 months ago

Describe the bug

The deployment of a generic cluster built with the blueprint never completes and eventually times out. CDK then rolls back the stack after the failure.


Expected Behavior

The CDK deployment should complete and exit successfully after the deployment of the aws-auth manifest.

Current Behavior

The deployment does not exit. CDK eventually fails with a timeout and rolls back the stack; the rollback also fails, leaving the stack in a failed state. The stack has to be deleted before another deployment can be attempted. The timeout happens while the stack is deploying the aws-auth manifest.

Reproduction Steps

  1. Create a generic cluster
  2. Build the cluster using the blueprint
  3. Wait for the deployment to succeed

Possible Solution

Provide an option to skip the deployment of the aws-auth manifest.

Additional Information/Context

No response

CDK CLI Version

2.115.0

EKS Blueprints Version

1.13.1

Node.js Version

18.0.0

Environment details (OS name and version, etc.)

Mac OS 14.2.1 (23C71)

Other information

The issue occurs only when creating a generic cluster. It does not happen with the ASG provider or the Managed Nodes provider.

sdpoueme commented 7 months ago

The only workaround I have found so far is to set the endpointAccess attribute to EndpointAccess.PUBLIC and add a call to .teams() in the blueprint builder.
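
For readers who want to try this, here is a minimal sketch of what the workaround might look like. It is not the blueprint from this issue. The Kubernetes version, the node group settings, the team name, and the stack id are all placeholders chosen for illustration, and the custom VPC is left out.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

// Generic cluster provider with the API endpoint set to public,
// as described in the workaround above.
const clusterProvider = new blueprints.GenericClusterProvider({
  version: eks.KubernetesVersion.V1_28, // placeholder version
  endpointAccess: eks.EndpointAccess.PUBLIC,
  managedNodeGroups: [
    {
      id: 'mng-default', // hypothetical node group id
      instanceTypes: [new ec2.InstanceType('m5.large')],
      minSize: 1,
      maxSize: 3,
    },
  ],
});

// Hypothetical platform team. A real setup would likely also pass
// `users` or `userRoleArn` so that someone is mapped into the cluster.
const platformTeam = new blueprints.PlatformTeam({
  name: 'platform',
});

blueprints.EksBlueprint.builder()
  .account(process.env.CDK_DEFAULT_ACCOUNT)
  .region(process.env.CDK_DEFAULT_REGION)
  .clusterProvider(clusterProvider)
  .teams(platformTeam) // the second half of the workaround
  .build(app, 'generic-cluster-workaround'); // hypothetical stack id
```

Whether both changes are needed, or only one of them, is not established here, so treat this as a starting point for experimentation rather than a confirmed fix.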

shapirov103 commented 7 months ago

We test the above steps for every release as well as individually. Is there anything specific to your blueprint that could be causing the issue? Was a custom VPC used? Maybe you can share the blueprint so that we can use it to reproduce the problem. @sdpoueme

sdpoueme commented 6 months ago

Hi @shapirov103, sorry for the delayed answer. It seems that the missing call to .teams() is causing the issue. The VPC used is custom. Here is the blueprint: https://github.com/sdpoueme/edge_diffusion_on_eks/tree/master/infra-build