aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(eks): EKS cluster is created but the stack times out #15608

Open chauncey-garrett opened 3 years ago

chauncey-garrett commented 3 years ago

We're experiencing an issue where the EKS cluster is deployed and the ProviderframeworkisComplete lambda reports back SUCCESS, but the CloudFormation stack never moves on to create the node group we've specified. It's as if CloudFormation doesn't receive the SUCCESS response. The stack eventually times out and rolls back, leading to another issue where the OnEventHandler reports an error that it cannot delete the cluster.

Perhaps there's something simple I've missed, but I have yet to see what the error is here.

Reproduction Steps

VPC:

    const vpc = new ec2.Vpc(this, 'VPC', {
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Ingress',
          subnetType: ec2.SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Application',
          subnetType: ec2.SubnetType.PRIVATE,
        },
        {
          cidrMask: 28,
          name: 'Database',
          subnetType: ec2.SubnetType.ISOLATED,
        }
      ]
    });

Cluster:

    const mastersRole = new iam.Role(this, 'MastersRole', {
      assumedBy: new iam.AccountRootPrincipal(),
    });

    // The IAM role that will be used by EKS
    const role = new iam.Role(this, 'ClusterRole', {
      assumedBy: new iam.ServicePrincipal('eks.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSClusterPolicy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSServicePolicy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSVPCResourceController'), // NOTE: Required for Security Groups for pods
      ],
    });

    // The EKS cluster, without worker nodes as we'll add them later
    const cluster = new eks.Cluster(this, 'SimplyECluster', {
      clusterName: `${product}-${environment}-cluster`,
      defaultCapacity: 0,
      mastersRole,
      outputClusterName: true,
      outputConfigCommand: true,
      outputMastersRoleArn: true,
      placeClusterHandlerInVpc: true,
      role,
      version: eks.KubernetesVersion.V1_20,
      vpc,
    });
    cluster.node.addDependency(mastersRole);
    cluster.node.addDependency(role);

    // Managed Worker Nodes
    //

    // Worker node IAM role
    const nodeRole = new iam.Role(this, 'NodeRole', {
      assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSVPCResourceController'), // Allows us to use Security Groups for pods
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSWorkerNodePolicy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKS_CNI_Policy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('ElasticLoadBalancingFullAccess'),
      ],
    });

    const nodeGroup = cluster.addNodegroupCapacity('NodeGroup', {
      subnets: vpc.selectSubnets({ subnetType: ec2.SubnetType.PRIVATE, }),
      nodeRole,
      maxSize: 20,
      minSize: 3,
    });
    nodeGroup.node.addDependency(nodeRole);

What did you expect to happen?

EKS cluster creation.

What actually happened?

Timeout and rollback of the stack even though the cluster was created.

Environment

Other

OnEventHandler cluster delete log

"2021-07-16T16":"23":45.692Z 7654cee8-a8e2-4072-b62f-6498e4498a5e ERROR Invoke Error{
   "errorType":"AccessDeniedException",
   "errorMessage":"User: arn:aws:sts::535241886961:assumed-role/simplye-dev-infra-Cluster-SimplyEClusterCreationRo-1EXTOZ12F2O0U/AWSCDK.EKSCluster.Delete.33864f3f-924a-493f-bd0f-06392c788668 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-east-2:535241886961:cluster/simplye-dev-infra-ClusterNestedStackClusterNestedStackResourceC524F2E7-1P3CRQCDZN4ZI-SimplyEClusterC241F8AE-1XO6PW9I5RD9U",
   "code":"AccessDeniedException",
   "message":"User: arn:aws:sts::535241886961:assumed-role/simplye-dev-infra-Cluster-SimplyEClusterCreationRo-1EXTOZ12F2O0U/AWSCDK.EKSCluster.Delete.33864f3f-924a-493f-bd0f-06392c788668 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-east-2:535241886961:cluster/simplye-dev-infra-ClusterNestedStackClusterNestedStackResourceC524F2E7-1P3CRQCDZN4ZI-SimplyEClusterC241F8AE-1XO6PW9I5RD9U",
   "time":"2021-07-16T16:23:45.673Z",
   "requestId":"4042285d-156e-4824-8f51-b29f5a67858a",
   "statusCode":403,
   "retryable":false,
   "retryDelay":3.510279726269361,
   "stack":[
      "AccessDeniedException: User: arn:aws:sts::535241886961:assumed-role/simplye-dev-infra-Cluster-SimplyEClusterCreationRo-1EXTOZ12F2O0U/AWSCDK.EKSCluster.Delete.33864f3f-924a-493f-bd0f-06392c788668 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-east-2:535241886961:cluster/simplye-dev-infra-ClusterNestedStackClusterNestedStackResourceC524F2E7-1P3CRQCDZN4ZI-SimplyEClusterC241F8AE-1XO6PW9I5RD9U",
      " at Object.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)",
      " at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/rest_json.js:55:8)",
      " at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
      " at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
      " at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)",
      " at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)",
      " at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)",
      " at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10",
      " at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)",
      " at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)"
   ]
}

This is a :bug: Bug Report

chauncey-garrett commented 3 years ago

I think there are two issues here.

  1. The first is that nodes fail to join the cluster with this VPC subnet configuration. When I remove the ISOLATED subnet, this is no longer an issue (see the sketch after this list). I'll open a second ticket for this issue.
  2. After removing the aforementioned subnet, nodes will join the cluster, but I run into an issue where CF cannot get past the aws-auth manifest creation step (e.g., simplyesandboxClusterAwsAuthmanifestD46309E0). It simply hangs.
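
For reference, a sketch of the subnet configuration I'm deploying now for issue 1: it is the same as the original snippet above, minus the ISOLATED entry.

    const vpc = new ec2.Vpc(this, 'VPC', {
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Ingress',
          subnetType: ec2.SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Application',
          subnetType: ec2.SubnetType.PRIVATE,
        },
        // The 'Database' ISOLATED subnet from the original snippet is removed;
        // with it in place, nodes fail to join the cluster.
      ]
    });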

RE issue 2:

After the two lambda app stacks are created, the aws-auth manifest is created. In the lambda's logs I see that the manifest is successfully created and a SUCCESS response is submitted. However, the CF cluster stack never moves forward.

OnEvent log (SUCCESS message)

[screenshot: CleanShot 2021-07-20 at 12 52 07]

Handler log (manifest creation)

[screenshot: CleanShot 2021-07-20 at 12 47 01]

chauncey-garrett commented 3 years ago

Issue 1 is possibly related to these lines: https://github.com/aws/aws-cdk/blob/master/packages/@aws-cdk/aws-eks/lib/cluster.ts#L1445-L1448 and #12171.

otaviomacedo commented 3 years ago

Looking at your logs, it seems that this function was not called:

https://github.com/aws/aws-cdk/blob/53e7622dd6e7ab7aed9d55292fabcc04f82668c2/packages/@aws-cdk/aws-eks/lib/cluster-resource-handler/cluster.ts#L33-L53

The log statement on line 34 is not printed, and one of the log lines reads onEventReturned: null when it should contain a PhysicalResourceId.
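
For context, here is a rough illustration of the provider framework contract (illustrative only, not a quote of cluster.ts): for a Create event, the onEvent handler is expected to resolve with an object carrying the cluster's physical resource ID.

    // Illustrative shape of what onEvent should return for a Create event;
    // the log above instead shows "onEventReturned: null".
    async function onEvent(event: { RequestType: string }) {
      if (event.RequestType === 'Create') {
        const clusterName = 'my-cluster'; // in the real handler this comes from the CreateCluster response
        return { PhysicalResourceId: clusterName };
      }
      return {};
    }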

We need to spend more time investigating this to understand what's happening and find a solution. However, we can't work on this right now, so I'm marking this as a p2, since it doesn't seem to be affecting other people.

As always, we use +1s to help prioritize our work, and are happy to re-evaluate this issue based on community feedback. You can reach out to the cdk.dev community on Slack to solicit support for reprioritization.

chauncey-garrett commented 3 years ago

Adding another data point: I ran into this issue as well while adding a service account to an existing cluster:

    const serviceAccount = cluster.addServiceAccount('cluster-autoscaler', {
      name: 'cluster-autoscaler',
      namespace: this.namespace,
    });

I verified the service account was added to the cluster. There was a null response for the payload in the lambda, similar to the OnEvent log above.

This was with cdk v1.119.

cjcooper commented 2 years ago

I very recently had this issue too. I discovered that there was an S3 endpoint configured on the VPC that only allowed access to S3 buckets in that account. Since I was running in us-east-2, I had to add an allow for "arn:aws:s3:::cloudformation-custom-resource-response-useast2" to that S3 endpoint's policy. Finding the corresponding timeout in CloudWatch, and the documentation on prod-us-east-2-starport-layer-bucket, helped me figure this out.
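
For anyone else hitting this, a rough sketch of what that policy addition can look like in CDK (assuming the S3 gateway endpoint is created in the same stack; the bucket name is region-specific, us-east-2 here):

    // The custom resource handlers upload their response to a CloudFormation-owned
    // bucket, so the VPC's S3 gateway endpoint must allow that bucket too.
    const s3Endpoint = vpc.addGatewayEndpoint('S3Endpoint', {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });

    s3Endpoint.addToPolicy(new iam.PolicyStatement({
      principals: [new iam.AnyPrincipal()],
      actions: ['s3:PutObject'],
      resources: ['arn:aws:s3:::cloudformation-custom-resource-response-useast2/*'],
    }));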

martinKindall commented 1 year ago

I had the same problem and still don't know how to solve it. My stack hangs at aws-auth, and on top of that it takes about 30 minutes to reach that point. I think I will fall back to level 1 constructs; this high-level construct seemed very friendly, but I never could get it to work, even using a NAT gateway and private subnets with egress.

dimmyshu commented 1 year ago

+1 Happening on CDK 2.58.1 with EKS v1.24/v1.23 as well.

There are too many bugs to consider CDK for EKS stable; I think I will use eksctl.

dimmyshu commented 1 year ago

Finally I found the problem and managed to resolve the issue. In my case, I was using SubnetSelection with the subnets attribute to explicitly select the subnets; however, after reading the docs carefully, it turns out that every subnet selected this way is treated as a private subnet. This causes intermittent errors, because a public subnet assigned to the Lambda function has no internet connectivity, which results in a timeout. Making it worse, the timeout is 15 minutes plus retry attempts, and the delete process hits the same issue, so it can take up to 3 hours before the stack can be deleted.

It would be great if the SubnetSelection property subnets could be renamed to privateSubnets to avoid this misunderstanding.
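
A rough sketch of the safer selection (CDK v2 naming; the construct ID and Kubernetes version here are illustrative):

    // Selecting private subnets by type avoids accidentally handing the
    // handler Lambda a public subnet; subnets passed explicitly via the
    // `subnets` attribute are all treated as private.
    const cluster = new eks.Cluster(this, 'Cluster', {
      version: eks.KubernetesVersion.V1_24,
      vpc,
      vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
      placeClusterHandlerInVpc: true,
    });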