aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

EbsCsiDriverAddon: Waiter has timed out #894

Open dedrone-fb opened 8 months ago

dedrone-fb commented 8 months ago

Describe the bug

We are trying to deploy an EKS Blueprint with the EBS CSI AddOn. We reproducibly run into this error message:

10:56:04 AM | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | eks-stack/ebs-csi-...e/Resource/Default
Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"}
at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30)
at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async defaultInvokeFunction (/var/task/outbound.js:1:875)
at async invokeUserFunction (/var/task/framework.js:1:2192)
at async onEvent (/var/task/framework.js:1:369)
at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 3b206e15-a3df-4a4e-b222-b58893c77dd5)
10:56:06 AM | UPDATE_ROLLBACK_IN_P | AWS::CloudFormation::Stack            | eks-stack
The following resource(s) failed to create: [eksstackAwsAuthmanifest65E07027, eksstackebscsicontrollersasamanifestebscsicontrollersasaServiceAccountResource71971128].
10:56:14 AM | UPDATE_ROLLBACK_COMP | AWS::CloudFormation::Stack            | eks-stack

Expected Behavior

The EBS CSI AddOn is successfully added to the cluster being created.

Current Behavior

Rollback initiated

Reproduction Steps

        const addOns = [
            new eksblueprints.addons.CalicoOperatorAddOn(),
            new eksblueprints.addons.MetricsServerAddOn(),
            new eksblueprints.addons.ClusterAutoScalerAddOn(),
            new eksblueprints.addons.AwsLoadBalancerControllerAddOn(),
            new eksblueprints.addons.VpcCniAddOn(),
            new eksblueprints.addons.CoreDnsAddOn(),
            new eksblueprints.addons.KubeProxyAddOn(),
            new eksblueprints.addons.EbsCsiDriverAddOn()
        ];

        const clusterProvider = new eksblueprints.MngClusterProvider({
            version: props.version,
            minSize: props.minSize,
            maxSize: props.maxSize,
            instanceTypes: props.instanceTypes.map(s => new InstanceType(s)),
        });

        const eksBlueprint = eksblueprints.EksBlueprint.builder()
            .account(props.env!.account!)
            .region(props.env!.region!)
            .addOns(...addOns)
            .version(props.version)
            .useDefaultSecretEncryption(props.useDefaultSecretEncryption)
            .clusterProvider(clusterProvider)
            .name(props.clusterName)
            .build(app, id);

        this.blueprint = eksBlueprint;
        this.cluster = eksBlueprint.getClusterInfo().cluster;
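
The props object is not shown above; for context, the fields it references have roughly this shape (a hypothetical sketch inferred from the usage in the snippet, not the actual definition from our code base):

import * as cdk from "aws-cdk-lib";
import * as eks from "aws-cdk-lib/aws-eks";

// Hypothetical props shape inferred from the reproduction snippet; not the real definition.
interface EksStackProps extends cdk.StackProps {
    clusterName: string;
    version: eks.KubernetesVersion;      // passed to both the cluster provider and the builder
    minSize: number;
    maxSize: number;
    instanceTypes: string[];             // mapped to ec2.InstanceType instances above
    useDefaultSecretEncryption: boolean;
}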

Possible Solution

No response

Additional Information/Context

Looked at and tried https://github.com/aws-samples/stable-diffusion-on-eks/pull/5 - but no luck

CDK CLI Version

2.115.0 (build 58027ee)

EKS Blueprints Version

1.13.1

Node.js Version

v18.16.0

Environment details (OS name and version, etc.)

Ubuntu Linux 22.04

Other information

No response

shapirov103 commented 8 months ago

@dedrone-fb Do you have worker nodes running? I ask because it is unclear what EC2 instance types you passed to your cluster and whether they were provisioned.

You can run cdk deploy <your-blueprint-name> --no-rollback to check the cluster state if provisioning fails; it prevents rollback and cleanup of resources.

Another possible reason is insufficient capacity. I assume the cluster autoscaler should address it (it is in your list), but it may take longer than expected to roll out a new node and hence cause the timeout.

Please also share your props object: minSize, cluster version.
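
If capacity turns out to be the issue, pinning the node group size explicitly (a sketch below; the version and sizing values are illustrative, not a recommendation) makes it easy to rule out:

import * as blueprints from "@aws-quickstart/eks-blueprints";
import * as eks from "aws-cdk-lib/aws-eks";
import { InstanceType } from "aws-cdk-lib/aws-ec2";

// Illustrative values only; adjust to your account limits and workload.
const clusterProvider = new blueprints.MngClusterProvider({
    version: eks.KubernetesVersion.V1_28,
    minSize: 2,
    maxSize: 4,
    instanceTypes: [new InstanceType("m5.large")],
});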

shapirov103 commented 8 months ago

The following blueprint provisioned fine:

const addOns = [
    new blueprints.addons.CalicoOperatorAddOn(),
    new blueprints.addons.MetricsServerAddOn(),
    new blueprints.addons.ClusterAutoScalerAddOn(),
    new blueprints.addons.AwsLoadBalancerControllerAddOn(),
    new blueprints.addons.VpcCniAddOn(),
    new blueprints.addons.CoreDnsAddOn(),
    new blueprints.addons.KubeProxyAddOn(),
    new blueprints.addons.EbsCsiDriverAddOn()
];

const clusterProvider = new blueprints.MngClusterProvider();

const eksBlueprint = blueprints.EksBlueprint.builder()
    .addOns(...addOns)
    .region("us-east-1")
    .version("auto")
    .useDefaultSecretEncryption(true)
    .clusterProvider(clusterProvider)
    .name("reprod-case-ebs")
    .build(app, "reprod-case-ebs");

dedrone-fb commented 8 months ago

I'd like to put this on hold. We currently suspect some kind of permission or quota problems. Removing any two addons seems to fix the problem (we tried with EBS CSI but without Calico and Metrics and it worked).

Will report back

hshepherd commented 7 months ago

I am seeing a similar issue with the following config

        const addOns: Array<blueprints.ClusterAddOn> = [
            new blueprints.addons.SecretsStoreAddOn({
                rotationPollInterval: '120s',
                syncSecrets: true
            }),
            argoAddon,
            new blueprints.addons.CalicoOperatorAddOn(),
            new blueprints.addons.MetricsServerAddOn(),
            new blueprints.addons.ClusterAutoScalerAddOn(),
            new blueprints.addons.AwsLoadBalancerControllerAddOn(),
            new blueprints.addons.VpcCniAddOn(),
            new blueprints.addons.CoreDnsAddOn(),
            new blueprints.addons.KubeProxyAddOn(),
            new blueprints.addons.OpaGatekeeperAddOn(),
        ];

        const stack = blueprints.EksBlueprint.builder()
            .account(account)
            .region(region)
            .version('auto')
            .addOns(...addOns)
            .useDefaultSecretEncryption(true)
            .enableControlPlaneLogTypes(blueprints.ControlPlaneLogType.AUDIT)
            .enableGitOps(blueprints.GitOpsMode.APPLICATION)
            .teams(new TeamPlatform(props.gitops.platformTeamUserRoleArn), new TeamDeveloper(props.gitops.developerTeamUserRoleArn))
            .build(app, id + '-eks-bps', { env: props.env });

[screenshot attached]

Is this possibly related to https://github.com/aws/aws-cdk/issues/26838?

Update: Also tried without GitOps enabled and seeing the same issue.

Update: I can see the following error in CloudTrail around the time of the cdk deploy failure:

    "eventTime": "2024-01-30T16:19:34Z",
    "eventSource": "iam.amazonaws.com",
    "eventName": "GetRolePolicy",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "NoSuchEntityException",
    "errorMessage": "The role policy with name ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133 cannot be found.",
    "requestParameters": {
        "roleName": "workloadsdevelopmentworkl-ProviderframeworkonEventS-ERHAR0IF0eVi",
        "policyName": "ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133"
    },

hshepherd commented 7 months ago

Updating as I've found the root cause for our timeout:

For us at least, this appears to be caused by Lambda Concurrency Limits in a new AWS account. The underlying EKS construct spins up many Lambdas as part of the KubectlProvider implementation. As CDK does the deploy, it waits for these lambdas to apply kubectl commands in the new cluster.

In our case, a new AWS account had a Concurrent Executions limit of 10, which was not high enough for the blueprint deploy and resulted in these Lambda requests being throttled (i.e. canceled with no error).

This problem is probably exacerbated if you are installing multiple Addons.

[screenshot attached]
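
If you want to check whether an account is constrained the same way, the account-level limit can be read with the AWS SDK for JavaScript v3 (a quick sketch, not part of the blueprint itself):

import { LambdaClient, GetAccountSettingsCommand } from "@aws-sdk/client-lambda";

// Prints the account-level Lambda concurrency limits for a region.
// New accounts may report a ConcurrentExecutions limit as low as 10,
// which can throttle the kubectl provider functions during a blueprint deploy.
async function checkLambdaConcurrency(region: string): Promise<void> {
    const client = new LambdaClient({ region });
    const settings = await client.send(new GetAccountSettingsCommand({}));
    console.log("ConcurrentExecutions limit:", settings.AccountLimit?.ConcurrentExecutions);
    console.log("UnreservedConcurrentExecutions:", settings.AccountLimit?.UnreservedConcurrentExecutions);
}

checkLambdaConcurrency("us-east-1").catch(console.error);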

This does not appear to be an issue with cdk-eks-blueprints, but I am posting here for awareness. FYI @shapirov103

shapirov103 commented 7 months ago

@hshepherd thank you for this insight, it would have been very hard for us to reproduce. The custom resource lambda is created to use all unreserved capacity. Hypothetically, if all addons are executed serially, the issue should be mitigated as long as you have at least some concurrency available (kubectl commands will go one at a time, but other lambda functions may still interfere). You can try defining strictly ordered behavior for all addons, e.g.

import "reflect-metadata";

Reflect.defineMetadata("ordered", true, addons. EbsCsiDriverAddOn); // repeat for all addons
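
To apply this to every add-on in the list, a loop over the add-on classes works as well (a sketch, assuming the same imports as above):

const orderedAddOns = [
    blueprints.addons.CalicoOperatorAddOn,
    blueprints.addons.MetricsServerAddOn,
    blueprints.addons.ClusterAutoScalerAddOn,
    blueprints.addons.AwsLoadBalancerControllerAddOn,
    blueprints.addons.VpcCniAddOn,
    blueprints.addons.CoreDnsAddOn,
    blueprints.addons.KubeProxyAddOn,
    blueprints.addons.EbsCsiDriverAddOn,
];

// Mark each add-on class as "ordered" so its kubectl resources are applied one at a time.
for (const addOnClass of orderedAddOns) {
    Reflect.defineMetadata("ordered", true, addOnClass);
}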

This is more of an experimental feature tbh.