aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

cdk deploy: Waiter times out on clusterautoscaler #856

Open bconner22 opened 11 months ago

bconner22 commented 11 months ago

Describe the bug

Following this link.
I did this yesterday afternoon and again this morning; the stack failed the same way both times.

From the CloudFormation console:

2023-10-10 09:36:50 UTC-0500 eksblueprintblueprintsaddonclusterautoscalersamanifestblueprintsaddonclusterautoscalersaServiceAccountResource72D82586
CREATE_FAILED Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573)

From my CLI:
Do you wish to deploy these changes (y/n)? y
eks-blueprint: deploying... [1/1]
eks-blueprint: creating CloudFormation changeset...
[█████████████████████████████████▎························] (46/80)

9:36:50 AM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | eks-blueprint/blue...e/Resource/Default Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":
9:36:50 AM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | eksblueprintbluepr...ntResource72D82586 Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 9c36b5b4-88cb-45af-b4cb-1f1056a35886)

❌ eks-blueprint failed: Error: The stack named eks-blueprint failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 9c36b5b4-88cb-45af-b4cb-1f1056a35886) at FullCloudFormationDeployment.monitorDeployment (/usr/local/lib/node_modules/aws-cdk/lib/index.js:467:10232) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async Object.deployStack2 [as deployStack] (/usr/local/lib/node_modules/aws-cdk/lib/index.js:470:179911) at async /usr/local/lib/node_modules/aws-cdk/lib/index.js:470:163159

❌ Deployment failed: Error: The stack named eks-blueprint failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 9c36b5b4-88cb-45af-b4cb-1f1056a35886) at FullCloudFormationDeployment.monitorDeployment (/usr/local/lib/node_modules/aws-cdk/lib/index.js:467:10232) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async Object.deployStack2 [as deployStack] (/usr/local/lib/node_modules/aws-cdk/lib/index.js:470:179911) at async /usr/local/lib/node_modules/aws-cdk/lib/index.js:470:163159

The stack named eks-blueprint failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 9c36b5b4-88cb-45af-b4cb-1f1056a35886)

Expected Behavior

The cluster and add-ons deploy successfully.

Current Behavior

The deployment fails with the errors shown above.

Reproduction Steps

Follow https://aws-quickstart.github.io/cdk-eks-blueprints/getting-started/

Possible Solution

Does the waiter need to wait longer?
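For context, the TimeoutError in the traces above is raised by a polling waiter (waitUntilFunctionActiveV2). The following is a minimal sketch of that pattern, not the AWS SDK's implementation: the getState callback, the attempt budget, and the "Pending" state name are all hypothetical stand-ins. It illustrates that a longer wait only helps if the function eventually becomes Active.

```typescript
// Hypothetical model of a polling waiter; not the AWS SDK's actual code.
type WaiterResult =
  | { state: "SUCCESS"; attempts: number }
  | { state: "TIMEOUT"; reason: string; attempts: number };

// Polls getState up to maxAttempts times and stops as soon as it sees "Active".
function waitUntilActive(getState: () => string, maxAttempts: number): WaiterResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (getState() === "Active") {
      return { state: "SUCCESS", attempts: attempt };
    }
    // A real waiter would sleep (typically with backoff) before the next poll.
  }
  return { state: "TIMEOUT", reason: "Waiter has timed out", attempts: maxAttempts };
}

// Simulates a function that reports "Active" from the given poll onward.
function becomesActiveAfter(polls: number): () => string {
  let seen = 0;
  return () => (++seen >= polls ? "Active" : "Pending");
}

// Becomes Active on the third poll, so a budget of five polls is enough.
const slow = waitUntilActive(becomesActiveAfter(3), 5);

// Never becomes Active, so any finite budget ends in a timeout.
const stuck = waitUntilActive(() => "Pending", 5);
```

In this model, raising the budget turns a slow start into a success, but a function that never leaves its pending state times out regardless, which is why the cause of the stall matters more than the timeout value.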

Additional Information/Context

I'm in an AWS Organizations management account, using an IAM user, but otherwise the account is empty. The Lambda functions did appear to deploy correctly, and both they and the EKS cluster were in us-east-1. I also ran cdk bootstrap aws://<MY_ACCOUNT_NUMBER>/us-east-1, as I saw someone ask to confirm that on a similar issue.

CDK CLI Version

2.99.1 (build b2a895e)

EKS Blueprints Version

1.12.0

Node.js Version

v20.8.0

Environment details (OS name and version, etc.)

OSX on Intel chip

Other information

No response

AsimPoptani commented 11 months ago

Open your stack in CloudFormation in the AWS web console and look for anything that has failed. What reason does it give? I have not come across this exact issue, but from experience this feels like an IAM permission issue for your account.

bconner22 commented 11 months ago

Hey Asim, thanks for the insight. The AWS web interface had the following in CloudFormation for the error:

2023-10-10 09:36:50 UTC-0500 eksblueprintblueprintsaddonclusterautoscalersamanifestblueprintsaddonclusterautoscalersaServiceAccountResource72D82586
CREATE_FAILED
Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573)

The user I'm using from the CLI is an admin user; I believe the only restriction is that it cannot see billing. The module does of course spin up many IAM roles of its own. Are you thinking that it might be one of those?

AsimPoptani commented 11 months ago

Hmm, it does not look like a permissions issue then, if you are using admin. The only thing that I think may help in your case is to delete the stack completely and try again. This may involve deleting some resources manually. Otherwise, I am not sure what the issue could be. Sorry that I cannot be of more help.

elamaran11 commented 11 months ago

@bconner22 I would recommend doing a full cleanup and running the deployment again. I would assume this is a temporary, one-time issue. Please keep us posted.

hshepherd commented 7 months ago

Crossposting as I believe these two issues are related: https://github.com/aws-quickstart/cdk-eks-blueprints/issues/894#issuecomment-1921585477

shapirov103 commented 7 months ago

@bconner22 as stated in #894, the per-account Lambda concurrent executions service quota may be the issue. Another possible root cause is that the default quota of 1000 is exhausted by other Lambda functions deployed in the same account (this could be sporadic).
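A rough way to tell these two cases apart is sketched below. It assumes you have read the account quota from AccountLimit.ConcurrentExecutions in the output of aws lambda get-account-settings, and the peak usage from the ConcurrentExecutions metric in the AWS/Lambda CloudWatch namespace. The diagnoseConcurrency helper and the sample numbers are hypothetical, for illustration only.

```typescript
// Hypothetical helper: classifies Lambda concurrency headroom from two numbers.
// quota          - the account's concurrent executions quota
// peakConcurrent - the highest concurrency observed around the failed deploy
// defaultQuota   - the default quota of 1000 mentioned in the comment above
function diagnoseConcurrency(
  quota: number,
  peakConcurrent: number,
  defaultQuota = 1000
): string[] {
  const findings: string[] = [];
  if (quota < defaultQuota) {
    // Case 1: the account's quota itself is lower than the default.
    findings.push(`account quota ${quota} is below the default of ${defaultQuota}`);
  }
  if (peakConcurrent >= quota) {
    // Case 2: other functions in the account are using up the quota.
    findings.push(`observed peak ${peakConcurrent} reaches the quota of ${quota}`);
  }
  if (findings.length === 0) {
    findings.push("no concurrency pressure visible from these numbers");
  }
  return findings;
}

// Sample inputs (made up): a reduced quota, an exhausted default quota,
// and an account with plenty of headroom.
const reducedQuota = diagnoseConcurrency(10, 2);
const exhausted = diagnoseConcurrency(1000, 1000);
const healthy = diagnoseConcurrency(1000, 40);
```

If the first finding applies, a quota increase request is the likely remedy; if the second applies, the failure may come and go with the load from other functions, which would match the sporadic behavior described above.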