Open chauncey-garrett opened 3 years ago
I think there are 2 issues here.
ISOLATED subnet, this is no longer an issue. I'll open a second ticket for this issue.
aws-auth manifest creation step (e.g., simplyesandboxClusterAwsAuthmanifestD46309E0). It simply hangs.
RE issue 2:
After the 2 lambda app stacks are created, the aws-auth manifest is created. In the lambda's logs I see that the manifest is successfully created and a SUCCESS response is submitted. However, the CF cluster stack never moves forward.
SUCCESS message)
Issue 1 is possibly related to these lines: https://github.com/aws/aws-cdk/blob/master/packages/@aws-cdk/aws-eks/lib/cluster.ts#L1445-L1448 and #12171.
Looking at your logs, it seems that this function was not called:
The log statement of line 34 is not printed and one of the log lines reads onEventReturned: null, when it should contain a PhysicalResourceId.
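For context on why a missing PhysicalResourceId matters, here is a minimal sketch of the response body that CloudFormation documents for custom resources. The helper name buildCfnResponse and the interface names are hypothetical, and this is not the CDK provider framework's own code; only the field names follow the documented custom resource response format.

```typescript
// Subset of the fields CloudFormation sends with a custom resource request.
interface CfnRequest {
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  ResponseURL: string; // presigned S3 URL the response body must be PUT to
}

// Response body CloudFormation waits for before the stack moves on.
interface CfnResponse {
  Status: 'SUCCESS' | 'FAILED';
  Reason?: string;
  PhysicalResourceId: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
}

// Hypothetical helper: builds the JSON body and refuses to build one
// without a PhysicalResourceId (e.g. when the handler returned null).
function buildCfnResponse(
  event: CfnRequest,
  status: 'SUCCESS' | 'FAILED',
  physicalResourceId: string | null | undefined,
  reason?: string,
): CfnResponse {
  if (!physicalResourceId) {
    throw new Error('PhysicalResourceId is missing from the handler result');
  }
  return {
    Status: status,
    ...(reason ? { Reason: reason } : {}),
    PhysicalResourceId: physicalResourceId,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
  };
}
```

If the body never reaches event.ResponseURL, CloudFormation keeps waiting until the resource times out, which would match the hang described in this thread.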
We need to spend more time investigating this to understand what's happening and find a solution. However, we can't work on this right now, so I'm marking this as a p2, since it doesn't seem to be affecting other people.
As always, we use +1s to help prioritize our work, and are happy to reevaluate this issue based on community feedback. You can reach out to the cdk.dev community on Slack to solicit support for reprioritization.
Adding another data point: I ran into this issue as well while adding a service account to an existing cluster:
const serviceAccount = cluster.addServiceAccount('cluster-autoscaler', {
  name: 'cluster-autoscaler',
  namespace: this.namespace,
});
I verified the service account was added to the cluster. There was a null response for the payload in the lambda similar to the OnEvent log (above).
This was with cdk v1.119.
I very recently had this issue too. I discovered that there was an S3 endpoint configured on the VPC and it was only allowing access to S3 buckets in that account. I had to add an allow for "arn:aws:s3:::cloudformation-custom-resource-response-useast2" in that S3 endpoint (I was running in us-east-2). I found the corresponding timeout in CloudWatch, and the documentation on prod-us-east-2-starport-layer-bucket helped me figure this out.
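A minimal sketch of the kind of statement that could be added to the S3 endpoint policy for this. The function names are hypothetical. Deriving the bucket name by stripping hyphens from the region is an assumption based on the us-east-2 example above, and the s3:PutObject action and the object-level /* suffix are assumptions about what the presigned response upload needs, so check them against your own endpoint policy.

```typescript
// Plain JSON shape of one endpoint policy statement (not a CDK type).
interface EndpointPolicyStatement {
  Effect: 'Allow';
  Principal: '*';
  Action: string[];
  Resource: string[];
}

// Assumption: the bucket name is the fixed prefix plus the region with
// hyphens removed, as in the us-east-2 example in the comment above.
function cfnResponseBucketArn(region: string): string {
  const compactRegion = region.replace(/-/g, '');
  return `arn:aws:s3:::cloudformation-custom-resource-response-${compactRegion}`;
}

// Hypothetical helper: statement that lets traffic through the endpoint
// upload custom resource responses to the CloudFormation-owned bucket.
function cfnResponseEndpointStatement(region: string): EndpointPolicyStatement {
  return {
    Effect: 'Allow',
    Principal: '*',
    Action: ['s3:PutObject'],
    Resource: [`${cfnResponseBucketArn(region)}/*`],
  };
}

console.log(JSON.stringify(cfnResponseEndpointStatement('us-east-2'), null, 2));
```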
I had the same problem and still don't know how to solve it. My stack hangs at aws-auth, and on top of that it takes about 30 minutes to reach that point. I think I will fall back to level 1 constructs. This high-level construct seemed very friendly, but I could never get it to work, even using a NAT gateway and private subnets with egress.
+1 Happening on CDK 2.58.1 with EKS v1.24/v1.23 as well.
Too many bugs to consider CDK for EKS stable. I think I will use eksctl.
I finally found the problem and managed to resolve the issue. In my case, I was using SubnetSelection with the Subnets attribute to select the subnets explicitly. However, after reading carefully, I found that all the subnets we select this way are treated as private subnets. This causes intermittent errors, since a public subnet that is also assigned to the lambda function does not give it internet connectivity, which results in a timeout. To make it worse, the timeout is 15 minutes plus retry attempts, and the delete process has the same issue, so it can take up to 3 hours before we can delete the stack.
It would be great if the SubnetSelection property Subnets could be renamed to PrivateSubnets to avoid this misunderstanding.
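A rough pre-deployment check along these lines can be sketched as follows. The SubnetInfo type, its defaultRoute field, and the function name are hypothetical and not part of the CDK API. The sketch relies on the fact that a VPC-attached Lambda function gets no public IP, so a subnet whose default route goes to an internet gateway does not give it egress; it ignores other egress paths such as NAT instances or VPC endpoints.

```typescript
// Hypothetical description of where a subnet's default route points.
type DefaultRoute = 'nat-gateway' | 'internet-gateway' | 'none';

interface SubnetInfo {
  subnetId: string;
  defaultRoute: DefaultRoute;
}

// Returns the IDs of subnets in which a VPC-attached Lambda function
// would have no internet egress: only a NAT gateway route counts here.
function subnetsWithoutLambdaEgress(subnets: SubnetInfo[]): string[] {
  return subnets
    .filter((subnet) => subnet.defaultRoute !== 'nat-gateway')
    .map((subnet) => subnet.subnetId);
}

// Example: a mixed selection like the one described above.
const selected: SubnetInfo[] = [
  { subnetId: 'subnet-private-a', defaultRoute: 'nat-gateway' },
  { subnetId: 'subnet-public-a', defaultRoute: 'internet-gateway' },
];
const risky = subnetsWithoutLambdaEgress(selected);
if (risky.length > 0) {
  console.warn(`Lambda handlers placed in these subnets may time out: ${risky.join(', ')}`);
}
```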
We're experiencing an issue where an EKS cluster is deployed, the ProviderframeworkisComplete lambda will report back SUCCESS, but the CF stack will not move further along and create the node group we've specified. It's as if CF doesn't get the SUCCESS response. The stack will eventually time out and roll back, leading to another issue where the OnEventHandler reports an error that it cannot delete the cluster.
Perhaps there's something simple I've missed but I have yet to see what the error is here.
Reproduction Steps
VPC:
Cluster:
What did you expect to happen?
EKS cluster creation.
What actually happened?
Timeout and rollback of the stack even though the cluster was created.
Environment
Other
OnEventHandler cluster delete log
This is :bug: Bug Report