dedrone-fb opened 8 months ago
@dedrone-fb Do you have worker nodes running? The reason I ask is that it's unclear what kind of EC2 instance types you configured for your cluster and whether they were actually provisioned.
You can run `cdk deploy <your-blueprint-name> --no-rollback` to check the cluster state if provisioning fails; the `--no-rollback` flag prevents rollback and cleanup of resources so you can inspect what was created.
Another possible reason is insufficient capacity. I assume Cluster Autoscaler should address it (it is in your list), but it may take longer than expected to roll out a new node and hence result in the timeout.
Please also share your props object (e.g. minSize, cluster version).
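For reference, the kind of props object worth sharing might look like the sketch below. This is illustrative only — the option names follow the `MngClusterProvider` API as I understand it, and all values are hypothetical placeholders, not a recommendation:

```ts
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as blueprints from "@aws-quickstart/eks-blueprints";

// Hypothetical values -- replace with your actual configuration.
const clusterProvider = new blueprints.MngClusterProvider({
  minSize: 1,
  maxSize: 3,
  desiredSize: 2,
  instanceTypes: [ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE)],
});
```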
The following blueprint provisioned fine:
```ts
const addOns = [
  new blueprints.addons.CalicoOperatorAddOn(),
  new blueprints.addons.MetricsServerAddOn(),
  new blueprints.addons.ClusterAutoScalerAddOn(),
  new blueprints.addons.AwsLoadBalancerControllerAddOn(),
  new blueprints.addons.VpcCniAddOn(),
  new blueprints.addons.CoreDnsAddOn(),
  new blueprints.addons.KubeProxyAddOn(),
  new blueprints.addons.EbsCsiDriverAddOn()
];

const clusterProvider = new blueprints.MngClusterProvider();

const eksBlueprint = blueprints.EksBlueprint.builder()
  .addOns(...addOns)
  .region("us-east-1")
  .version("auto")
  .useDefaultSecretEncryption(true)
  .clusterProvider(clusterProvider)
  .name("reprod-case-ebs")
  .build(app, "reprod-case-ebs");
```
I'd like to put this on hold. We currently suspect some kind of permission or quota problem. Removing any two add-ons seems to fix it (we tried with EBS CSI but without Calico and Metrics Server, and it worked).
Will report back.
I am seeing a similar issue with the following config:

```ts
const addOns: Array<blueprints.ClusterAddOn> = [
  new blueprints.addons.SecretsStoreAddOn({
    rotationPollInterval: '120s',
    syncSecrets: true
  }),
  argoAddon,
  new blueprints.addons.CalicoOperatorAddOn(),
  new blueprints.addons.MetricsServerAddOn(),
  new blueprints.addons.ClusterAutoScalerAddOn(),
  new blueprints.addons.AwsLoadBalancerControllerAddOn(),
  new blueprints.addons.VpcCniAddOn(),
  new blueprints.addons.CoreDnsAddOn(),
  new blueprints.addons.KubeProxyAddOn(),
  new blueprints.addons.OpaGatekeeperAddOn(),
];

const stack = blueprints.EksBlueprint.builder()
  .account(account)
  .region(region)
  .version('auto')
  .addOns(...addOns)
  .useDefaultSecretEncryption(true)
  .enableControlPlaneLogTypes(blueprints.ControlPlaneLogType.AUDIT)
  .enableGitOps(blueprints.GitOpsMode.APPLICATION)
  .teams(
    new TeamPlatform(props.gitops.platformTeamUserRoleArn),
    new TeamDeveloper(props.gitops.developerTeamUserRoleArn)
  )
  .build(app, id + '-eks-bps', { env: props.env });
```
Is this possibly related to https://github.com/aws/aws-cdk/issues/26838?
Update: Also tried without GitOps enabled and seeing the same issue.
Update: I can see the following error in CloudTrail around the time of the `cdk deploy` failure:

```json
"eventTime": "2024-01-30T16:19:34Z",
"eventSource": "iam.amazonaws.com",
"eventName": "GetRolePolicy",
"awsRegion": "us-east-1",
"sourceIPAddress": "cloudformation.amazonaws.com",
"userAgent": "cloudformation.amazonaws.com",
"errorCode": "NoSuchEntityException",
"errorMessage": "The role policy with name ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133 cannot be found.",
"requestParameters": {
  "roleName": "workloadsdevelopmentworkl-ProviderframeworkonEventS-ERHAR0IF0eVi",
  "policyName": "ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133"
},
```
Updating as I've found the root cause for our timeout:
For us at least, this appears to be caused by Lambda concurrency limits in a new AWS account. The underlying EKS construct spins up many Lambdas as part of the KubectlProvider implementation. As CDK does the deploy, it waits for these Lambdas to apply kubectl commands to the new cluster.
In our case, a new AWS account had a Concurrent Executions limit of 10 -- which is not high enough for the blueprint deploy and resulted in these Lambda requests being throttled (i.e. canceled with no error).
This problem is probably exacerbated if you are installing multiple Addons.
This does not appear to be an issue with cdk-eks-blueprints itself, but I am posting here for awareness.
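To illustrate the failure mode described above, here is a toy model in TypeScript. It is not AWS code — just a sketch of how a hard concurrency cap silently drops invocations once the in-flight count reaches the quota (10 in our new account), the way the throttled kubectl-provider Lambdas were described:

```ts
// Toy model: a hard concurrency cap that drops invocations over the limit,
// mirroring the "canceled with no error" behavior described above.
class ConcurrencyLimiter {
  private inFlight = 0;
  public throttled = 0;

  constructor(private readonly limit: number) {}

  // Run the function if capacity is available; otherwise count it as
  // throttled and silently skip it (no error surfaces to the caller).
  invoke(run: () => void): boolean {
    if (this.inFlight >= this.limit) {
      this.throttled++;
      return false; // dropped, not executed
    }
    this.inFlight++;
    run();
    return true;
  }

  release(): void {
    this.inFlight--;
  }
}

// A fresh-account quota of 10 vs. 15 concurrent kubectl applies
// that never complete before the next one starts:
const limiter = new ConcurrencyLimiter(10);
let applied = 0;
for (let i = 0; i < 15; i++) {
  limiter.invoke(() => { applied++; });
}
console.log(`applied=${applied} throttled=${limiter.throttled}`);
// → applied=10 throttled=5
```

The point of the sketch: nothing throws, so from CloudFormation's side the custom resource simply never reports back and the deploy eventually times out.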
FYI @shapirov103
@hshepherd thank you for this insight — it would have been very hard for us to reproduce. The custom resource Lambda is created to use all unreserved capacity. Hypothetically, if all add-ons are executed serially, the issue should be mitigated as long as you have at least some concurrency available (kubectl commands will go one at a time, though other Lambda functions may interfere). You can try defining strictly ordered behavior for all add-ons, e.g.:

```ts
import "reflect-metadata";
Reflect.defineMetadata("ordered", true, addons.EbsCsiDriverAddOn); // repeat for all addons
```
This is more of an experimental feature tbh.
Describe the bug
We are trying to deploy an EKS Blueprint with the EBS CSI AddOn. We reproducibly run into this error message
Expected Behavior
EBS CSI AddOn is successfully added to the newly spawned cluster
Current Behavior
Rollback initiated
Reproduction Steps
Possible Solution
No response
Additional Information/Context
Looked at and tried https://github.com/aws-samples/stable-diffusion-on-eks/pull/5, but no luck.
CDK CLI Version
2.115.0 (build 58027ee)
EKS Blueprints Version
1.13.1
Node.js Version
v18.16.0
Environment details (OS name and version, etc.)
Ubuntu Linux 22.04
Other information
No response