aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

(EksBlueprint name): (eks-blueprints version incompatibility) #1036

Open vpopiolrccl opened 1 week ago

vpopiolrccl commented 1 week ago

Describe the bug

After upgrading from @aws-quickstart/eks-blueprints 1.14.1 to 1.15.0, running cdk deploy against a cluster that was created with 1.14.1 produces errors in the CloudFormation events.

Expected Behavior

No changes should be made to the cluster as nothing changed in the stack

Current Behavior

The Cluster Provider nested stack produces this error when creating the Provider Waiter State Machine: Resource handler returned message: "Resource of type 'AWS::Logs::LogGroup' with identifier '{"/properties/LogGroupName":"/aws/vendedlogs/states/waiter-state-machine-rcg-ecom-cluster-sandbox--ProviderframeworkisCompl-q4ar3IV7b2Li-c823a05924272663236e0df94090e3304c5d23966c"}' already exists." (RequestToken: 48ba77a8-b8d7-7e17-71f3-1e29a5cfca0d, HandlerErrorCode: AlreadyExists)

Reproduction Steps

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.147.1

EKS Blueprints Version

1.15.0

Node.js Version

21.6.1

Environment details (OS name and version, etc.)

Mac OS 14.5

Other information

No response

vpopiolrccl commented 1 week ago

While troubleshooting, I deleted the log group that the message referred to, and the next cdk deploy completed successfully. This doesn't resolve the issue, though: we have multiple clusters running in different AWS accounts that we would like to keep maintaining with the CDK blueprints, and in those accounts access to make this type of change (deleting a log group) is very restricted.
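
The log group that has to be removed is named inside the error message itself. As a minimal sketch, the name could be pulled out of the CloudFormation event text as shown below. The helper name `extractLogGroupName` and the sample message are hypothetical (neither is part of CDK or the blueprints), and the sketch assumes the message keeps the `"/properties/LogGroupName":"..."` shape seen in the error quoted above.

```typescript
// Hypothetical helper, not part of CDK or eks-blueprints.
// Assumes the CloudFormation message embeds the identifier as
// {"/properties/LogGroupName":"<name>"}, as in the error quoted in this issue.
function extractLogGroupName(message: string): string | undefined {
    const match = message.match(/"\/properties\/LogGroupName":"([^"]+)"/);
    return match ? match[1] : undefined;
}

// Shortened, made-up message in the same shape as the real one.
const sampleMessage =
    `Resource of type 'AWS::Logs::LogGroup' with identifier ` +
    `'{"/properties/LogGroupName":"/aws/vendedlogs/states/waiter-state-machine-example"}' ` +
    `already exists.`;

console.log(extractLogGroupName(sampleMessage));
```

The extracted name could then be handed to whichever tool or role is allowed to delete log groups in that account.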

shapirov103 commented 1 week ago

@vpopiolrccl we just released 1.15.1 as a patch release for some of the backwards compatibility issues. Please give it a try with a cluster that was produced with 1.14.1, and if the issue persists, I will need a blueprint example to reproduce.

vpopiolrccl commented 1 week ago

> @vpopiolrccl we just released 1.15.1 as a patch release for some of the backwards compatibility issues. Please give it a try with a cluster that was produced with 1.14.1, and if the issue persists, I will need a blueprint example to reproduce.

Thanks @shapirov103. I also tried 1.15.1 before opening the issue, with the same results.

shapirov103 commented 1 week ago

Looking further into the log, I see that it is most likely related to the cluster log implementation addressing this issue: https://github.com/aws-quickstart/cdk-eks-blueprints/issues/997

Let me see whether we can introduce an option to reuse the existing log group for that.

vpopiolrccl commented 1 week ago

But that Log Group seems to belong to the Step Function used by the Custom Resource

shapirov103 commented 1 week ago

Yes, the native CDK implementation of the logging is using step functions to orchestrate log creation after the cluster. I am unclear about the name collision. Do I assume correctly that you have the control plane logging enabled with the blueprint?

vpopiolrccl commented 1 week ago

> Do I assume correctly that you have the control plane logging enabled with the blueprint?

Currently, not. But good point. Will most likely change this setting.

shapirov103 commented 1 week ago

Just FYI, I ran provisioning with 1.14.1 for a cluster that resembles your setup (I could not reproduce it directly, as I don't have access to your env settings or the AMI version that you use).

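import * as blueprints from '@aws-quickstart/eks-blueprints';
import { InstanceClass, InstanceSize, InstanceType } from 'aws-cdk-lib/aws-ec2';
import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';

// `scope` and `id` come from the enclosing construct (not shown).
const stackID = `${id}-blueprint`;

// Managed node group cluster on Kubernetes 1.29 with one to three m5.large nodes.
const clusterProps: blueprints.MngClusterProviderProps = {
    version: KubernetesVersion.V1_29,
    nodegroupName: 'my-ng',
    instanceTypes: [InstanceType.of(InstanceClass.M5, InstanceSize.LARGE)],
    minSize: 1,
    maxSize: 3,
};
console.log(`clusterProps: ${JSON.stringify(clusterProps)}`);
const clusterProvider = new blueprints.MngClusterProvider(clusterProps);

blueprints.EksBlueprint.builder()
    .clusterProvider(clusterProvider)
    .addOns(
        new blueprints.AwsLoadBalancerControllerAddOn(),
        new blueprints.VpcCniAddOn(),
        new blueprints.MetricsServerAddOn(),
        new blueprints.ClusterAutoScalerAddOn(),
    )
    .teams()
    .build(scope, stackID);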

I provisioned the cluster with 1.14.1, then upgraded the blueprints to 1.15.1 and reran deploy. I got no errors, and all addons were upgraded to the newer versions (e.g. load balancer controller, metrics server, etc.). That is also consistent with the experience of other customers, who did not have issues with the log group when upgrading.

I will need a full blueprint example to reproduce.

vpopiolrccl commented 1 week ago

Thanks so much @shapirov103. Looks like the problem was with 1.15.0 and not with 1.15.1. It now works for me.

paulchambers commented 6 days ago

I'm also seeing failures when going from 1.14.1 to 1.15.1. My stack does have control plane logging enabled.

Resource handler returned message: "Resource of type 'AWS::Logs::LogGroup' with identifier '{"/properties/LogGroupName":"/aws/vendedlogs/states/waiter-state-machine-STACKNAME-ProviderframeworkisCompl-S6XDAkzUUmoq-c8b1cfed19641073278d59059a5ed9e648e1781c7c"}' already exists." (RequestToken: 5fd41341-3e15-f2e5-826f-2f51001f349e, HandlerErrorCode: AlreadyExists)

shapirov103 commented 6 days ago

@paulchambers These logs are not produced by the blueprints; they are Lambda logs for the custom resources in the CDK native implementation. I see a somewhat related issue about it on the CDK repo here. If you can drop the log groups as vpopiolrccl described, that would resolve it. Please also consider running the latest cdk bootstrap on the account/region.

If the problem persists, please share the blueprint to reproduce the issue.

paulchambers commented 11 hours ago

@shapirov103 manually removing the log group does clear the error, but I'm seeing it on each cluster that I upgrade to 1.15.1.

When going from 1.14.1 to 1.15.1, the first deploy fails with "No changes needed for the logging config provided" from the Custom::AWSCDK-EKS-Cluster resource.

The second attempt fails with the log group error as above.

Removing the log group then allows the deploy to succeed.
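
Since the same failure shows up on each upgraded cluster, it may help to list the affected log groups up front. The sketch below covers only the selection step. It assumes, based on the two error messages in this thread, that the stale groups start with `/aws/vendedlogs/states/waiter-state-machine-` followed by the stack name. `selectWaiterLogGroups` and the sample names are hypothetical, and the actual listing and deletion would go through CloudWatch Logs separately.

```typescript
// Prefix observed in the error messages in this thread.
const WAITER_PREFIX = "/aws/vendedlogs/states/waiter-state-machine-";

// Hypothetical helper: from all log group names in an account/region,
// keep the waiter state machine groups that belong to one stack.
// Assumption: the stack name directly follows the prefix.
function selectWaiterLogGroups(allNames: string[], stackName: string): string[] {
    return allNames.filter((groupName) => groupName.startsWith(WAITER_PREFIX + stackName));
}

// Made-up names for illustration only.
const sampleLogGroups = [
    "/aws/vendedlogs/states/waiter-state-machine-my-stack-ProviderframeworkisCompl-abc",
    "/aws/vendedlogs/states/waiter-state-machine-other-stack-ProviderframeworkisCompl-def",
    "/aws/eks/my-stack/cluster",
];

console.log(selectWaiterLogGroups(sampleLogGroups, "my-stack"));
```

A prefix match on the stack name is only a heuristic: generated names may be shortened or carry a nested stack suffix, so the selected list is worth reviewing before anything is deleted.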

shapirov103 commented 8 hours ago

@paulchambers as I mentioned in https://github.com/aws-quickstart/cdk-eks-blueprints/issues/1036#issuecomment-2204385772, in my test I provisioned a cluster with 1.14.1, upgraded to 1.15.1, and was able to deploy successfully; all addons were updated to the latest version. It could be an issue specific to the CDK upgrade, as these log groups are created by the CDK implementation.

If there is an example that I can use to reproduce the issue, I am happy to give it a shot; if needed, I will create an issue against CDK.