aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-eks: Constructing a `AwsAuth` can completely break a cluster #28333

Open jfly opened 9 months ago

jfly commented 9 months ago

Describe the bug

We recently added a new AwsAuth construct to our cdk app, and were shocked to find that it broke all our node groups: the new AwsAuth construct overwrote the aws-auth ConfigMap managed by CDK's internal AwsAuth construct.

Expected Behavior

I expect CDK to enforce that there is exactly one AwsAuth construct per cluster.

Current Behavior

As mentioned above, the new AwsAuth construct clobbered the aws-auth ConfigMap managed by the cluster's internal AwsAuth construct.

Reproduction Steps

Here's a simple cdk app that declares a cluster, a nodegroup for that cluster, and an explicit AwsAuth for that cluster:

$ cat demo.py
import pprint

import aws_cdk as cdk
from aws_cdk import aws_ec2
from aws_cdk import aws_eks
from aws_cdk.assertions import Template

class DemoStack(cdk.Stack):
    def __init__(self, scope):
        super().__init__(scope, "DemoStack")

        cluster = aws_eks.FargateCluster(
            scope=self,
            id="demo-cluster",
            cluster_name="demo-cluster",
            version=aws_eks.KubernetesVersion.V1_24,
        )

        cluster.add_nodegroup_capacity(
            'demo-nodegroup',
            nodegroup_name='demo-nodegroup',
            instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
        )

        aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)

app = cdk.App()
demo_stack = DemoStack(app)

template = Template.from_stack(demo_stack)
custom_resources = template.find_resources("Custom::AWSCDK-EKS-KubernetesResource")
pprint.pprint(custom_resources)

Note how this produces 2 Custom::AWSCDK-EKS-KubernetesResource resources. These two resources will happily step on each other's toes: both target the same aws-auth ConfigMap, and each is applied with the (unfortunately necessary) Overwrite: true.

$ python demo.py
{'democlusterAwsAuthmanifest59F406F8': {'DeletionPolicy': 'Delete',
                                        'DependsOn': ['democlusterKubectlReadyBarrier409E0356'],
                                        'Properties': {'ClusterName': {'Ref': 'democlusterE73BD733'},
                                                       'Manifest': {'Fn::Join': ['',
                                                                                 ['[{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"aws-auth","namespace":"kube-system","labels":{"aws.cdk.eks/prune-c8e20f52ee1502bfdc03c85de68fda3a9cbb05b58d":""}},"data":{"mapRoles":"[{\\"rolearn\\":\\"',
                                                                                  {'Fn::GetAtt': ['democlusterfargateprofiledefaultPodExecutionRoleB482677B',
                                                                                                  'Arn']},
                                                                                  '\\",\\"username\\":\\"system:node:{{SessionName}}\\",\\"groups\\":[\\"system:bootstrappers\\",\\"system:nodes\\",\\"system:node-proxier\\"]},{\\"rolearn\\":\\"',
                                                                                  {'Fn::GetAtt': ['democlusterNodegroupdemonodegroupNodeGroupRole6A5CDA1C',
                                                                                                  'Arn']},
                                                                                  '\\",\\"username\\":\\"system:node:{{EC2PrivateDNSName}}\\",\\"groups\\":[\\"system:bootstrappers\\",\\"system:nodes\\"]}]","mapUsers":"[]","mapAccounts":"[]"}}]']]},
                                                       'Overwrite': True,
                                                       'PruneLabel': 'aws.cdk.eks/prune-c8e20f52ee1502bfdc03c85de68fda3a9cbb05b58d',
                                                       'RoleArn': {'Fn::GetAtt': ['democlusterCreationRole733E3675',
                                                                                  'Arn']},
                                                       'ServiceToken': {'Fn::GetAtt': ['awscdkawseksKubectlProviderNestedStackawscdkawseksKubectlProviderNestedStackResourceA7AEBA6B',
                                                                                       'Outputs.DemoStackawscdkawseksKubectlProviderframeworkonEvent705EBCF2Arn']}},
                                        'Type': 'Custom::AWSCDK-EKS-KubernetesResource',
                                        'UpdateReplacePolicy': 'Delete'},
 'problematicawsauthmanifestEB215835': {'DeletionPolicy': 'Delete',
                                        'DependsOn': ['democlusterKubectlReadyBarrier409E0356'],
                                        'Properties': {'ClusterName': {'Ref': 'democlusterE73BD733'},
                                                       'Manifest': '[{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"aws-auth","namespace":"kube-system","labels":{"aws.cdk.eks/prune-c8993b9de23d4ff03256c14b261e46b123d26fe727":""}},"data":{"mapRoles":"[]","mapUsers":"[]","mapAccounts":"[]"}}]',
                                                       'Overwrite': True,
                                                       'PruneLabel': 'aws.cdk.eks/prune-c8993b9de23d4ff03256c14b261e46b123d26fe727',
                                                       'RoleArn': {'Fn::GetAtt': ['democlusterCreationRole733E3675',
                                                                                  'Arn']},
                                                       'ServiceToken': {'Fn::GetAtt': ['awscdkawseksKubectlProviderNestedStackawscdkawseksKubectlProviderNestedStackResourceA7AEBA6B',
                                                                                       'Outputs.DemoStackawscdkawseksKubectlProviderframeworkonEvent705EBCF2Arn']}},
                                        'Type': 'Custom::AWSCDK-EKS-KubernetesResource',
                                        'UpdateReplacePolicy': 'Delete'}}

Possible Solution

I have read through https://github.com/aws/aws-cdk/issues/19218, which talks about a similar issue (the aws-auth ConfigMap getting unexpectedly clobbered), but that's all about using multiple tools to manage a cluster. In the scenario described here, we're using exactly 1 CDK app to manage our cluster, and IMO, CDK should protect us against this.

Some ideas for how to fix this:

  1. When constructing an AwsAuth for a cluster, check whether an AwsAuth has already been instantiated for that cluster. If there is one already, crash.
  2. Don't expose the AwsAuth construct at all; keep it as an internal implementation detail of Cluster. As I understand things, it's never something you'd want to use directly when you've declared your cluster in cdk (instead you'd interact with the underlying cluster.awsAuth attribute as documented here).
  3. (what we're going to do until this is addressed in cdk itself) Add a post-synthesis step that loops over all AwsAuth constructs in the tree and asserts that they're for different clusters.
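For option 3, the post-synthesis assertion might look roughly like the sketch below. The helper name `assert_one_aws_auth_per_cluster` and the way you collect one cluster identifier per AwsAuth construct (e.g. by walking the tree with `app.node.find_all()` and filtering on `isinstance(node, aws_eks.AwsAuth)`) are illustrative assumptions, not existing CDK API:

```python
from collections import Counter

def assert_one_aws_auth_per_cluster(auth_cluster_ids):
    """Fail fast if any cluster is targeted by more than one AwsAuth.

    `auth_cluster_ids` holds one entry per AwsAuth construct found in the
    app's construct tree, identifying the cluster that construct manages
    (e.g. the cluster construct's node address).
    """
    duplicates = sorted(
        cluster
        for cluster, count in Counter(auth_cluster_ids).items()
        if count > 1
    )
    if duplicates:
        raise AssertionError(
            f"More than one AwsAuth construct targets cluster(s) {duplicates}; "
            "each extra one will clobber the aws-auth ConfigMap."
        )

# One AwsAuth per cluster is fine:
assert_one_aws_auth_per_cluster(["demo-cluster", "other-cluster"])

# Two AwsAuths for the same cluster should fail the build at synth time,
# long before the ConfigMap gets clobbered at deploy time.
```

Running this after `app.synth()` turns a deploy-time outage into a build-time error.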

Additional Information/Context

No response

CDK CLI Version

2.109.0 (build 941dc16)

Framework Version

No response

Node.js Version

v18.15.0

OS

N/A (this happens both on macOS and Linux)

Language

Python

Language Version

Python (3.10.6)

Other information

No response

pahud commented 9 months ago

Thanks for the report.

Actually, every Cluster comes with a built-in AwsAuth, so you don't need to create a new one. Instead, you should update the existing one as described in the doc:

cluster.awsAuth.addMastersRole(role);

Is there any reason you need to create another one for the existing cluster?

jfly commented 9 months ago

Is there any reason you need to create another one for the existing cluster?

@pahud, no. For more context: we recently had a complete outage of one of our clusters because we deployed a diff adding a new AwsAuth construct to our cdk app. That definitely wasn't what we meant to do (as you said, we should have used cluster.awsAuth.addMastersRole), but mistakes happen, and IMO this is a very dangerous footgun for CDK to have.

I totally understand that it's always going to be possible for people to do dangerous things with an IAC tool (we could delete our cluster, for instance), but we are diligent about reviewing all code changes to our CDK app, and nobody at our company recognized that this diff would cause a cluster outage:

--- before
+++ after
@@ -14,3 +14,5 @@
             nodegroup_name='demo-nodegroup',
             instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
         )
+
+        aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)

Running cdk diff doesn't help either: you just see the addition of a custom resource, with no clue that it is going to do something very dangerous:

$ cdk diff
Stack DemoStack
Resources
[+] Custom::AWSCDK-EKS-KubernetesResource problematic-aws-auth/manifest/Resource problematicawsauthmanifestEB215835

In other words: it's totally OK for CDK to enable us to do dangerous things, but (IMO) those dangerous things need to at least look dangerous. Deleting a cluster, for example, is very obviously dangerous, both in the code diff and in the cdk diff. By contrast, if the AwsAuth class were renamed to something like DangerouslyClobberAwsAuthInYourClusterYouProbablyDoNotWantToConstructMeYourself, then we probably wouldn't have had our outage =)

That all said, I still think it would be better for CDK to just ensure that there are exactly 0 or 1 AwsAuth constructs per cluster. I think this would be a pretty straightforward change that would remove this footgun for other people. I'd be happy to send in such a PR if that sounds acceptable to you?
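Concretely, the proposed guardrail could be a duplicate check at construction time, keyed on the cluster. Below is a hedged sketch in plain Python; the `AwsAuthGuard` registry is illustrative bookkeeping (a real PR would more likely check the cluster construct's existing children or node context rather than keep a separate set):

```python
class DuplicateAwsAuthError(Exception):
    """Raised when a second AwsAuth is declared for the same cluster."""

class AwsAuthGuard:
    """Sketch of the proposed 0-or-1 enforcement.

    Calling register(cluster_id) from the AwsAuth constructor makes the
    second registration for the same cluster crash at synth time instead
    of silently clobbering the aws-auth ConfigMap at deploy time.
    """

    def __init__(self):
        self._clusters_with_aws_auth = set()

    def register(self, cluster_id):
        if cluster_id in self._clusters_with_aws_auth:
            raise DuplicateAwsAuthError(
                f"An AwsAuth construct already exists for cluster "
                f"{cluster_id!r}; use cluster.awsAuth instead of "
                "creating another one."
            )
        self._clusters_with_aws_auth.add(cluster_id)

guard = AwsAuthGuard()
guard.register("demo-cluster")    # the cluster's built-in AwsAuth: allowed
# guard.register("demo-cluster")  # a second AwsAuth would raise DuplicateAwsAuthError
```

The key design point is that the failure moves from deploy time (an outage) to construction time (a stack trace in code review or CI).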

jfly commented 8 months ago

@pahud, any update here?

pahud commented 8 months ago

@jfly Totally agreed with you.

Yes, it sounds good to have a PR as the guardrail. Feel free to submit your PR for that. Thanks.

jfly commented 8 months ago

I just learned that the aws-auth ConfigMap is being replaced by the new access entries and access policies mechanism: https://aws.amazon.com/blogs/containers/a-deep-dive-into-simplified-amazon-eks-access-management-controls/. We're going to put our effort into that instead, which should avoid the issue described here.

pahud commented 8 months ago

Yes, it's related to https://github.com/aws/aws-cdk/issues/28588

jsamuel1 commented 5 months ago

I just got hit by this issue and lost a full day of engineering for an entire team as we repaired all of the test clusters this deployed to before we realised it had broken everything.

Minimum viable fix: please add a warning to the documentation saying "Do not instantiate this class directly. Use the ICluster.awsAuth property instead.", until deprecating the public constructor can be prioritized.

jsamuel1 commented 5 months ago

Of note: I was using CDK Blueprints for EKS to build the cluster in one stack, and my Google search for how to add aws-auth permissions for my Lambda in a related stack led me to the AwsAuth construct, rather than to the instance of the class on ICluster.