jfly opened this issue 9 months ago
Thanks for the report.
Actually, every Cluster comes with a built-in AwsAuth, so you don't need to create a new one. Instead, you should update the existing one, as described in the docs:
cluster.awsAuth.addMastersRole(role);
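What addMastersRole does can be sketched in plain Python: it appends one role mapping to the single, cluster-managed aws-auth ConfigMap instead of replacing it. This is a toy model only (no CDK; the class name and role ARNs are illustrative):

```python
# Toy model (no CDK) of what cluster.awsAuth.addMastersRole(role) does:
# it appends one role mapping to the single, cluster-managed aws-auth
# ConfigMap; existing mappings are preserved.

class AwsAuthModel:
    """Stand-in for the cluster's built-in AwsAuth construct."""

    def __init__(self):
        self.map_roles = []  # entries rendered into the aws-auth ConfigMap

    def add_masters_role(self, role_arn):
        # Additive: earlier mappings (e.g. the node group's role) survive.
        self.map_roles.append({
            "rolearn": role_arn,
            "groups": ["system:masters"],
        })

aws_auth = AwsAuthModel()
aws_auth.add_masters_role("arn:aws:iam::111122223333:role/node-role")
aws_auth.add_masters_role("arn:aws:iam::111122223333:role/admin-role")
assert len(aws_auth.map_roles) == 2  # both mappings coexist
```

Contrast this with constructing a second AwsAuth, which renders a whole new ConfigMap rather than adding to the existing one.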
Is there any reason you need to create another one for the existing cluster?
@pahud, no. For more context: we recently had a complete outage of one of our clusters because we deployed a diff adding a new AwsAuth
construct to our cdk app. That definitely wasn't what we meant to do (as you said, we should have used cluster.awsAuth.addMastersRole
), but mistakes happen, and IMO this is a very dangerous footgun for CDK to have.
I totally understand that it's always going to be possible for people to do dangerous things with an IaC tool (we could delete our cluster, for instance), but we are diligent about reviewing all code changes to our CDK app, and nobody at our company recognized that this diff would cause a cluster outage:
--- before
+++ after
@@ -14,3 +14,5 @@
nodegroup_name='demo-nodegroup',
instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
)
+
+ aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)
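Why this diff is catastrophic can be sketched in plain Python: both the cluster's internal AwsAuth and the newly added one render their own aws-auth ConfigMap and apply it with overwrite semantics, so whichever applies last wins. A toy model (no CDK; role ARNs are made up):

```python
# Sketch (plain Python, no CDK) of why two AwsAuth constructs break a
# cluster: each renders its own aws-auth ConfigMap and applies it with
# overwrite semantics, so whichever Custom::AWSCDK-EKS-KubernetesResource
# applies last wins and the other's role mappings vanish.

config_maps = {}  # stand-in for ConfigMaps in the kube-system namespace

def apply_manifest(name, body, overwrite=True):
    """Model of the KubernetesResource apply: replaces the named object."""
    if overwrite or name not in config_maps:
        config_maps[name] = body

# The cluster's internal AwsAuth: carries the node group's role mapping.
apply_manifest("aws-auth", {"mapRoles": [
    {"rolearn": "arn:aws:iam::111122223333:role/node-role",
     "groups": ["system:bootstrappers", "system:nodes"]},
]})

# The accidentally added second AwsAuth: empty, but applied with overwrite.
apply_manifest("aws-auth", {"mapRoles": []})

# The node role mapping is gone -> nodes can no longer join the cluster.
print(config_maps["aws-auth"]["mapRoles"])  # []
```

The order of the two applies isn't guaranteed, which makes the failure mode even harder to spot in review.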
Running cdk diff doesn't help either: you just see the addition of a custom resource, with no clue that it's going to do something very dangerous:
$ cdk diff
Stack DemoStack
Resources
[+] Custom::AWSCDK-EKS-KubernetesResource problematic-aws-auth/manifest/Resource problematicawsauthmanifestEB215835
In other words: it's totally OK for CDK to enable us to do dangerous things, but (IMO) those dangerous things need to at least look dangerous. Deleting a cluster, for example, is very obviously dangerous, both in the code diff and in the cdk diff. If the AwsAuth class were instead renamed to something like DangerouslyClobberAwsAuthInYourClusterYouProbablyDoNotWantToConstructMeYourself, then we probably wouldn't have had our outage =)
That all said, I still think it would be better for CDK to just ensure that there are exactly 0 or 1 AwsAuth
constructs per cluster. I think this would be a pretty straightforward change that would remove this footgun for other people. I'd be happy to send in such a PR if that sounds acceptable to you?
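Such a guardrail could look roughly like this in plain Python (a sketch only, not the actual aws-cdk implementation; the class names just mirror the discussion):

```python
# Sketch of the proposed guardrail (plain Python, no CDK): record the
# AwsAuth created for each cluster and fail fast if a second one is
# constructed. Names are illustrative, not the real aws-cdk code.

class Cluster:
    pass

class AwsAuth:
    _instances = {}  # id(cluster) -> the one AwsAuth allowed for it

    def __init__(self, cluster):
        if id(cluster) in AwsAuth._instances:
            raise RuntimeError(
                "An AwsAuth already exists for this cluster; use "
                "cluster.aws_auth instead of constructing a second one.")
        AwsAuth._instances[id(cluster)] = self
        self.cluster = cluster

cluster = Cluster()
first = AwsAuth(cluster)   # fine: plays the role of the built-in AwsAuth
try:
    AwsAuth(cluster)       # a second one for the same cluster -> error
except RuntimeError as exc:
    print(exc)
```

In real CDK code this check would most naturally live in the AwsAuth constructor or in a synth-time validation, so the failure surfaces before anything is deployed.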
@pahud, any update here?
@jfly Totally agreed with you.
Yes, it sounds good to have a PR as the guardrail. Feel free to submit your PR for that. Thanks.
I just learned that this aws-auth ConfigMap is being replaced with the new access entries and access policies mechanism: https://aws.amazon.com/blogs/containers/a-deep-dive-into-simplified-amazon-eks-access-management-controls/. We're going to put our effort into that, which should avoid the issue described here.
Yes, it's related to https://github.com/aws/aws-cdk/issues/28588
I just got hit by this issue and it cost an entire team a full day of engineering as we repaired all of the test clusters this had been deployed to before we realised it broke everything.
Minimum viable fix - Please add a warning in the documentation to say: "Do not instantiate this class directly. Use the ICluster.awsAuth property", until you can prioritize the deprecation of the public constructor.
Of note: I was using CDK Blueprints for EKS to build the cluster in one stack, and my Google search on how to add aws-auth permissions for my Lambda in a related stack got me to the AwsAuth construct, rather than to the instance of the class on ICluster.
Describe the bug

We recently added a new AwsAuth construct to our cdk app. We were shocked to find that this broke all our node groups (the new AwsAuth construct overwrote the aws-auth ConfigMap from cdk's internally managed AwsAuth construct).

Expected Behavior
I expect CDK to enforce that there is exactly one AwsAuth construct per cluster.

Current Behavior
As mentioned above, the new AwsAuth construct clobbered the aws-auth ConfigMap managed by the cluster's internal AwsAuth construct.

Reproduction Steps
Here's a simple cdk app that declares a cluster, a nodegroup for that cluster, and an explicit AwsAuth for that cluster:

Note how this produces 2 Custom::AWSCDK-EKS-KubernetesResource resources. These two resources will happily step on each other's toes (because of this (unfortunately necessary) overwrite: true).

Possible Solution
I have read through https://github.com/aws/aws-cdk/issues/19218, which talks about a similar issue (the aws-auth ConfigMap getting unexpectedly clobbered), but that's all about using multiple tools to manage a cluster. In the scenario described here, we're using exactly 1 CDK app to manage our cluster, and IMO, CDK should protect us against this.
Some ideas for how to fix this:

- When instantiating an AwsAuth for a cluster, check if there's already an AwsAuth instantiated for that cluster. If there is one already, crash.
- Don't expose the AwsAuth construct at all; just keep it as an internal implementation detail of Cluster. As I understand things, it's never something you'd want to use directly when you've declared your cluster in cdk (instead you'd want to interact with the underlying cluster.awsAuth attribute, as documented here).
- Add a validation that finds all AwsAuth constructs in the tree and asserts that they're for different clusters.

Additional Information/Context
No response
CDK CLI Version
2.109.0 (build 941dc16)
Framework Version
No response
Node.js Version
v18.15.0
OS
N/A (this happens both on macOS and Linux)
Language
Python
Language Version
Python (3.10.6)
Other information
No response