sbstjn opened this issue 3 months ago
@sbstjn your CDK version is set to 2.115 - is that correct? You should have received an error/warning when upgrading your version of the blueprints as we pin peerDependency to the exact version.
Please upgrade to 2.132.0.
I believe the issue is caused by the `enable-windows-ipam` setting of the VPC CNI add-on, which handles the modification of the aws-auth config map internally. That change bypasses CDK; in other words, CDK is not aware of it.
CDK's current behavior for the aws-auth config map is to accumulate all modifications to the configmap and apply them as a single document (there is no patch command for it).
Potentially, we could look for an option to add this mapping in the blueprint, say with a no-op team that only contributes the Windows mapping part (see the sketch below).
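A rough sketch of what such a no-op team could look like, assuming the blueprints `Team` interface (`name` plus `setup(clusterInfo)`), that `clusterInfo.cluster` exposes the CDK cluster with its `awsAuth` helper, and that the caller can pass in the Windows node group's IAM role; the class name is made up for illustration:

```typescript
import * as blueprints from '@aws-quickstart/eks-blueprints';
import * as iam from 'aws-cdk-lib/aws-iam';

// Hypothetical team that contributes nothing except the Windows group
// mapping, so it is part of every aws-auth document that CDK assembles.
export class WindowsAuthTeam implements blueprints.Team {
  readonly name = 'windows-auth';

  constructor(private readonly windowsNodeRole: iam.IRole) {}

  setup(clusterInfo: blueprints.ClusterInfo): void {
    // Register the mapping through CDK itself, so redeployments that
    // rebuild aws-auth keep eks:kube-proxy-windows in place.
    clusterInfo.cluster.awsAuth.addRoleMapping(this.windowsNodeRole, {
      username: 'system:node:{{EC2PrivateDNSName}}',
      groups: [
        'system:bootstrappers',
        'system:nodes',
        'eks:kube-proxy-windows',
      ],
    });
  }
}
```

Because the mapping would then be tracked by CDK, it should survive any deployment that regenerates the single aws-auth document, instead of being re-added only by the VPC CNI side effect.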
I guess it's just a mismatch between "cdk version" and "yarn cdk version". It's been chaotic with this problem. I will check later, but it should not affect this.
Having a generic `appendWindowsMapping: true/false` option for the aws-auth generation could also do the trick. I'd rather have an explicit configuration than the current "magic update in the background."
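To make the proposal concrete, here is a hypothetical shape for such an option; none of this exists in the blueprints API today, and both the interface and the flag name are invented for illustration:

```typescript
// Invented for illustration only; not part of eks-blueprints.
interface AwsAuthGenerationOptions {
  // When true, every regeneration of aws-auth would append the
  // eks:kube-proxy-windows group to the node role mappings instead of
  // relying on the VPC CNI add-on to patch it back in later.
  appendWindowsMapping?: boolean;
}

const awsAuthOptions: AwsAuthGenerationOptions = {
  appendWindowsMapping: true,
};
```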
In the worst case, I'd write a custom Lambda function and hook it to an event in EventBridge to overwrite the config after every potential change/deployment.
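For that worst-case workaround, the EventBridge wiring could look roughly like the sketch below. It assumes CloudFormation's "Stack Status Change" events and a pre-existing `repairFn` Lambda (not shown) that re-adds the mapping via the Kubernetes API; the rule name and function are illustrative:

```typescript
import { Stack } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Sketch: invoke a repair function after every stack update completes,
// so a clobbered aws-auth gets restored shortly after each deployment.
export function wireAwsAuthRepair(stack: Stack, repairFn: lambda.IFunction): void {
  new events.Rule(stack, 'AwsAuthRepairRule', {
    eventPattern: {
      source: ['aws.cloudformation'],
      detailType: ['CloudFormation Stack Status Change'],
      detail: {
        'status-details': { status: ['UPDATE_COMPLETE'] },
      },
    },
    targets: [new targets.LambdaFunction(repairFn)],
  });
}
```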
Currently, every CDK deployment has the potential risk of a corrupted group mapping and a degraded node group. Of course, this could also happen with plain kubectl commands …
It's not you, it's Kubernetes. I know ☺️
@renukakrishnan Pls check this.
This messed up my EKS cluster again. A change that was totally unrelated to the Windows nodes somehow updated the values of `aws-auth` and removed the `eks:kube-proxy-windows` entry.
Deleting/recreating the node group triggered the "magic automated fix in the background."
Is there a way to enforce the `eks:kube-proxy-windows` mapping in the config map?
@renukakrishnan Can you help with this if you have bandwidth?
Describe the bug
When the `aws-auth` entry in the ConfigMap gets updated, the Windows-specific `eks:kube-proxy-windows` group mapping may get removed, and existing Windows node groups end up in an unhealthy/degraded state.

Expected Behavior
If a cluster is configured to run Windows nodes and the `eks:kube-proxy-windows` group mapping exists, any updates to the `aws-auth` ConfigMap must not overwrite the existing group mapping.

Current Behavior
A CDK deployment may overwrite the `aws-auth` configuration. Any other AddOn may as well.

Reproduction Steps
It's pretty simple to create a cluster with an unhealthy windows node group:

- use `1.13.1` or `1.14.0` of `eks-blueprint` with a Windows node group and a `platform` team (a minimal sketch of such a blueprint follows after these steps)

Luckily, it's pretty simple to fix it again:

- create a new Windows node group (the `aws-auth` configuration is updated with the `eks:kube-proxy-windows` mapping)

Sadly, it's pretty simple to break it again:

- deploy any unrelated change (the `aws-auth` configuration will remove/overwrite the `eks:kube-proxy-windows` mapping)

Luckily, it's pretty simple to fix it again:

… you can continue this forever …
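For context, a minimal sketch of the kind of blueprint involved, assuming the standard `EksBlueprint` builder; the Windows node group and the VPC CNI `enable-windows-ipam` configuration that trigger the rewrite are part of the original setup and intentionally omitted here, and the account/region wiring is illustrative:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

// Minimal shape of the affected setup: a blueprint with a platform team.
// The Windows node group configuration is not shown.
blueprints.EksBlueprint.builder()
  .account(process.env.CDK_DEFAULT_ACCOUNT)
  .region(process.env.CDK_DEFAULT_REGION)
  .teams(new blueprints.PlatformTeam({ name: 'platform' }))
  .build(app, 'eks-blueprint');
```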
Possible Solution
Don't know. It's a nasty problem that one AddOn knows about the Windows-specific configuration while other AddOns naively overwrite the `aws-auth` configuration.

Using CloudWatch Insights, you can identify the API request that "fixes" `aws-auth` and re-adds the needed `eks:kube-proxy-windows` mapping.

I don't know about the EKS/k8s internals, but maybe it's somehow possible to "trigger" the fixing update without the need to create a new Windows node group.
Additional Information/Context
AddOn Configuration
The "broken"
aws-auth
results in an inline JSON string:Creating a new node group will recreate the
aws-auth
in nice yaml format:As soon as the node is ready, the
aws-auth
config is fixed:After a few minutes, both windows node groups are healthy again β¦
CDK CLI Version
2.115.0
EKS Blueprints Version
1.14.0
Node.js Version
v18.17.1
Environment details (OS name and version, etc.)
macOS
Other information
Thanks to EKS/CloudFormation update durations, this is horrible to debug.