aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

Issue with cdk blueprint version 1.4 when running ClusterAutoScalerAddOn on Kubernetes 1.23 #531

Closed bnaydenov closed 1 year ago

bnaydenov commented 1 year ago

Describe the bug

When cdk blueprints version 1.4 is used and ClusterAutoScalerAddOn is installed on EKS Kubernetes 1.23, the pod blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler fails to start.

The same setup works without problems on EKS Kubernetes 1.21 and 1.22.

Expected Behavior

When installing ClusterAutoScalerAddOn on EKS Kubernetes 1.23, the pod blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler is supposed to start without errors.

Current Behavior

The pod blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler starts, and during the startup phase, which takes about 15-20 seconds, the pod crashes with the following errors:

W1103 20:50:48.673793       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W1103 20:50:48.691943       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I1103 20:50:48.697788       1 cloud_provider_builder.go:29] Building aws cloud provider.
I1103 20:50:48.697953       1 reflector.go:219] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698079       1 reflector.go:255] Listing and watching *v1.CSIDriver from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698147       1 reflector.go:219] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698298       1 reflector.go:255] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698382       1 reflector.go:219] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698397       1 reflector.go:255] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698482       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698490       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698691       1 reflector.go:219] Starting reflector *v1beta1.CSIStorageCapacity (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698707       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698193       1 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698713       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698717       1 reflector.go:255] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698090       1 reflector.go:219] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698939       1 reflector.go:255] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699055       1 reflector.go:219] Starting reflector *v1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699069       1 reflector.go:255] Listing and watching *v1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699107       1 reflector.go:219] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699113       1 reflector.go:255] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698707       1 reflector.go:255] Listing and watching *v1beta1.CSIStorageCapacity from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698280       1 reflector.go:219] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699200       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698877       1 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.697981       1 reflector.go:219] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699249       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698994       1 reflector.go:219] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699267       1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699254       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:134
F1103 20:50:48.774746       1 aws_cloud_provider.go:369] Failed to generate AWS EC2 Instance Types: UnauthorizedOperation: You are not authorized to perform this operation.
        status code: 403, request id: daf5f899-2e44-4ffe-b66a-232afba2e473
goroutine 48 [running]:
k8s.io/klog/v2.stacks(0x1)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x611e4e0, 0x3, 0x0, 0xc00039b6c0, 0x0, {0x4d2e584, 0x1}, 0xc000f042a0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/klog/v2.(*loggingT).printf(0x203000, 0x203000, 0x0, {0x0, 0x0}, {0x3cd1b88, 0x2d}, {0xc000f042a0, 0x1, 0x1})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:753 +0x1c5
k8s.io/klog/v2.Fatalf(...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1532
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.BuildAWS({{0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000}, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...}, ...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:369 +0x3f7
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.buildCloudProvider({{0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000}, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...}, ...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/builder_all.go:77 +0xea

The main error is: Failed to generate AWS EC2 Instance Types: UnauthorizedOperation: You are not authorized to perform this operation.

Reproduction Steps

Just use cdk blueprints 1.4 to spin up a brand new EKS Kubernetes 1.23 cluster with ClusterAutoScalerAddOn. The cdk deploy step will succeed, but afterwards the pod blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler will crash and cannot be started.

Possible Solution

I have found what is wrong and will prepare a PR to fix this.

TLDR: We need to add the missing permission ec2:DescribeInstanceTypes to the cluster-autoscaler IAM statements here:

https://github.com/aws-quickstart/cdk-eks-blueprints/blob/c03512bfd71ab735868da7c6acb43cfbb0e51dfe/lib/addons/cluster-autoscaler/index.ts#L79

For more info check here: https://github.com/kubernetes/autoscaler/issues/3216 https://github.com/particuleio/terraform-kubernetes-addons/pull/1320

I have created a local monkey patch of addon/cluster-autoscaler in https://github.com/aws-quickstart/cdk-eks-blueprints/blob/c03512bfd71ab735868da7c6acb43cfbb0e51dfe/lib/addons/cluster-autoscaler/index.ts#L79 by adding this missing permission "ec2:DescribeInstanceTypes", and everything works as expected.
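To illustrate the patch, here is a sketch of the kind of IAM statement involved. This is not the exact structure used in lib/addons/cluster-autoscaler/index.ts (the add-on builds the statement with CDK's iam.PolicyStatement); it is a plain-object approximation showing the action list with the missing permission added:

```typescript
// Sketch only: an IAM policy statement for the cluster-autoscaler node role.
// The action list mirrors the permissions the autoscaler commonly needs; the
// last entry is the one this issue is about.
const autoscalerPolicyStatement = {
  Effect: "Allow",
  Action: [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:DescribeLaunchConfigurations",
    "autoscaling:DescribeTags",
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
    // Missing permission: the chart version used for EKS 1.23 calls
    // DescribeInstanceTypes at startup, and without this the pod crashes
    // with UnauthorizedOperation (403).
    "ec2:DescribeInstanceTypes",
  ],
  Resource: "*",
};

console.log(autoscalerPolicyStatement.Action.includes("ec2:DescribeInstanceTypes")); // true
```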

Additional Information/Context

No response

CDK CLI Version

2.50.0 (build 4c11af6)

EKS Blueprints Version

1.4.0

Node.js Version

v16.17.0

Environment details (OS name and version, etc.)

Mac OS Monterey - Version 12.6

Other information

No response

bnaydenov commented 1 year ago

@shapirov103 you can close this one

softmates commented 1 year ago

@bnaydenov how is it working on 1.22 without the policy "ec2:DescribeInstanceTypes"?

bnaydenov commented 1 year ago

@softmates most likely it is due to the different helm chart version that the autoscaler uses for each version of EKS Kubernetes. Check this file: https://github.com/aws-quickstart/cdk-eks-blueprints/blob/c03512bfd71ab735868da7c6acb43cfbb0e51dfe/lib/addons/cluster-autoscaler/index.ts#L44

EKS Kubernetes 1.21 and 1.22 use the same helm chart version, but 1.23 uses a different one:

const versionMap = new Map([
    [KubernetesVersion.V1_23, "9.21.0"],
    [KubernetesVersion.V1_22, "9.13.1"],
    [KubernetesVersion.V1_21, "9.13.1"],
    [KubernetesVersion.V1_20, "9.9.2"],
    [KubernetesVersion.V1_19, "9.4.0"],
    [KubernetesVersion.V1_18, "9.4.0"],
]);

softmates commented 1 year ago

Yep, that makes sense. Looking at the spec, though, I don't see a need for the additional policy "ec2:DescribeInstanceTypes"; refer to the comparison:

https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.13.0

spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
      ]
...

https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0

spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
      ]
...
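Note that both chart versions' example node policies above are identical, and neither grants ec2:DescribeInstanceTypes; the requirement comes from the autoscaler binary itself, not from the chart's documented policy. A small illustrative check (hypothetical helper, not part of the blueprint) over the policy JSON quoted above:

```typescript
// Hypothetical helper: does a list of IAM statements grant a given action?
interface IamStatement {
  Effect: string;
  Action: string[];
  Resource: string;
}

function grantsAction(statements: IamStatement[], action: string): boolean {
  return statements.some(
    (s) => s.Effect === "Allow" && s.Action.includes(action)
  );
}

// The additionalPolicies snippet shared by both chart versions above.
const chartNodePolicy: IamStatement[] = JSON.parse(`[
  {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups",
  "autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations",
  "autoscaling:DescribeTags","autoscaling:SetDesiredCapacity",
  "autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
]`);

console.log(grantsAction(chartNodePolicy, "ec2:DescribeInstanceTypes")); // false
console.log(grantsAction(chartNodePolicy, "autoscaling:DescribeTags")); // true
```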

bnaydenov commented 1 year ago

@softmates I was inspired by this issue:

https://github.com/kubernetes/autoscaler/issues/3216 more specifically this comment https://github.com/kubernetes/autoscaler/issues/3216#issuecomment-1047164006

bnaydenov commented 1 year ago

These changes are released in 1.4.1, so I will close the issue.