aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0

(module name): (short issue description) #1053

Open NicoleY666 opened 1 month ago

NicoleY666 commented 1 month ago

Describe the bug

Here is the addon deployment:

const region = "ap-southeast-2";

const karpenterAddOn = new blueprints.addons.KarpenterAddOn({
  version: '0.35.5',
  nodePoolSpec: {
    requirements: [
      { key: 'node.kubernetes.io/instance-type', operator: 'In', values: ["c5d.4xlarge", "c6a.2xlarge", "c6a.4xlarge", "c6a.8xlarge", "c6a.16xlarge"] },
      { key: 'topology.kubernetes.io/zone', operator: 'In', values: [`${region}a`, `${region}b`, `${region}c`] },
      { key: 'kubernetes.io/arch', operator: 'In', values: ['amd64', 'arm64'] },
      { key: 'karpenter.sh/capacity-type', operator: 'In', values: ['on-demand'] },
    ],
    disruption: {
      consolidationPolicy: "WhenEmpty",
      consolidateAfter: "30s",
      expireAfter: "72h",
      budgets: [{ nodes: "10%" }]
    }
  },
  ec2NodeClassSpec: {
    amiFamily: "AL2",
    subnetSelectorTerms: [{ tags: { "ops:repo": "xxxx" } }],
    securityGroupSelectorTerms: [{ tags: { "aws:eks:cluster-name": 'xxxxx' } }],
  },
  interruptionHandling: true,
  podIdentity: false,
});

const addOns: Array<blueprints.ClusterAddOn> = [
  new blueprints.addons.CalicoOperatorAddOn(),
  new blueprints.addons.MetricsServerAddOn(),
  new blueprints.addons.AwsLoadBalancerControllerAddOn({
    enableWaf: false,
    version: mapping[env].helmChartVersion,
  }),
  new blueprints.addons.VpcCniAddOn(),
  new blueprints.addons.CoreDnsAddOn(),
  new blueprints.addons.KubeProxyAddOn(),
  new blueprints.addons.SSMAgentAddOn(),
  new blueprints.addons.CloudWatchInsights(),
  karpenterAddOn
];
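
For context, here is a minimal sketch of how this addOns array and the Karpenter add-on are typically wired into a blueprint stack; the stack id, app setup, and account/region wiring below are illustrative assumptions on my part, not part of the original report:

import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

// Illustrative stack id and environment lookup; substitute your own values.
blueprints.EksBlueprint.builder()
  .account(process.env.CDK_DEFAULT_ACCOUNT!)
  .region(region)                 // "ap-southeast-2" from the snippet above
  .addOns(...addOns)              // includes karpenterAddOn
  .build(app, 'karpenter-scaling-blueprint');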

Attached: the new node comes up with no deployments running on it.

[Screenshot 2024-08-03 at 9:53:23 PM]

Expected Behavior

I want Karpenter to scale nodes up and down automatically, as a smart autoscaler should.

Current Behavior

There are three nodes running at the same time:

NAME                                             STATUS  ROLE  TAINTS  VERSION              PODS  CPU  MEM   %CPU  %MEM  CPU/A  MEM/A  AGE
ip-10-60-63-193.ap-southeast-2.compute.internal  Ready         0       v1.29.3-eks-ae9a62a  14    113  1866  0     6     15890  28360  32h
ip-10-60-74-255.ap-southeast-2.compute.internal  Ready         0       v1.29.3-eks-ae9a62a  16    130  1755  0     6     15890  28360  31h
ip-10-60-95-42.ap-southeast-2.compute.internal   Ready         3       v1.29.6-eks-1552ad0  9     0    0     0     0     7910   14640  8h

The third node is not scaled down as expected, even though there are no new deployments. When I checked the pods on the third node, I found calico-system/calico-typha-988d6c9c5-fh55r, which is not part of a DaemonSet and is blocking Karpenter from scaling the node down. That pod is deployed by CalicoOperatorAddOn(), which creates three pods:

calico-system/calico-node-jnx6c (DaemonSet)
calico-system/calico-typha-988d6c9c5-fh55r (Deployment)
calico-system/csi-node-driver-ld48d (DaemonSet)

Since calico-typha is created by the add-on, I don't know how to make Karpenter scale the node down as expected.

Reproduction Steps

Deploy the code shown above.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.147.3

EKS Blueprints Version

1.15.1

Node.js Version

20

Environment details (OS name and version, etc.)

EKS

Other information

No response

shapirov103 commented 1 month ago

@NicoleY666

CalicoOperator addon deploys the operator only. That component then deploys the pods that you mentioned. This functionality is controlled by calico.

Let me understand the issue: you have a node with calico pods running. Your screenshots show the pods running on ip-10-60-95-42. Is that the node that you want to scale down or are there any nodes with no pods which are not scaled down?

Calico CNI is not a component that we support functionally. While we support provisioning of that component (operator), the actual software is maintained by the Calico community (or Tigera for enterprise support). In general, CNI components are considered to be mission critical and may have specific disruption rules applied.

NicoleY666 commented 1 month ago

Yes, you are correct, but that pod is blocking the node from scaling down because the node is not considered empty (DaemonSet pods excluded). Is there a way to exclude certain deployments so the node can still scale down? The consolidation policy is WhenEmpty, so a node can only be scaled down when nothing but DaemonSet pods are running on it. However, this pod is created as a Deployment controlled by the CalicoOperator add-on, so I can't turn it into a DaemonSet. With the calico-typha pod on the node, Karpenter can't mark the node as empty and scale it down. As the attached screenshots show, there shouldn't be four nodes with less than 20% usage.
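
For what it's worth, here is a rough sketch of one direction this question points at; this is an assumption on my part, not something confirmed in this thread. The v1beta1 NodePool API that Karpenter 0.35.x uses also accepts consolidationPolicy: "WhenUnderutilized", which lets Karpenter consolidate nodes that still run non-DaemonSet pods such as calico-typha, rather than only empty nodes. In that API version consolidateAfter can only be combined with WhenEmpty, so it is omitted below. The variable name and the elided fields are placeholders for the config shown earlier in this issue.

// Sketch only, assuming Karpenter 0.35.x (v1beta1 NodePool API).
const karpenterAddOnUnderutilized = new blueprints.addons.KarpenterAddOn({
  version: '0.35.5',
  nodePoolSpec: {
    requirements: [ /* same requirements as in the original snippet */ ],
    disruption: {
      // WhenUnderutilized also considers non-empty nodes for consolidation.
      consolidationPolicy: "WhenUnderutilized",
      // consolidateAfter is only valid together with WhenEmpty in v1beta1,
      // so it is not set here.
      expireAfter: "72h",
      budgets: [{ nodes: "10%" }]
    }
  },
  ec2NodeClassSpec: { /* same as in the original snippet */ },
  interruptionHandling: true,
  podIdentity: false,
});

Whether letting Karpenter evict calico-typha is acceptable is a separate question, since the Calico operator manages that Deployment and will reschedule it elsewhere.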

[screenshots]

There is also a node running only DaemonSet pods that is not scaled down as expected; I'm not quite sure why.
[screenshot]

Here is the DaemonSet screenshot:
[screenshot]