WxFang opened this issue 3 months ago
@jonathan-innis Thanks for joining the meeting, Jonathan! :) This ticket is to provide more details on unconsolidated empty nodes I mentioned in the meeting today.
Do you have any PDBs or pods with do-not-evict annotations? Can you provide any Karpenter logs? Please share events for the node as well.
No PDBs for these daemonsets, and I didn't find any node events. Karpenter manages 3k+ nodes in this cluster, and I see hundreds of nodes like this: empty except for daemonsets, with no events on the nodes or the NodeClaims. The scheduling queue looks fine (< 100) and the scheduling simulation is running. There is no related logging. It seems like the cluster is too big for Karpenter to scan all of the nodes. These nodes do eventually get consolidated after several hours, but obviously we expect much faster turnaround.
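For anyone reproducing the triage, this is roughly the set of checks behind the answer above. It is only a sketch: it assumes Karpenter runs as a deployment named karpenter in the kube-system namespace (adjust to your install) and it checks both the current karpenter.sh/do-not-disrupt annotation and the older do-not-evict one.

NODE=ip-172-18-138-50.us-west-2.compute.internal

# Any pods on the node carrying a do-not-disrupt / do-not-evict annotation?
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true"
            or .metadata.annotations["karpenter.sh/do-not-evict"] == "true")
      | .metadata.namespace + "/" + .metadata.name'

# Events recorded against the node, and the NodeClaim backing it
# (the NODE column in the nodeclaim listing should match the node name).
kubectl get events -A --field-selector involvedObject.name="$NODE"
kubectl get nodeclaims | grep "$NODE"

# Anything consolidation-related in the controller logs over the last hour
# (deployment name/namespace are assumptions, adjust to your install).
kubectl logs -n kube-system deployment/karpenter --since=1h | grep -i consolidat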
The original nodes were deleted. Here is a new one.
➜ ~ kubectl describe node ip-172-18-138-50.us-west-2.compute.internal
Name: ip-172-18-138-50.us-west-2.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=c7i.48xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2b
k8s.io/cloud-provider-aws=5f7d5c3f339ac7f902cf53fa00268999
karpenter.k8s.aws/instance-category=c
karpenter.k8s.aws/instance-cpu=192
karpenter.k8s.aws/instance-cpu-manufacturer=intel
karpenter.k8s.aws/instance-ebs-bandwidth=40000
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=c7i
karpenter.k8s.aws/instance-generation=7
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-memory=393216
karpenter.k8s.aws/instance-network-bandwidth=50000
karpenter.k8s.aws/instance-size=48xlarge
karpenter.sh/capacity-type=on-demand
karpenter.sh/initialized=true
karpenter.sh/nodepool=compute
karpenter.sh/registered=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-18-138-50.us-west-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=c7i.48xlarge
nodepool=compute
topology.ebs.csi.aws.com/zone=us-west-2b
topology.k8s.aws/zone-id=usw2-az1
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: alpha.kubernetes.io/provided-node-ip: 172.18.138.50
csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0b9ff61d337e5ade8"}
karpenter.k8s.aws/ec2nodeclass-hash: 7017015924687006381
karpenter.k8s.aws/ec2nodeclass-hash-version: v2
karpenter.sh/nodepool-hash: 5453664062116956107
karpenter.sh/nodepool-hash-version: v2
node.alpha.kubernetes.io/ttl: 300
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 26 Jul 2024 18:14:37 -0700
Taints: compute=true:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-172-18-138-50.us-west-2.compute.internal
AcquireTime: <unset>
RenewTime: Tue, 30 Jul 2024 10:26:42 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 30 Jul 2024 10:21:59 -0700 Fri, 26 Jul 2024 18:14:36 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 30 Jul 2024 10:21:59 -0700 Fri, 26 Jul 2024 18:14:36 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 30 Jul 2024 10:21:59 -0700 Fri, 26 Jul 2024 18:14:36 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 30 Jul 2024 10:21:59 -0700 Fri, 26 Jul 2024 18:14:55 -0700 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.18.138.50
InternalDNS: ip-172-18-138-50.us-west-2.compute.internal
Hostname: ip-172-18-138-50.us-west-2.compute.internal
Capacity:
cpu: 192
ephemeral-storage: 104845292Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 389937584Ki
pods: 737
Allocatable:
cpu: 191450m
ephemeral-storage: 95551679124
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 381272496Ki
pods: 737
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system aws-node-cglsb 50m (0%) 0 (0%) 0 (0%) 0 (0%) 3d16h
kube-system ebs-csi-node-b88ld 30m (0%) 0 (0%) 120Mi (0%) 768Mi (0%) 3d16h
kube-system kube-proxy-fs9bm 100m (0%) 0 (0%) 0 (0%) 0 (0%) 3d16h
logging vector-w7gqf 100m (0%) 3 (1%) 256Mi (0%) 1Gi (0%) 3d16h
services jaeger-service-agent-daemonset-b8qbr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d16h
services otel-traceagent-opentelemetry-collector-agent-cmmn4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d16h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 280m (0%) 3 (1%)
memory 376Mi (0%) 1792Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
➜ ~ kubectl get -A pod --field-selector spec.nodeName=ip-172-18-138-50.us-west-2.compute.internal | grep Running | awk '{print "kubectl get pod -n " $1 " " $2 " -o json"}' | bash | jq '.metadata'
{
"annotations": {
"artifact.spinnaker.io/location": "kube-system",
"artifact.spinnaker.io/name": "aws-node",
"artifact.spinnaker.io/type": "kubernetes/daemonSet",
"artifact.spinnaker.io/version": "",
"moniker.spinnaker.io/application": "aws-cni",
"moniker.spinnaker.io/cluster": "daemonSet aws-node",
"prometheus.io/instance": "default",
"prometheus.io/port": "61678",
"prometheus.io/scrape": "true"
},
"creationTimestamp": "2024-07-27T01:14:37Z",
"generateName": "aws-node-",
"labels": {
"app.kubernetes.io/instance": "aws-vpc-cni",
"app.kubernetes.io/managed-by": "spinnaker",
"app.kubernetes.io/name": "aws-node",
"apple_usr_app_name": "aws-node",
"controller-revision-hash": "64cf77bcd8",
"k8s-app": "aws-node",
"pod-template-generation": "15"
},
"name": "aws-node-cglsb",
"namespace": "kube-system",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "aws-node",
"uid": "2e207144-33f5-40f7-bd18-3d2e38607785"
}
],
"resourceVersion": "15242243124",
"uid": "b6609dc8-6723-4450-a255-8f6c771b460e"
}
{
"creationTimestamp": "2024-07-27T01:14:37Z",
"generateName": "ebs-csi-node-",
"labels": {
"app": "ebs-csi-node",
"app.kubernetes.io/component": "csi-driver",
"app.kubernetes.io/instance": "aws-ebs-csi-driver",
"app.kubernetes.io/managed-by": "Helm",
"app.kubernetes.io/name": "aws-ebs-csi-driver",
"app.kubernetes.io/version": "1.28.0",
"controller-revision-hash": "7f445f486",
"helm.sh/chart": "aws-ebs-csi-driver-2.28.1",
"pod-template-generation": "11"
},
"name": "ebs-csi-node-b88ld",
"namespace": "kube-system",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "ebs-csi-node",
"uid": "44c1bf19-64dd-44a3-a4dc-3d650935d4b6"
}
],
"resourceVersion": "15242238710",
"uid": "5e035eb3-683e-4ca2-8ea6-e876eed24253"
}
{
"annotations": {
"artifact.spinnaker.io/location": "kube-system",
"artifact.spinnaker.io/name": "kube-proxy",
"artifact.spinnaker.io/type": "kubernetes/daemonSet",
"artifact.spinnaker.io/version": "",
"moniker.spinnaker.io/application": "kube-proxy",
"moniker.spinnaker.io/cluster": "daemonSet kube-proxy",
"prometheus.io/instance": "default",
"prometheus.io/port": "10249",
"prometheus.io/scrape": "true"
},
"creationTimestamp": "2024-07-27T01:14:37Z",
"generateName": "kube-proxy-",
"labels": {
"app.kubernetes.io/managed-by": "spinnaker",
"app.kubernetes.io/name": "kube-proxy",
"apple_usr_app_name": "kube-proxy",
"controller-revision-hash": "5dd844d6cd",
"k8s-app": "kube-proxy",
"pod-template-generation": "7"
},
"name": "kube-proxy-fs9bm",
"namespace": "kube-system",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "kube-proxy",
"uid": "81fc1247-81e7-4882-aae1-57900c696952"
}
],
"resourceVersion": "15242236239",
"uid": "d371f8c2-a3d6-4c40-b987-eaaf8f5e253f"
}
{
"annotations": {
"cluster-autoscaler.kubernetes.io/safe-to-evict": "true",
"prometheus.io/port": "9090",
"prometheus.io/scrape": "true"
},
"creationTimestamp": "2024-07-27T01:14:37Z",
"generateName": "vector-",
"labels": {
"app": "vector",
"app.kubernetes.io/component": "agent",
"app.kubernetes.io/instance": "vector",
"app.kubernetes.io/name": "vector",
"controller-revision-hash": "978979dcd",
"pod-template-generation": "2",
"vector.dev/exclude": "false"
},
"name": "vector-w7gqf",
"namespace": "logging",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "vector",
"uid": "93a9193c-0856-4a51-b904-fd51ef0cf340"
}
],
"resourceVersion": "15242261286",
"uid": "05ffc310-52ee-49e4-ae26-363ecd3c7ca5"
}
{
"annotations": {
"linkerd.io/inject": "disabled",
"prometheus.io/port": "14271",
"prometheus.io/scrape": "true",
"sidecar.istio.io/inject": "false"
},
"creationTimestamp": "2024-07-27T01:14:37Z",
"generateName": "jaeger-service-agent-daemonset-",
"labels": {
"app": "jaeger",
"app.kubernetes.io/component": "agent",
"app.kubernetes.io/instance": "jaeger-service",
"app.kubernetes.io/managed-by": "jaeger-operator",
"app.kubernetes.io/name": "jaeger-service-agent",
"app.kubernetes.io/part-of": "jaeger",
"apple_usr_app_id": "monitoring",
"apple_usr_app_name": "jaeger",
"controller-revision-hash": "86755ddb5",
"pod-template-generation": "3"
},
"name": "jaeger-service-agent-daemonset-b8qbr",
"namespace": "services",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "jaeger-service-agent-daemonset",
"uid": "a131462d-145e-49c1-9461-dca6e297e985"
}
],
"resourceVersion": "15242236317",
"uid": "9edca4d3-6462-4c8c-bc47-1c7b6b0d621e"
}
{
"annotations": {
"artifact.spinnaker.io/location": "services",
"artifact.spinnaker.io/name": "otel-traceagent-opentelemetry-collector-agent",
"artifact.spinnaker.io/type": "kubernetes/daemonSet",
"artifact.spinnaker.io/version": "",
"checksum/config": "4dcf49020200023fb25d1b73d42a42305493af3ba364678cd24b0917f4356260",
"moniker.spinnaker.io/application": "opentelemetry-collector",
"moniker.spinnaker.io/cluster": "daemonSet otel-traceagent-opentelemetry-collector-agent",
"prometheus.io/port": "8888",
"prometheus.io/scrape": "true"
},
"creationTimestamp": "2024-07-27T01:14:37Z",
"generateName": "otel-traceagent-opentelemetry-collector-agent-",
"labels": {
"app": "otel-traceagent",
"app.kubernetes.io/instance": "otel-traceagent",
"app.kubernetes.io/managed-by": "spinnaker",
"app.kubernetes.io/name": "opentelemetry-collector",
"component": "agent-collector",
"controller-revision-hash": "866cd4858f",
"pod-template-generation": "9",
"serviceSelector": "otel-traceagent"
},
"name": "otel-traceagent-opentelemetry-collector-agent-cmmn4",
"namespace": "services",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "otel-traceagent-opentelemetry-collector-agent",
"uid": "5893ff74-b3d7-4672-a820-d35e144d4d2d"
}
],
"resourceVersion": "15242236463",
"uid": "2ef661b7-cea7-47ba-bd15-c832af8e1552"
}
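To quantify how many nodes are stuck like this, here is a rough one-off sketch that lists nodes whose only running pods are DaemonSet-owned. It is just an approximation for counting (not how Karpenter tracks emptiness internally), and it dumps every pod in the cluster, so it is slow at 3k+ nodes.

# Print nodes where every running pod is owned by a DaemonSet, then count them.
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(.status.phase == "Running" and .spec.nodeName != null)
      | [.spec.nodeName,
         (if ([.metadata.ownerReferences[]?.kind] | index("DaemonSet")) != null
          then "daemonset" else "workload" end)]
      | @tsv' \
  | awk '{ total[$1]++; if ($2 == "daemonset") ds[$1]++ }
         END { for (n in total) if (total[n] == ds[n]) print n }' \
  | wc -l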
Description
Observed Behavior: Nodes have been running for 15h without any actual workloads; only daemonset pods are running on them.
Expected Behavior: Karpenter deletes the underutilized nodes.
Reproduction Steps (Please include YAML):
Nodepool Spec
Nodeclaim Spec
Versions:
Chart Version: 0.37.0
Kubernetes Version (kubectl version): 1.29
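For completeness, the settings that govern how quickly empty nodes are removed live in the NodePool's disruption block (consolidationPolicy, consolidateAfter, and any budgets) under the karpenter.sh/v1beta1 API shipped with chart 0.37.0. A sketch of pulling them, using the nodepool name compute taken from the node's karpenter.sh/nodepool label above:

# Dump the disruption settings for the "compute" NodePool.
kubectl get nodepool compute -o json | jq '.spec.disruption'

A tight budget (for example a small nodes percentage or absolute count) caps how many nodes can be disrupted at once, which could by itself stretch the drain-down of hundreds of empty nodes over several hours.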