Closed: wesbragagt closed this issue 1 month ago
Same thing for this node, which has been running for 5 days:
Name: ip-10-0-101-122.us-west-2.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=arm64
beta.kubernetes.io/instance-type=m6g.medium
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=SPOT
eks.amazonaws.com/nodegroup=controllers-20240728225511410500000002
eks.amazonaws.com/nodegroup-image=ami-0835c99467c24da9b
eks.amazonaws.com/sourceLaunchTemplateId=lt-04000b2f2434662ae
eks.amazonaws.com/sourceLaunchTemplateVersion=12
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2a
k8s.io/cloud-provider-aws=1eca48abf50de6dbb7b17d2b5d457797
kubernetes.io/arch=arm64
kubernetes.io/hostname=ip-10-0-101-122.us-west-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=m6g.medium
panfactum.com/class=controller
topology.ebs.csi.aws.com/zone=us-west-2a
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2a
Annotations: alpha.kubernetes.io/provided-node-ip: 10.0.101.122
csi.volume.kubernetes.io/nodeid:
{"ebs.csi.aws.com":"i-017fe4c94f695979a","secrets-store.csi.k8s.io":"ip-10-0-101-122.us-west-2.compute.internal"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 09 Oct 2024 20:12:00 -0500
Taints: arm64=true:NoSchedule
burstable=true:NoSchedule
controller=true:NoSchedule
spot=true:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-10-0-101-122.us-west-2.compute.internal
AcquireTime: <unset>
RenewTime: Mon, 14 Oct 2024 20:23:57 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 09 Oct 2024 20:12:29 -0500 Wed, 09 Oct 2024 20:12:29 -0500 CiliumIsUp Cilium is running on this node
MemoryPressure False Mon, 14 Oct 2024 20:20:38 -0500 Wed, 09 Oct 2024 20:12:00 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 14 Oct 2024 20:20:38 -0500 Wed, 09 Oct 2024 20:12:00 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 14 Oct 2024 20:20:38 -0500 Wed, 09 Oct 2024 20:12:00 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 14 Oct 2024 20:20:38 -0500 Wed, 09 Oct 2024 20:12:21 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.101.122
InternalDNS: ip-10-0-101-122.us-west-2.compute.internal
Hostname: ip-10-0-101-122.us-west-2.compute.internal
Capacity:
cpu: 1
ephemeral-storage: 40894Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 3880624Ki
pods: 110
Allocatable:
cpu: 940m
ephemeral-storage: 37518678362
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 3163824Ki
pods: 110
System Info:
Machine ID: ec29c6b8ea0be97455edfd717a973a17
System UUID: ec29c6b8-ea0b-e974-55ed-fd717a973a17
Boot ID: d62b42a7-89cd-45d7-8e4d-3f1c863b4ffc
Kernel Version: 6.1.109
OS Image: Bottlerocket OS 1.24.0 (aws-k8s-1.29)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.7.22+bottlerocket
Kubelet Version: v1.29.5-eks-1109419
Kube-Proxy Version: v1.29.5-eks-1109419
ProviderID: aws:///us-west-2a/i-017fe4c94f695979a
Non-terminated Pods: (20 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
alloy alloy-lg4gp 34m (3%) 100m (10%) 179272160 (5%) 429137520 (13%) 3d3h
authentik redis-4833-node-2 56m (5%) 100m (10%) 107425154 (3%) 305613202 (9%) 9h
aws-ebs-csi-driver ebs-csi-node-jxj6c 33m (3%) 100m (10%) 81814506 (2%) 323841192 (9%) 9m33s
cicd eventbus-default-js-2 33m (3%) 100m (10%) 57060758 (1%) 270262697 (8%) 162m
cilium cilium-w9vwz 100m (10%) 0 (0%) 380258472 (11%) 494336013 (15%) 158m
external-snapshotter external-snapshotter-webhook-7d7c8c678d-ltx7c 11m (1%) 100m (10%) 34060758 (1%) 240362697 (7%) 85m
implentio eventbus-default-js-2 33m (3%) 100m (10%) 57060758 (1%) 270262697 (8%) 161m
ingress-nginx nginx-controller-dc497bb5d-tzklx 11m (1%) 100m (10%) 307649972 (9%) 596028675 (18%) 162m
karpenter karpenter-6bc74b4d46-kf759 271m (28%) 100m (10%) 559347396 (17%) 923235326 (28%) 85m
linkerd linkerd-proxy-injector-6d5778cb4d-7rh44 11m (1%) 100m (10%) 88707757 (2%) 366159194 (11%) 79m
logging loki-canary-dwxs5 11m (1%) 100m (10%) 41496628 (1%) 256845072 (7%) 2d17h
monitoring alertmanager-monitoring-1 10m (1%) 100m (10%) 210Mi (6%) 460Mi (14%) 85m
monitoring node-exporter-tt9lv 22m (2%) 0 (0%) 47149996 (1%) 61294994 (1%) 17h
monitoring thanos-bucketweb-84864f97d5-x5bhh 11m (1%) 100m (10%) 60052196 (1%) 274151566 (8%) 79m
reloader reloader-5ffcbc999b-t7qcc 11m (1%) 100m (10%) 210Mi (6%) 460Mi (14%) 3h35m
scheduler scheduler-596bc85f59-qjwg2 11m (1%) 100m (10%) 120300511 (3%) 484252077 (14%) 49m
secrets-csi secrets-csi-8rjqz 33m (3%) 264m (28%) 69135756 (2%) 285960194 (8%) 5h30m
vault vault-0 35m (3%) 100m (10%) 258639240 (7%) 532314724 (16%) 85m
vault vault-csi-provider-blbmw 22m (2%) 100m (10%) 58239508 (1%) 271795072 (8%) 25h
velero velero-5499cbcbcc-58r7n 23m (2%) 2300m (244%) 266Mi (8%) 712Mi (23%) 4d23h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 782m (83%) 4164m (442%)
memory 3226994662 (99%) 8097128944 (249%)
ephemeral-storage 100Mi (0%) 100Mi (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
hugepages-32Mi 0 (0%) 0 (0%)
hugepages-64Ki 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DisruptionBlocked 101s (x41 over 84m) karpenter Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim
This is normal, as these are EKS node group nodes rather than nodes managed by Karpenter. You can tell by the eks.amazonaws.com/nodegroup label (shown in the Labels section above). Karpenter still sees them, but it emits a DisruptionBlocked event because it will not disrupt nodes that it did not create (i.e., nodes without a NodeClaim).
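If you want to verify this in your own cluster, you can split the nodes by those labels. This is a minimal sketch; the karpenter.sh/nodepool label assumes a recent Karpenter release (older versions used karpenter.sh/provisioner-name):

# Nodes created by EKS managed node groups carry the nodegroup label
kubectl get nodes -l eks.amazonaws.com/nodegroup

# Nodes provisioned by Karpenter carry a NodePool label instead
kubectl get nodes -l karpenter.sh/nodepool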
Why do we have EKS node group nodes? Karpenter (and a handful of other controllers) cannot run on nodes that Karpenter itself manages, so the cluster needs a small set of statically provisioned nodes to bootstrap those controllers.
Why do these nodes last for several days? These nodes are only replaced when the underlying OS receives an update and you re-run terragrunt apply.
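As a rough sketch, that re-run happens in whichever module defines the node groups; the directory below is purely illustrative and depends on your repository layout:

cd environments/production/us-west-2/aws_eks   # hypothetical module path
terragrunt apply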
What does this mean for you? In later versions of the Stack, these nodes are tainted with controller=true, so your pods will never be scheduled on them unless they tolerate this taint. As a result, their idiosyncrasies should never impact you. See https://panfactum.com/docs/edge/guides/deploying-workloads/basics#node-classes for more information.
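To confirm which taints a given node carries (and therefore whether an ordinary pod could ever be scheduled there without a matching toleration), you can read them straight off the node spec, for example:

kubectl get node ip-10-0-101-122.us-west-2.compute.internal \
  -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'

For the node described above this prints arm64=true:NoSchedule, burstable=true:NoSchedule, controller=true:NoSchedule, and spot=true:NoSchedule, matching the Taints section of the output.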
What is your question?
I'm noticing a node in our production cluster that has been running for 3 days with the following event:
Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim
Could this be related to https://github.com/Panfactum/stack/issues/127? @mschnee also found this issue, which seems related: https://github.com/aws/karpenter-provider-aws/issues/6803
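As a side note, one way to list every node currently emitting this event (assuming the events have not yet expired from the cluster) is to filter events by reason:

kubectl get events -A --field-selector reason=DisruptionBlocked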
What primary components of the stack does this relate to?
terraform