Panfactum / stack

The Panfactum Stack
https://panfactum.com

[question]: Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim #163

Closed wesbragagt closed 1 month ago

wesbragagt commented 1 month ago

What is your question?

I'm noticing a node in our production cluster, which has been running for 3 days, with the following event: "Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim". Could this be related to https://github.com/Panfactum/stack/issues/127 ?

@mschnee also found this issue which seems related https://github.com/aws/karpenter-provider-aws/issues/6803

Name:               ip-10-0-166-82.us-west-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/instance-type=m6g.medium
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=SPOT
                    eks.amazonaws.com/nodegroup=controllers-20240728225511410500000002
                    eks.amazonaws.com/nodegroup-image=ami-0835c99467c24da9b
                    eks.amazonaws.com/sourceLaunchTemplateId=lt-04000b2f2434662ae
                    eks.amazonaws.com/sourceLaunchTemplateVersion=12
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2b
                    k8s.io/cloud-provider-aws=1eca48abf50de6dbb7b17d2b5d457797
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=ip-10-0-166-82.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=m6g.medium
                    panfactum.com/class=controller
                    topology.ebs.csi.aws.com/zone=us-west-2b
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2b
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.0.166.82
                    csi.volume.kubernetes.io/nodeid:
                      {"ebs.csi.aws.com":"i-028527e376b17a21e","secrets-store.csi.k8s.io":"ip-10-0-166-82.us-west-2.compute.internal"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 11 Oct 2024 08:15:46 -0500
Taints:             arm64=true:NoSchedule
                    burstable=true:NoSchedule
                    controller=true:NoSchedule
                    spot=true:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-166-82.us-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 14 Oct 2024 20:20:39 -0500
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 11 Oct 2024 08:16:14 -0500   Fri, 11 Oct 2024 08:16:14 -0500   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Mon, 14 Oct 2024 20:16:59 -0500   Fri, 11 Oct 2024 08:15:45 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 14 Oct 2024 20:16:59 -0500   Fri, 11 Oct 2024 08:15:45 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 14 Oct 2024 20:16:59 -0500   Fri, 11 Oct 2024 08:15:45 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 14 Oct 2024 20:16:59 -0500   Fri, 11 Oct 2024 08:16:06 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.166.82
  InternalDNS:  ip-10-0-166-82.us-west-2.compute.internal
  Hostname:     ip-10-0-166-82.us-west-2.compute.internal
Capacity:
  cpu:                1
  ephemeral-storage:  40894Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             3880624Ki
  pods:               110
Allocatable:
  cpu:                940m
  ephemeral-storage:  37518678362
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             3163824Ki
  pods:               110
System Info:
  Machine ID:                 ec2ade84bc798e1284d85a506964467e
  System UUID:                ec2ade84-bc79-8e12-84d8-5a506964467e
  Boot ID:                    0ca4357a-7367-45d9-b8c4-7d3c7cae8d98
  Kernel Version:             6.1.109
  OS Image:                   Bottlerocket OS 1.24.0 (aws-k8s-1.29)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.7.22+bottlerocket
  Kubelet Version:            v1.29.5-eks-1109419
  Kube-Proxy Version:         v1.29.5-eks-1109419
ProviderID:                   aws:///us-west-2b/i-028527e376b17a21e
Non-terminated Pods:          (29 in total)
  Namespace                   Name                                                     CPU Requests  CPU Limits    Memory Requests  Memory Limits    Age
  ---------                   ----                                                     ------------  ----------    ---------------  -------------    ---
  alb-controller              alb-controller-dd56b78d6-whkgc                           11m (1%)      100m (10%)    83464877 (2%)    312005503 (9%)   25m
  alloy                       alloy-tnqng                                              34m (3%)      100m (10%)    179272160 (5%)   429137520 (13%)  3d
  argo                        argo-events-controller-manager-68487594-rr7jb            11m (1%)      100m (10%)    88707757 (2%)    334870395 (10%)  81m
  argo                        events-webhook-6d98c7b976-rcc25                          11m (1%)      100m (10%)    46739508 (1%)    267721196 (8%)   25m
  authentik                   redis-4833-node-0                                        56m (5%)      100m (10%)    107425154 (3%)   305613202 (9%)   8m59s
  aws-ebs-csi-driver          ebs-csi-controller-676849595f-rzg5x                      66m (7%)      100m (10%)    177293248 (5%)   465731052 (14%)  79m
  aws-ebs-csi-driver          ebs-csi-node-bwh2g                                       33m (3%)      100m (10%)    81814506 (2%)    323841192 (9%)   6m16s
  cert-manager                cert-manager-cainjector-6f67f8649c-mknt8                 11m (1%)      100m (10%)    155131523 (4%)   397754691 (12%)  6m17s
  cert-manager                cert-manager-webhook-66db579977-xfgnw                    11m (1%)      100m (10%)    34060758 (1%)    240362697 (7%)   25m
  cilium                      cilium-xqp62                                             100m (10%)    0 (0%)        380258472 (11%)  494336013 (15%)  155m
  cloudnative-pg              cloudnative-pg-787ff9548d-lq79d                          11m (1%)      100m (10%)    155131523 (4%)   397754691 (12%)  158m
  external-snapshotter        external-snapshotter-webhook-7d7c8c678d-6hvxg            11m (1%)      100m (10%)    34060758 (1%)    240362697 (7%)   76m
  implentio                   eventbus-default-js-0                                    33m (3%)      100m (10%)    57060758 (1%)    270262697 (8%)   9m17s
  kube-system                 core-dns-664d5dfc4f-bqdxs                                34m (3%)      0 (0%)        99798506 (3%)    129738057 (4%)   93m
  linkerd                     linkerd-identity-69bb59b957-n4zlb                        22m (2%)      100m (10%)    35074998 (1%)    253574998 (7%)   158m
  linkerd                     linkerd-proxy-injector-6d5778cb4d-q9dtl                  11m (1%)      100m (10%)    74030518 (2%)    336804716 (10%)  3h32m
  logging                     loki-backend-2                                           22m (2%)      100m (10%)    307818158 (9%)   474524292 (14%)  150m
  logging                     loki-canary-grzjp                                        11m (1%)      100m (10%)    41496628 (1%)    256845072 (7%)   3d12h
  logging                     loki-read-7f98fd5b98-n6fn7                               23m (2%)      100m (10%)    235870026 (7%)   502714745 (15%)  69m
  logging                     redis-de1a-node-2                                        56m (5%)      100m (10%)    121564556 (3%)   339167634 (10%)  6h5m
  metabase                    pg-bce1-pooler-rw-6c687bcc4-cgktr                        10m (1%)      100m (10%)    60Mi (1%)        280Mi (9%)       5h21m
  monitoring                  node-exporter-gm8dh                                      22m (2%)      0 (0%)        47149996 (1%)    61294994 (1%)    17h
  monitoring                  oauth2-proxy-ec6215c0214caf95-5f5f4c6dbb-bxh25           11m (1%)      100m (10%)    28817878 (0%)    51619017 (1%)    8m29s
  monitoring                  open-telemetry-opentelemetry-operator-9c9f7f65c-t8xwd    22m (2%)      1111m (118%)  83627194 (2%)    355998068 (10%)  85m
  pvc-autoresizer             pvc-autoresizer-controller-5775b9dfff-nntft              23m (2%)      100m (10%)    46739508 (1%)    256845072 (7%)   81m
  secrets-csi                 secrets-csi-h4qgs                                        33m (3%)      264m (28%)    69135756 (2%)    285960194 (8%)   5h25m
  vault                       vault-2                                                  35m (3%)      100m (10%)    258639240 (7%)   532314724 (16%)  85m
  vault                       vault-csi-provider-2dsjz                                 22m (2%)      100m (10%)    58239508 (1%)    271795072 (8%)   25h
  vertical-pod-autoscaler     vpa-admission-controller-5584bfb85d-jcjlw                11m (1%)      100m (10%)    74030518 (2%)    292323385 (9%)   66m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                767m (81%)        3775m (401%)
  memory             3225368550 (99%)  9174874866 (283%)
  ephemeral-storage  200Mi (0%)        200Mi (0%)
  hugepages-1Gi      0 (0%)            0 (0%)
  hugepages-2Mi      0 (0%)            0 (0%)
  hugepages-32Mi     0 (0%)            0 (0%)
  hugepages-64Ki     0 (0%)            0 (0%)
Events:
  Type    Reason             Age                 From       Message
  ----    ------             ----                ----       -------
  Normal  DisruptionBlocked  30s (x40 over 81m)  karpenter  Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim

What primary components of the stack does this relate to?

terraform

wesbragagt commented 1 month ago

Same thing for this node, which has been running for 5 days:

Name:               ip-10-0-101-122.us-west-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/instance-type=m6g.medium
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=SPOT
                    eks.amazonaws.com/nodegroup=controllers-20240728225511410500000002
                    eks.amazonaws.com/nodegroup-image=ami-0835c99467c24da9b
                    eks.amazonaws.com/sourceLaunchTemplateId=lt-04000b2f2434662ae
                    eks.amazonaws.com/sourceLaunchTemplateVersion=12
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2a
                    k8s.io/cloud-provider-aws=1eca48abf50de6dbb7b17d2b5d457797
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=ip-10-0-101-122.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=m6g.medium
                    panfactum.com/class=controller
                    topology.ebs.csi.aws.com/zone=us-west-2a
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2a
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.0.101.122
                    csi.volume.kubernetes.io/nodeid:
                      {"ebs.csi.aws.com":"i-017fe4c94f695979a","secrets-store.csi.k8s.io":"ip-10-0-101-122.us-west-2.compute.internal"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 09 Oct 2024 20:12:00 -0500
Taints:             arm64=true:NoSchedule
                    burstable=true:NoSchedule
                    controller=true:NoSchedule
                    spot=true:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-101-122.us-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 14 Oct 2024 20:23:57 -0500
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 20:12:29 -0500   Wed, 09 Oct 2024 20:12:29 -0500   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Mon, 14 Oct 2024 20:20:38 -0500   Wed, 09 Oct 2024 20:12:00 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 14 Oct 2024 20:20:38 -0500   Wed, 09 Oct 2024 20:12:00 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 14 Oct 2024 20:20:38 -0500   Wed, 09 Oct 2024 20:12:00 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 14 Oct 2024 20:20:38 -0500   Wed, 09 Oct 2024 20:12:21 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.101.122
  InternalDNS:  ip-10-0-101-122.us-west-2.compute.internal
  Hostname:     ip-10-0-101-122.us-west-2.compute.internal
Capacity:
  cpu:                1
  ephemeral-storage:  40894Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             3880624Ki
  pods:               110
Allocatable:
  cpu:                940m
  ephemeral-storage:  37518678362
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             3163824Ki
  pods:               110
System Info:
  Machine ID:                 ec29c6b8ea0be97455edfd717a973a17
  System UUID:                ec29c6b8-ea0b-e974-55ed-fd717a973a17
  Boot ID:                    d62b42a7-89cd-45d7-8e4d-3f1c863b4ffc
  Kernel Version:             6.1.109
  OS Image:                   Bottlerocket OS 1.24.0 (aws-k8s-1.29)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.7.22+bottlerocket
  Kubelet Version:            v1.29.5-eks-1109419
  Kube-Proxy Version:         v1.29.5-eks-1109419
ProviderID:                   aws:///us-west-2a/i-017fe4c94f695979a
Non-terminated Pods:          (20 in total)
  Namespace                   Name                                             CPU Requests  CPU Limits    Memory Requests  Memory Limits    Age
  ---------                   ----                                             ------------  ----------    ---------------  -------------    ---
  alloy                       alloy-lg4gp                                      34m (3%)      100m (10%)    179272160 (5%)   429137520 (13%)  3d3h
  authentik                   redis-4833-node-2                                56m (5%)      100m (10%)    107425154 (3%)   305613202 (9%)   9h
  aws-ebs-csi-driver          ebs-csi-node-jxj6c                               33m (3%)      100m (10%)    81814506 (2%)    323841192 (9%)   9m33s
  cicd                        eventbus-default-js-2                            33m (3%)      100m (10%)    57060758 (1%)    270262697 (8%)   162m
  cilium                      cilium-w9vwz                                     100m (10%)    0 (0%)        380258472 (11%)  494336013 (15%)  158m
  external-snapshotter        external-snapshotter-webhook-7d7c8c678d-ltx7c    11m (1%)      100m (10%)    34060758 (1%)    240362697 (7%)   85m
  implentio                   eventbus-default-js-2                            33m (3%)      100m (10%)    57060758 (1%)    270262697 (8%)   161m
  ingress-nginx               nginx-controller-dc497bb5d-tzklx                 11m (1%)      100m (10%)    307649972 (9%)   596028675 (18%)  162m
  karpenter                   karpenter-6bc74b4d46-kf759                       271m (28%)    100m (10%)    559347396 (17%)  923235326 (28%)  85m
  linkerd                     linkerd-proxy-injector-6d5778cb4d-7rh44          11m (1%)      100m (10%)    88707757 (2%)    366159194 (11%)  79m
  logging                     loki-canary-dwxs5                                11m (1%)      100m (10%)    41496628 (1%)    256845072 (7%)   2d17h
  monitoring                  alertmanager-monitoring-1                        10m (1%)      100m (10%)    210Mi (6%)       460Mi (14%)      85m
  monitoring                  node-exporter-tt9lv                              22m (2%)      0 (0%)        47149996 (1%)    61294994 (1%)    17h
  monitoring                  thanos-bucketweb-84864f97d5-x5bhh                11m (1%)      100m (10%)    60052196 (1%)    274151566 (8%)   79m
  reloader                    reloader-5ffcbc999b-t7qcc                        11m (1%)      100m (10%)    210Mi (6%)       460Mi (14%)      3h35m
  scheduler                   scheduler-596bc85f59-qjwg2                       11m (1%)      100m (10%)    120300511 (3%)   484252077 (14%)  49m
  secrets-csi                 secrets-csi-8rjqz                                33m (3%)      264m (28%)    69135756 (2%)    285960194 (8%)   5h30m
  vault                       vault-0                                          35m (3%)      100m (10%)    258639240 (7%)   532314724 (16%)  85m
  vault                       vault-csi-provider-blbmw                         22m (2%)      100m (10%)    58239508 (1%)    271795072 (8%)   25h
  velero                      velero-5499cbcbcc-58r7n                          23m (2%)      2300m (244%)  266Mi (8%)       712Mi (23%)      4d23h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                782m (83%)        4164m (442%)
  memory             3226994662 (99%)  8097128944 (249%)
  ephemeral-storage  100Mi (0%)        100Mi (0%)
  hugepages-1Gi      0 (0%)            0 (0%)
  hugepages-2Mi      0 (0%)            0 (0%)
  hugepages-32Mi     0 (0%)            0 (0%)
  hugepages-64Ki     0 (0%)            0 (0%)
Events:
  Type    Reason             Age                  From       Message
  ----    ------             ----                 ----       -------
  Normal  DisruptionBlocked  101s (x41 over 84m)  karpenter  Cannot disrupt Node: state node doesn't contain both a node and a nodeclaim

fullykubed commented 1 month ago

This is normal: these are EKS node group nodes, not nodes managed by Karpenter. You can tell by the eks.amazonaws.com/nodegroup label.

Karpenter still sees them but issues a DisruptionBlocked event because it will not disrupt nodes that it did not create (that is, nodes without a NodeClaim).
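
As a rough illustration of the distinction above, here is a minimal Python sketch that guesses who manages a node from its labels. It assumes the labels have already been loaded into a dict (for example from the JSON output of kubectl). The function name classify_node is hypothetical, and the karpenter.sh/nodepool label is an assumption about how recent Karpenter versions label the nodes they provision; it does not appear in the output pasted in this thread.

```python
def classify_node(labels: dict[str, str]) -> str:
    """Guess which system manages a node, based only on its labels."""
    # EKS managed node groups stamp this label on their nodes
    # (visible in the `kubectl describe node` output in this issue).
    if "eks.amazonaws.com/nodegroup" in labels:
        return "eks-node-group"
    # Assumption: Karpenter-provisioned nodes carry this label in recent versions.
    if "karpenter.sh/nodepool" in labels:
        return "karpenter"
    return "unknown"


# Labels copied from the first node in this issue (subset).
labels = {
    "eks.amazonaws.com/nodegroup": "controllers-20240728225511410500000002",
    "panfactum.com/class": "controller",
}
print(classify_node(labels))  # prints "eks-node-group"
```

A node classified as "eks-node-group" here is the kind that would be expected to show the DisruptionBlocked event, since Karpenter has no NodeClaim for it.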

Why do we have EKS node group nodes? Karpenter (and a handful of other controllers) cannot run on nodes managed by Karpenter.

Why do these nodes last for several days? These nodes are only replaced when the underlying OS receives an update and you re-run terragrunt apply.

What does this mean for you? In later versions of the Stack, these nodes are tainted with controller=true, so your pods will never be scheduled on them unless they tolerate this taint. As a result, their idiosyncrasies should never impact you. See https://panfactum.com/docs/edge/guides/deploying-workloads/basics#node-classes for more information.
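
To make the taint behavior concrete, the following Python sketch models how tolerations are matched against taints. It is a simplified approximation of the Kubernetes scheduling rule, not the scheduler's actual code; the Taint and Toleration classes and the function names are hypothetical. The taints are the four shown on the nodes in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key + operator "Exists" matches every taint
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key == "":
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def can_schedule(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """A pod may be scheduled only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


# Taints taken from the nodes in this issue.
node_taints = [
    Taint("arm64", "true", "NoSchedule"),
    Taint("burstable", "true", "NoSchedule"),
    Taint("controller", "true", "NoSchedule"),
    Taint("spot", "true", "NoSchedule"),
]

ordinary_pod = []  # no tolerations at all
controller_pod = [Toleration(key=t.key, value="true", effect="NoSchedule") for t in node_taints]

print(can_schedule(ordinary_pod, node_taints))    # prints "False"
print(can_schedule(controller_pod, node_taints))  # prints "True"
```

Under this model, a workload without a toleration for controller=true is never placed on these nodes, which is why their behavior should not affect ordinary pods.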