Panfactum / stack

The Panfactum Stack
https://panfactum.com

[Bug]: kube-fledged pod pending to be scheduled "1 Too many pods" #129

Closed: uptownhr closed this issue 2 months ago

uptownhr commented 2 months ago


What happened?

I recently upgraded to edge.24-09-04 and, after experimenting with suspending and restoring the EKS nodes, noticed a few pods lingering in the Pending state.

The cluster as a whole stayed at 25% utilization, so the pending pods should have been rescheduled. I confirmed that the node the pod was targeting is only 40% utilized. However, I noticed that the node's pod capacity is 8, with 6 pods already running.

Steps to Reproduce

  1. Upgrade to edge.24-09-04
  2. Destroy and bring back the EKS nodes

Relevant log output

Reviewing the pod's details shows this event message:

0/7 nodes are available: 1 Too many pods, 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/7 nodes are available: 1 No preemption victims found for incoming pod, 6 Preemption is not helpful for scheduling.
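
To cross-check the node's pod capacity against what is actually scheduled there, a minimal sketch (assuming kubectl access; the node name is taken from the manifest below):

# Allocatable pod slots on the node (reports 8 here)
kubectl get node ip-10-0-185-96.us-east-2.compute.internal \
  -o jsonpath='{.status.allocatable.pods}{"\n"}'

# All pods currently scheduled to that node, DaemonSets included
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=ip-10-0-185-96.us-east-2.compute.internal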

Pods running on node

[screenshot]

Node manifest:

apiVersion: v1
kind: Node
metadata:
  annotations:
    alpha.kubernetes.io/provided-node-ip: 10.0.185.96
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "9225586735335466555"
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0ed85ad170df0d43a","secrets-store.csi.k8s.io":"ip-10-0-185-96.us-east-2.compute.internal"}'
    karpenter.k8s.aws/ec2nodeclass-hash: "1750609685124699303"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.sh/nodepool-hash: "7261467632703008228"
    karpenter.sh/nodepool-hash-version: v3
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2024-09-10T13:45:14Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    beta.kubernetes.io/arch: arm64
    beta.kubernetes.io/instance-type: r6g.medium
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2b
    k8s.io/cloud-provider-aws: 1eca48abf50de6dbb7b17d2b5d457797
    karpenter.k8s.aws/instance-category: r
    karpenter.k8s.aws/instance-cpu: "1"
    karpenter.k8s.aws/instance-cpu-manufacturer: aws
    karpenter.k8s.aws/instance-ebs-bandwidth: "4750"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: r6g
    karpenter.k8s.aws/instance-generation: "6"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "8192"
    karpenter.k8s.aws/instance-network-bandwidth: "500"
    karpenter.k8s.aws/instance-size: medium
    karpenter.sh/capacity-type: spot
    karpenter.sh/initialized: "true"
    karpenter.sh/nodepool: spot-arm
    karpenter.sh/registered: "true"
    kubernetes.io/arch: arm64
    kubernetes.io/hostname: ip-10-0-185-96.us-east-2.compute.internal
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: r6g.medium
    panfactum.com/class: spot
    topology.ebs.csi.aws.com/zone: us-east-2b
    topology.k8s.aws/zone-id: use2-az2
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
  name: ip-10-0-185-96.us-east-2.compute.internal
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodeClaim
    name: spot-arm-g8c8v
    uid: de1ab30d-c1ab-41b7-aa4f-bf37f3cfb7ef
  resourceVersion: "41377552"
  uid: 881dff9a-5b10-4cf4-9a09-010a09b5ada2
spec:
  providerID: aws:///us-east-2b/i-0ed85ad170df0d43a
  taints:
  - effect: NoSchedule
    key: arm64
    value: "true"
  - effect: NoSchedule
    key: spot
    value: "true"
status:
  addresses:
  - address: 10.0.185.96
    type: InternalIP
  - address: ip-10-0-185-96.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-185-96.us-east-2.compute.internal
    type: Hostname
  allocatable:
    cpu: 940m
    ephemeral-storage: "37518678362"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 7278428Ki
    pods: "8"
  capacity:
    cpu: "1"
    ephemeral-storage: 40894Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 7995228Ki
    pods: "8"
  conditions:
  - lastHeartbeatTime: "2024-09-10T13:45:45Z"
    lastTransitionTime: "2024-09-10T13:45:45Z"
    message: Cilium is running on this node
    reason: CiliumIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2024-09-10T14:32:53Z"
    lastTransitionTime: "2024-09-10T13:45:14Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2024-09-10T14:32:53Z"
    lastTransitionTime: "2024-09-10T13:45:14Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2024-09-10T14:32:53Z"
    lastTransitionTime: "2024-09-10T13:45:14Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2024-09-10T14:32:53Z"
    lastTransitionTime: "2024-09-10T13:45:38Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/t8f0s7h5/panfactum@sha256:82de8667d5accbbe979f09a9dea86a2465255c283688e8522f3a0c11a8256a5c
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/t8f0s7h5/panfactum:a06f797c280dd78132d322dbf5dd416955857d13
    sizeBytes: 1105032518
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/cloudnative-pg/postgresql@sha256:82827bc9bc5ca7df1d7f7d4813444e0e7a8e32633ad72c5c66ad2be72c3b2095
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/cloudnative-pg/postgresql:16.2-10
    sizeBytes: 212503687
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/quay/cilium/cilium@sha256:bfeb3f1034282444ae8c498dca94044df2b9c9c8e7ac678e0b43c849f0b31746
    sizeBytes: 195832613
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/cloudnative-pg/pgbouncer@sha256:033a2e0470365215da6cbe78d7045a45e7687dfeb4b422c1c8e8c5c58745dea0
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/cloudnative-pg/pgbouncer:1.22.1
    sizeBytes: 128543478
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/docker-hub/hashicorp/vault@sha256:865a6d19531c51398ae02c3e013e7e42d9b424a12c25dcf5e3e767987275c4e3
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/docker-hub/hashicorp/vault:1.14.7
    sizeBytes: 127786995
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/ingress-nginx/controller@sha256:42b3f0e5d0846876b1791cd3afeb5f1cbbe4259d6f35651dcc1b5c980925379c
    sizeBytes: 95119480
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/csi-secrets-store/driver@sha256:a5b03f73c89d2c2e72e235924a7a8712c8ccffb1d15d7f6804ab5cbd077eacc8
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/csi-secrets-store/driver:v1.4.2
    sizeBytes: 63114096
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/quay/argoproj/argoexec@sha256:32a568bd1ecb2691a61aa4a646d90b08fe5c4606a2d5cbf264565b1ced98f12b
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/quay/argoproj/argoexec:v3.5.5
    sizeBytes: 44945247
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/cloudnative-pg/cloudnative-pg@sha256:9b130e8fe2af90c3c9d245ef9fed0a8ef3b33d5784580544596f935ca0daadc4
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/cloudnative-pg/cloudnative-pg:1.23.1
    sizeBytes: 31751056
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/ebs-csi-driver/aws-ebs-csi-driver@sha256:f016caf713c2d191a1e9b900de4cb52e8454f24cfef9aff32737506af2a07efd
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/ebs-csi-driver/aws-ebs-csi-driver:v1.34.0
    sizeBytes: 27599915
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/docker-hub/hashicorp/vault-csi-provider@sha256:bb7d5776ce1501dbda0b417ca6199134e9f27b4047a3b4068f243d30bf69a69f
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/docker-hub/hashicorp/vault-csi-provider:1.4.1
    sizeBytes: 25108100
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/linkerd/proxy@sha256:6ecc3ede913be8014a3f93c34bf6a2e6fbd1f4009f3d39d134b925d609529402
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/linkerd/proxy:edge-24.5.1
    sizeBytes: 21317551
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/eks/aws-load-balancer-controller@sha256:51030bf625a1477e4a78e8efddf95ee85887c71d090caf9a0844c6d5c100c7f6
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/eks/aws-load-balancer-controller:v2.8.0
    sizeBytes: 18114061
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/sig-storage/livenessprobe@sha256:5baeb4a6d7d517434292758928bb33efc6397368cbb48c8a4cf29496abf4e987
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/sig-storage/livenessprobe:v2.12.0
    sizeBytes: 12635307
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/sig-storage/csi-node-driver-registrar@sha256:c53535af8a7f7e3164609838c4b191b42b2d81238d75c1b2a2b582ada62a9780
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/kubernetes/sig-storage/csi-node-driver-registrar:v2.10.0
    sizeBytes: 10291112
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/linkerd/proxy-init@sha256:5bd804267a4e0b585c5e6e1e1cbf5d91887ed73be84e35fe784df2331b6e9c61
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/github/linkerd/proxy-init:v2.4.0
    sizeBytes: 9292400
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/eks-distro/kubernetes-csi/node-driver-registrar@sha256:34eb40a019fd01ec59fc209b56b9b48770593c3f001e9066b3bb2537e529e471
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/eks-distro/kubernetes-csi/node-driver-registrar:v2.11.0-eks-1-30-10
    sizeBytes: 7904464
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/eks-distro/kubernetes-csi/livenessprobe@sha256:f8c95874f8f654a8762b077f165ce60468f02ee73165ae56319768de450a6273
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/ecr-public/eks-distro/kubernetes-csi/livenessprobe:v2.13.0-eks-1-30-10
    sizeBytes: 7877358
  - names:
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/docker-hub/senthilrch/busybox@sha256:d7f4aada301c0f13d93ceed62fef318c195c38bf430fc8bfbdf1d850514422ff
    - 471112529049.dkr.ecr.us-east-2.amazonaws.com/docker-hub/senthilrch/busybox:1.35.0
    sizeBytes: 832382
  - names:
    - localhost/kubernetes/pause:0.1.0
    sizeBytes: 379672
  nodeInfo:
    architecture: arm64
    bootID: 4f2e3678-c4ea-473f-8322-f64ed32cc1b0
    containerRuntimeVersion: containerd://1.7.20+bottlerocket
    kernelVersion: 6.1.102
    kubeProxyVersion: v1.29.5-eks-1109419
    kubeletVersion: v1.29.5-eks-1109419
    machineID: ec2809ecbaa8b6de0415e191224586b8
    operatingSystem: linux
    osImage: Bottlerocket OS 1.21.1 (aws-k8s-1.29)
    systemUUID: ec2809ec-baa8-b6de-0415-e191224586b8
  volumesAttached:
  - devicePath: ""
    name: kubernetes.io/csi/ebs.csi.aws.com^vol-0ba74db3ef876e5ae
  volumesInUse:
  - kubernetes.io/csi/ebs.csi.aws.com^vol-0ba74db3ef876e5ae
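
(A note on the likely crux: the node reports pods: "8" in both capacity and allocatable, while the NodePool annotation and the EC2NodeClass userData shown later in the thread request maxPods: 110. Eight matches the AWS ENI-based default for an r6g.medium; under that formula, stated here as an assumption for illustration:

max pods = ENIs × (IPv4 addresses per ENI − 1) + 2 = 2 × (4 − 1) + 2 = 8

so the kubelet max-pods override from the userData most likely never reached this node.)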
fullykubed commented 2 months ago

Can you post the entire YAML manifest for the node where you see this?

uptownhr commented 2 months ago

> Can you post the entire YAML manifest for the node where you see this?

Updated.

fullykubed commented 2 months ago

@uptownhr Can you share the YAML manifests for the spot-arm NodePool and the spot EC2NodeClass?

uptownhr commented 2 months ago

spot-arm NodePool:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    compatibility.karpenter.sh/v1beta1-kubelet-conversion: '{"maxPods":110}'
    compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"kind":"EC2NodeClass","name":"spot","apiVersion":"karpenter.k8s.aws/v1beta1"}'
    karpenter.sh/nodepool-hash: "7261467632703008228"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-22T20:30:51Z"
  generation: 37
  labels:
    id: 20c7fa5986a581df
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: 9baecb3757767a0965d47bc6d482427ca316239a
    panfactum.com/stack-version: edge.24-08-13
  name: spot-arm
  resourceVersion: "41344917"
  uid: 934b3d40-e0e8-4276-bfcd-4e0cdce40149
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        panfactum.com/class: spot
    spec:
      expireAfter: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "5"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "2500"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      startupTaints:
      - effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: spot
        value: "true"
      - effect: NoSchedule
        key: arm64
        value: "true"
  weight: 10
status:
  conditions:
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-07T22:41:50Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "7"
    ephemeral-storage: 204470Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 39834868Ki
    nodes: "5"
    pods: "82"
uptownhr commented 2 months ago

spot NodePool:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    compatibility.karpenter.sh/v1beta1-kubelet-conversion: '{"maxPods":110}'
    compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"kind":"EC2NodeClass","name":"spot","apiVersion":"karpenter.k8s.aws/v1beta1"}'
    karpenter.sh/nodepool-hash: "12683124325173128385"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-22T20:30:50Z"
  generation: 36
  labels:
    id: 20c7fa5986a581df
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: 9baecb3757767a0965d47bc6d482427ca316239a
    panfactum.com/stack-version: edge.24-08-13
  name: spot
  resourceVersion: "41330704"
  uid: 3a367797-59b9-4e68-bb03-119cb7f31086
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        panfactum.com/class: spot
    spec:
      expireAfter: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "5"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "2500"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      startupTaints:
      - effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: spot
        value: "true"
  weight: 10
status:
  conditions:
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "0"
    ephemeral-storage: "0"
    memory: "0"
    nodes: "0"
    pods: "0"
fullykubed commented 2 months ago

@uptownhr You shared the spot NodePool but I need to see the spot EC2NodeClass.

uptownhr commented 2 months ago

EC2NodeClass for spot:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "1750609685124699303"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
  creationTimestamp: "2024-08-08T03:45:50Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 2
  labels:
    id: 20c7fa5986a581df
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: 9baecb3757767a0965d47bc6d482427ca316239a
    panfactum.com/stack-version: edge.24-08-13
  name: spot
  resourceVersion: "41394504"
  uid: 00104200-6b16-49e2-9ff8-f15d3ceab6b9
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 25Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 40Gi
      volumeType: gp3
  instanceProfile: production-primary-node-20240717014105024500000008
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
  - id: sg-0b41eb1287de43d13
  subnetSelectorTerms:
  - id: subnet-0b11ba45edf03ed4b
  - id: subnet-07085f174a6f72eb2
  - id: subnet-0878bf92175b1307d
  userData: |+
    [settings.kubernetes]
    api-server = "https://20D4ED2C1319774D9D1435564C02AFE7.gr7.us-east-2.eks.amazonaws.com"
    cluster-certificate = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJTEhQUXhkK1BlcUV3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBM01UY3dNVFF3TkRSYUZ3MHpOREEzTVRVd01UUTFORFJhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURHWlR5K2dmR1JEd2wvMGdvYjZXRmhaQmpsMFhxQnJ5NnE5dXpzUzdBdUljM2hLT2d1aUhRT2hYVWYKZUJCUG9ub2kvbE1sc2xxRlIrSFdGNG1MVUZhdWdtZVhCd0QwMG5vbUpiNWNwbEZMVk1HZ0tUT2VpUzM4b3VaSwpMYzlQaWRKMC9EeVBmeU1BSU5rb0JqUnVIQWo4NVhxTlNNbHJxOXp5eFk0QTBYbmp4TnFHclpQdE5DOGlwdHVXCmF1YWhvNCtRYnU5MVorVjh1NmNWUWZ6cjlRbm82WEFTNXdDcERzTmJhZjVLOVNYbnEzbXBLVGFzZWNMbmxJSjgKSGtwQkZZWnQrbFJ4TlJyN252RzZUZnBkTVpKemg5dm1zOWo0dEl2MTgvU3FuZ3dzQm9pSTRJM3krc2ZONnlrZQo2dlZmaFBiOGhUTy9rVVlPVGR6SkpBNnBvWTZaQWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJUZktFRFVoMzc5a09yWXQ0Y2ZuaHlFUmNSQk5EQVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ2o1VmxkVEJ0UQpTK0pSSDZwQjVqZ1I5bTZBMXhKbHlzQ0VYQWQramlPR01kMnNNQm5sZnltbkFmb1lveEFUNWhkS2ZlWVJFS3JYCnlMZ1A5cmZMMVNTNnVxZk9NUEZ5VFJUR0djbUl1azRoQ1RpbXU2Q0piKzZ4MG42UDlzN2k5SU11UjhVRk9vWjEKY3JmOS9laVBYaUp1MEx1M3ZseGlPaTR0N3VwMmw3V3A2ZUhtOVhpOWJaZGkxQ1REeDBkRlZDU0lDb2hmay91Zgo0TUU2N1Q1MXNyMFNHZVhZdU9VMC9KNUxPUHY0eVluNnVFK1lkUkYyNm5JZ1Fmb1VZNVZKNnE3VnpTcjQwSnRTCmR5MThQRVV6akcrWnBpUWpQQkNnN2NNblMzYXBkNTlPTWVIQzZ1VnFWVmR6ZVpvYUVkajZYME9naS9lMEZweDYKMXIrR0RVL3pmTkdLCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
    cluster-name = "production-primary"
    cluster-dns-ip = "172.20.0.10"
    shutdown-grace-period = "2m0s"
    shutdown-grace-period-for-critical-pods = "1m0s"
    image-gc-high-threshold-percent = "85"
    image-gc-low-threshold-percent = "80"
    max-pods = 110
    allowed-unsafe-sysctls = ["net.core.somaxconn"]

    [settings.kubernetes.eviction-hard]
    "memory.available" = "100Mi"
    "nodefs.available" = "10%"
    "nodefs.inodesFree" = "10%"
    [settings.kubernetes.eviction-soft]
    "memory.available" = "250Mi"
    "nodefs.available" = "15%"
    "nodefs.inodesFree" = "15%"
    [settings.kubernetes.eviction-soft-grace-period]
    "memory.available" = "2m0s"
    "nodefs.available" = "2m0s"
    "nodefs.inodesFree" = "2m0s"

    [settings.kubernetes.kube-reserved]
    memory = "500Mi"

    [settings.kubernetes.system-reserved]
    memory = "100Mi"

    [settings.kernel.sysctl]
    "user.max_user_namespaces" = "16384"
    "vm.max_map_count" = "262144"

status:
  amis:
  - id: ami-02203bfad2253fafc
    name: bottlerocket-aws-k8s-1.29-nvidia-aarch64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-02203bfad2253fafc
    name: bottlerocket-aws-k8s-1.29-nvidia-aarch64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-0841fd5d2ad391b02
    name: bottlerocket-aws-k8s-1.29-aarch64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  - id: ami-08bc72f66a12d0908
    name: bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-08bc72f66a12d0908
    name: bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-005ef77fe508d409f
    name: bottlerocket-aws-k8s-1.29-x86_64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  conditions:
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: AMIsReady
    status: "True"
    type: AMIsReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: InstanceProfileReady
    status: "True"
    type: InstanceProfileReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: SecurityGroupsReady
    status: "True"
    type: SecurityGroupsReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  instanceProfile: production-primary-node-20240717014105024500000008
  securityGroups:
  - id: sg-0b41eb1287de43d13
    name: production-primary-nodes-20240717014104717600000004
  subnets:
  - id: subnet-07085f174a6f72eb2
    zone: us-east-2b
    zoneID: use2-az2
  - id: subnet-0b11ba45edf03ed4b
    zone: us-east-2a
    zoneID: use2-az1
  - id: subnet-0878bf92175b1307d
    zone: us-east-2c
    zoneID: use2-az3
fullykubed commented 2 months ago

@uptownhr According to these manifests, your kube_karpenter_node_pools module is on version edge.24-08-13 of the stack. Can you upgrade to the latest and try again?
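
One way to confirm which stack version generated a given resource is to read the panfactum.com/stack-version label directly (a sketch, assuming kubectl access; the label appears in the manifests above):

kubectl get nodepools.karpenter.sh spot-arm \
  -o jsonpath='{.metadata.labels.panfactum\.com/stack-version}{"\n"}'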

uptownhr commented 2 months ago

Attempting to apply the node pools results in an error I have not seen before:

╷
│ Error: There was a field manager conflict when trying to apply the manifest for "/default"
│ 
│   with kubernetes_manifest.default_node_class,
│   on main.tf line ..., in resource "kubernetes_manifest" "default_node_class":
│   resource "kubernetes_manifest" "default_node_class" {
│ 
│ The API returned the following conflict: "Apply failed with 2 conflicts:
│ conflicts with \"before-first-apply\" using karpenter.k8s.aws/v1beta1:\n-
│ .metadata.labels.panfactum.com/stack-commit\n-
│ .metadata.labels.panfactum.com/stack-version"
│ 
│ You can override this conflict by setting "force_conflicts" to true in the
│ "field_manager" block.
╵
╷
│ Error: There was a field manager conflict when trying to apply the manifest for "/spot"
│ 
│   with kubernetes_manifest.spot_node_class,
│   on main.tf line 252, in resource "kubernetes_manifest" "spot_node_class":
│  252: resource "kubernetes_manifest" "spot_node_class" {
│ 
│ The API returned the following conflict: "Apply failed with 2 conflicts:
│ conflicts with \"before-first-apply\" using karpenter.k8s.aws/v1beta1:\n-
│ .metadata.labels.panfactum.com/stack-commit\n-
│ .metadata.labels.panfactum.com/stack-version"
│ 
│ You can override this conflict by setting "force_conflicts" to true in the
│ "field_manager" block.
╵
╷
│ Error: There was a field manager conflict when trying to apply the manifest for "/burstable"
│ 
│   with kubernetes_manifest.burstable_node_class,
│   on main.tf line 270, in resource "kubernetes_manifest" "burstable_node_class":
│  270: resource "kubernetes_manifest" "burstable_node_class" {
│ 
│ The API returned the following conflict: "Apply failed with 2 conflicts:
│ conflicts with \"before-first-apply\" using karpenter.k8s.aws/v1beta1:\n-
│ .metadata.labels.panfactum.com/stack-commit\n-
│ .metadata.labels.panfactum.com/stack-version"
│ 
│ You can override this conflict by setting "force_conflicts" to true in the
│ "field_manager" block.

I attempted to delete the existing node classes. Although burstable and default were deleted and recreated on the latest version, I was not able to delete spot due to:

Waiting on NodeClaim termination for spot-arm-qhbpg, spot-arm-tj8wz, spot-arm-hfclt, spot-arm-cpnpq, spot-arm-g8lgb
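
Since the spot EC2NodeClass carries the karpenter.k8s.aws/termination finalizer, it cannot be removed while NodeClaims still reference it. A minimal sketch for inspecting the stuck claims (assuming kubectl access and the Karpenter v1 CRDs; the claim name comes from the message above):

# List NodeClaims still holding up the EC2NodeClass deletion
kubectl get nodeclaims

# Check the termination status and recent events of one stuck claim
kubectl describe nodeclaim spot-arm-qhbpg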

uptownhr commented 2 months ago

@fullykubed Looking at the karpenter node pools module, I see that force_conflicts = true was added. Does this need to be wrapped in a field_manager block, though?
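
For reference, the Terraform kubernetes provider takes force_conflicts inside a field_manager block on the kubernetes_manifest resource, matching the hint in the error output (a minimal sketch; the yamldecode/file source for the manifest body is hypothetical):

resource "kubernetes_manifest" "spot_node_class" {
  # Hypothetical manifest source; the real module builds this inline
  manifest = yamldecode(file("${path.module}/spot_node_class.yaml"))

  field_manager {
    # Take ownership of fields currently claimed by "before-first-apply"
    force_conflicts = true
  }
}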

fullykubed commented 2 months ago

@uptownhr On the latest version of the Panfactum Stack, the conflict resolution for this module should work.

uptownhr commented 2 months ago

I have upgraded to edge.24-09-10, followed the steps, and gotten past the issue.