uptownhr closed this issue 2 months ago
Can you post the entire YAML manifest for the node where you see this?
updated
@uptownhr Can you share the YAML manifests for the spot-arm NodePool and the spot EC2NodeClass?
spot-arm NodePool:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    compatibility.karpenter.sh/v1beta1-kubelet-conversion: '{"maxPods":110}'
    compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"kind":"EC2NodeClass","name":"spot","apiVersion":"karpenter.k8s.aws/v1beta1"}'
    karpenter.sh/nodepool-hash: "7261467632703008228"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-22T20:30:51Z"
  generation: 37
  labels:
    id: 20c7fa5986a581df
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: 9baecb3757767a0965d47bc6d482427ca316239a
    panfactum.com/stack-version: edge.24-08-13
  name: spot-arm
  resourceVersion: "41344917"
  uid: 934b3d40-e0e8-4276-bfcd-4e0cdce40149
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        panfactum.com/class: spot
    spec:
      expireAfter: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "5"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "2500"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      startupTaints:
      - effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: spot
        value: "true"
      - effect: NoSchedule
        key: arm64
        value: "true"
  weight: 10
status:
  conditions:
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-07T22:41:50Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "7"
    ephemeral-storage: 204470Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 39834868Ki
    nodes: "5"
    pods: "82"
spot NodePool:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    compatibility.karpenter.sh/v1beta1-kubelet-conversion: '{"maxPods":110}'
    compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"kind":"EC2NodeClass","name":"spot","apiVersion":"karpenter.k8s.aws/v1beta1"}'
    karpenter.sh/nodepool-hash: "12683124325173128385"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-22T20:30:50Z"
  generation: 36
  labels:
    id: 20c7fa5986a581df
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: 9baecb3757767a0965d47bc6d482427ca316239a
    panfactum.com/stack-version: edge.24-08-13
  name: spot
  resourceVersion: "41330704"
  uid: 3a367797-59b9-4e68-bb03-119cb7f31086
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        panfactum.com/class: spot
    spec:
      expireAfter: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "5"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "2500"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      startupTaints:
      - effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: spot
        value: "true"
  weight: 10
status:
  conditions:
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "0"
    ephemeral-storage: "0"
    memory: "0"
    nodes: "0"
    pods: "0"
@uptownhr You shared the spot NodePool but I need to see the spot EC2NodeClass.
EC2NodeClass for spot:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "1750609685124699303"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
  creationTimestamp: "2024-08-08T03:45:50Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 2
  labels:
    id: 20c7fa5986a581df
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: 9baecb3757767a0965d47bc6d482427ca316239a
    panfactum.com/stack-version: edge.24-08-13
  name: spot
  resourceVersion: "41394504"
  uid: 00104200-6b16-49e2-9ff8-f15d3ceab6b9
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 25Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 40Gi
      volumeType: gp3
  instanceProfile: production-primary-node-20240717014105024500000008
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
  - id: sg-0b41eb1287de43d13
  subnetSelectorTerms:
  - id: subnet-0b11ba45edf03ed4b
  - id: subnet-07085f174a6f72eb2
  - id: subnet-0878bf92175b1307d
  userData: |+
    [settings.kubernetes]
    api-server = "https://20D4ED2C1319774D9D1435564C02AFE7.gr7.us-east-2.eks.amazonaws.com"
    cluster-certificate = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJTEhQUXhkK1BlcUV3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBM01UY3dNVFF3TkRSYUZ3MHpOREEzTVRVd01UUTFORFJhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURHWlR5K2dmR1JEd2wvMGdvYjZXRmhaQmpsMFhxQnJ5NnE5dXpzUzdBdUljM2hLT2d1aUhRT2hYVWYKZUJCUG9ub2kvbE1sc2xxRlIrSFdGNG1MVUZhdWdtZVhCd0QwMG5vbUpiNWNwbEZMVk1HZ0tUT2VpUzM4b3VaSwpMYzlQaWRKMC9EeVBmeU1BSU5rb0JqUnVIQWo4NVhxTlNNbHJxOXp5eFk0QTBYbmp4TnFHclpQdE5DOGlwdHVXCmF1YWhvNCtRYnU5MVorVjh1NmNWUWZ6cjlRbm82WEFTNXdDcERzTmJhZjVLOVNYbnEzbXBLVGFzZWNMbmxJSjgKSGtwQkZZWnQrbFJ4TlJyN252RzZUZnBkTVpKemg5dm1zOWo0dEl2MTgvU3FuZ3dzQm9pSTRJM3krc2ZONnlrZQo2dlZmaFBiOGhUTy9rVVlPVGR6SkpBNnBvWTZaQWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJUZktFRFVoMzc5a09yWXQ0Y2ZuaHlFUmNSQk5EQVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ2o1VmxkVEJ0UQpTK0pSSDZwQjVqZ1I5bTZBMXhKbHlzQ0VYQWQramlPR01kMnNNQm5sZnltbkFmb1lveEFUNWhkS2ZlWVJFS3JYCnlMZ1A5cmZMMVNTNnVxZk9NUEZ5VFJUR0djbUl1azRoQ1RpbXU2Q0piKzZ4MG42UDlzN2k5SU11UjhVRk9vWjEKY3JmOS9laVBYaUp1MEx1M3ZseGlPaTR0N3VwMmw3V3A2ZUhtOVhpOWJaZGkxQ1REeDBkRlZDU0lDb2hmay91Zgo0TUU2N1Q1MXNyMFNHZVhZdU9VMC9KNUxPUHY0eVluNnVFK1lkUkYyNm5JZ1Fmb1VZNVZKNnE3VnpTcjQwSnRTCmR5MThQRVV6akcrWnBpUWpQQkNnN2NNblMzYXBkNTlPTWVIQzZ1VnFWVmR6ZVpvYUVkajZYME9naS9lMEZweDYKMXIrR0RVL3pmTkdLCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
    cluster-name = "production-primary"
    cluster-dns-ip = "172.20.0.10"
    shutdown-grace-period = "2m0s"
    shutdown-grace-period-for-critical-pods = "1m0s"
    image-gc-high-threshold-percent = "85"
    image-gc-low-threshold-percent = "80"
    max-pods = 110
    allowed-unsafe-sysctls = ["net.core.somaxconn"]
    [settings.kubernetes.eviction-hard]
    "memory.available" = "100Mi"
    "nodefs.available" = "10%"
    "nodefs.inodesFree" = "10%"
    [settings.kubernetes.eviction-soft]
    "memory.available" = "250Mi"
    "nodefs.available" = "15%"
    "nodefs.inodesFree" = "15%"
    [settings.kubernetes.eviction-soft-grace-period]
    "memory.available" = "2m0s"
    "nodefs.available" = "2m0s"
    "nodefs.inodesFree" = "2m0s"
    [settings.kubernetes.kube-reserved]
    memory = "500Mi"
    [settings.kubernetes.system-reserved]
    memory = "100Mi"
    [settings.kernel.sysctl]
    "user.max_user_namespaces" = "16384"
    "vm.max_map_count" = "262144"
status:
  amis:
  - id: ami-02203bfad2253fafc
    name: bottlerocket-aws-k8s-1.29-nvidia-aarch64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-02203bfad2253fafc
    name: bottlerocket-aws-k8s-1.29-nvidia-aarch64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-0841fd5d2ad391b02
    name: bottlerocket-aws-k8s-1.29-aarch64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  - id: ami-08bc72f66a12d0908
    name: bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-08bc72f66a12d0908
    name: bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-005ef77fe508d409f
    name: bottlerocket-aws-k8s-1.29-x86_64-v1.21.1-82691b51
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  conditions:
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: AMIsReady
    status: "True"
    type: AMIsReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: InstanceProfileReady
    status: "True"
    type: InstanceProfileReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: SecurityGroupsReady
    status: "True"
    type: SecurityGroupsReady
  - lastTransitionTime: "2024-09-07T22:41:51Z"
    message: ""
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  instanceProfile: production-primary-node-20240717014105024500000008
  securityGroups:
  - id: sg-0b41eb1287de43d13
    name: production-primary-nodes-20240717014104717600000004
  subnets:
  - id: subnet-07085f174a6f72eb2
    zone: us-east-2b
    zoneID: use2-az2
  - id: subnet-0b11ba45edf03ed4b
    zone: us-east-2a
    zoneID: use2-az1
  - id: subnet-0878bf92175b1307d
    zone: us-east-2c
    zoneID: use2-az3
@uptownhr According to these manifests, your kube_karpenter_node_pools module is on version edge.24-08-13 of the stack. Can you upgrade to the latest and try again?
Attempting to apply the node pools results in an error that I have not seen before:
│ ...: resource "kubernetes_manifest" "default_node_class" {
│
│ The API returned the following conflict: "Apply failed with 2 conflicts:
│ conflicts with \"before-first-apply\" using karpenter.k8s.aws/v1beta1:\n-
│ .metadata.labels.panfactum.com/stack-commit\n-
│ .metadata.labels.panfactum.com/stack-version"
│
│ You can override this conflict by setting "force_conflicts" to true in the
│ "field_manager" block.
╵
╷
│ Error: There was a field manager conflict when trying to apply the manifest for "/spot"
│
│ with kubernetes_manifest.spot_node_class,
│ on main.tf line 252, in resource "kubernetes_manifest" "spot_node_class":
│ 252: resource "kubernetes_manifest" "spot_node_class" {
│
│ The API returned the following conflict: "Apply failed with 2 conflicts:
│ conflicts with \"before-first-apply\" using karpenter.k8s.aws/v1beta1:\n-
│ .metadata.labels.panfactum.com/stack-commit\n-
│ .metadata.labels.panfactum.com/stack-version"
│
│ You can override this conflict by setting "force_conflicts" to true in the
│ "field_manager" block.
╵
╷
│ Error: There was a field manager conflict when trying to apply the manifest for "/burstable"
│
│ with kubernetes_manifest.burstable_node_class,
│ on main.tf line 270, in resource "kubernetes_manifest" "burstable_node_class":
│ 270: resource "kubernetes_manifest" "burstable_node_class" {
│
│ The API returned the following conflict: "Apply failed with 2 conflicts:
│ conflicts with \"before-first-apply\" using karpenter.k8s.aws/v1beta1:\n-
│ .metadata.labels.panfactum.com/stack-commit\n-
│ .metadata.labels.panfactum.com/stack-version"
│
│ You can override this conflict by setting "force_conflicts" to true in the
│ "field_manager" block.
I have attempted to delete the existing node classes, and although burstable and default were deleted and recreated with the latest versions, I was not able to delete spot due to:
Waiting on NodeClaim termination for spot-arm-qhbpg, spot-arm-tj8wz, spot-arm-hfclt, spot-arm-cpnpq, spot-arm-g8lgb
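(For anyone else who gets stuck here: the NodeClaims blocking the deletion can be inspected and cleaned up directly. A rough kubectl sketch using one of the NodeClaim names from the message above; clearing the finalizer is a last resort and assumes the underlying EC2 instance is already gone.)

# List the NodeClaims that are holding up the EC2NodeClass deletion
kubectl get nodeclaims
# See why a specific NodeClaim is stuck terminating
kubectl describe nodeclaim spot-arm-qhbpg
# Delete it without waiting for termination to finish
kubectl delete nodeclaim spot-arm-qhbpg --wait=false
# Only if the EC2 instance no longer exists: drop the finalizer so the object can be removed
kubectl patch nodeclaim spot-arm-qhbpg --type merge -p '{"metadata":{"finalizers":null}}'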
@fullykubed Looking at the Karpenter node pools module, I see that force_conflicts = true was added. Does this need to be wrapped in a field_manager block, though?
@uptownhr On the latest version of the Panfactum Stack, the conflict resolution for this module should work.
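For reference, force_conflicts belongs inside the field_manager block of the kubernetes_manifest resource in the Terraform kubernetes provider. A minimal sketch of the shape (abbreviated, not the module's actual code):

resource "kubernetes_manifest" "spot_node_class" {
  manifest = {
    apiVersion = "karpenter.k8s.aws/v1"
    kind       = "EC2NodeClass"
    metadata   = { name = "spot" }
    # ... remaining EC2NodeClass fields ...
  }

  field_manager {
    # Take ownership of fields currently held by the "before-first-apply" manager
    force_conflicts = true
  }
}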
I have upgraded to edge.24-09-10, followed the steps, and bypassed the issue.
Prior Search
What happened?
I recently upgraded to edge.09-04, and after playing around with suspending and restoring the EKS nodes, I've noticed a few pods lingering around as Pending.
The entire cluster stayed at 25% utilization, so the pods should have been rescheduled. I've confirmed that the node the pod was targeting is only 40% utilized. However, I noticed that the node's Pod capacity is sitting at 8, with 6 running.
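(A quick way to double-check a node's Pod capacity against what is actually scheduled; a sketch where <node-name> is a placeholder:)

# Pod capacity and allocatable pod slots reported by the node
kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}{"\n"}{.status.allocatable.pods}{"\n"}'
# Pods currently assigned to that node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>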
Steps to Reproduce
edge.09-04
Relevant log output
Reviewing the Pod details shows the event message:
Pods running on node:
Node manifest: