aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter Node NotReady when provided with extra kubelet args #5043

Closed hitsub2 closed 10 months ago

hitsub2 commented 10 months ago

Description

Observed Behavior: When provided with the following kubelet args, some nodes (2 out of 400) are NotReady and Karpenter cannot disrupt them, leaving them stuck forever.

Extra kubelet config:

--cpu-manager-policy=static --enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice
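
For reference, the kube-reserved and system-reserved reservations used in the userData below can also be expressed through the NodePool's kubelet block rather than raw kubelet args (a partial sketch against the karpenter.sh/v1beta1 API, with a hypothetical NodePool name); flags like --cpu-manager-policy and the cgroup enforcement options would still have to go through --kubelet-extra-args in user data:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: dc-spark-memory            # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        name: dc-spark-ue1-prod-memory
      kubelet:
        # equivalent to the --kube-reserved / --system-reserved kubelet args
        kubeReserved:
          cpu: 500m
          memory: 1Gi
          ephemeral-storage: 2Gi
        systemReserved:
          cpu: 500m
          memory: 1Gi
          ephemeral-storage: 2Gi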

ec2nodeclass.yaml

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: dc-spark-ue1-prod-memory
spec:
  amiFamily: AL2
  amiSelectorTerms:
  - id: ami-0c97930d0d19e564a
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: false
      iops: 3000
      throughput: 125
      volumeSize: 200Gi
      volumeType: gp3
  detailedMonitoring: false
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-dongdgy-karpenter-demo
  securityGroupSelectorTerms:
  - id: sg-03b6a8b2900572e14
  subnetSelectorTerms:
  - id: subnet-041c9f82b633f50ca
  tags:
    Name: dc-spark-ue1-prod-memory
    billing_entry: data_engineering
    billing_group: bigdata
    billing_service: spark
    workload: general
  userData: "#!/bin/bash 
 set -o xtrace 
 mkdir -p /sys/fs/cgroup/cpuset/system.slice && mkdir -p /sys/fs/cgroup/hugetlb/system.slice 
 /etc/eks/bootstrap.sh dc-spark-ue1-prod --kubelet-extra-args '--node-labels=billing_service=spark,lifecycle=Ec2Spot,billing_group=bigdata,billing_entry=data_engineering,workload=general,node_group_name=dc-spark-ue1-prod-memory,NAME=dc-spark-ue1-prod,env=prod --cpu-manager-policy=static --enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice --kube-reserved=cpu=500m,memory=1Gi,ephemeral-storage=2Gi --system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=2Gi'"
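
With amiFamily: AL2, Karpenter generates its own bootstrap invocation and merges it with the userData above, so what a node actually ran can differ from what is written in the EC2NodeClass. One way to confirm what a stuck node executed is to pull the rendered user data for that instance (a diagnostic sketch using the standard AWS CLI; the instance ID is a placeholder):

# Substitute the EC2 instance ID of the NotReady node
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute userData \
  --query 'UserData.Value' \
  --output text | base64 --decode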

kubelet error log


Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.137963    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?resourceVersion=0&timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.142093    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.146351    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.150486    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.155528    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.155545    4330 kubelet_node_status.go:526] "Unable to update node status" err="update node status exceeds retry count"
Nov 02 07:55:40 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:40.396985    4330 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://123456.gr7.us-east-1.eks.amazonaws.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-17-182-225.ec2.internal?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
Nov 02 07:55:40 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:40.406468    4330 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ip-172-17-182-225.ec2.internal.1793bf338cd27e84", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-17-182-225.ec2.internal", UID:"ip-172-17-182-225.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"ip-172-17-182-225.ec2.internal"}, FirstTimestamp:time.Date(2023, time.November, 2, 7, 55, 12, 575651460, time.Local), LastTimestamp:time.Date(2023, time.November, 2, 7, 55, 12, 575651460, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces/default/events": tls: failed to verify certificate: x509: certificate signed by unknown authority'(may retry after sleeping)
Nov 02 07:55:41 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:41.523852    4330 kubelet_resources.go:45] "Allocatable" allocatable=map[attachable-volumes-aws-ebs:{i:{value:25 scale:0} d:{Dec:<nil>} s:25 Format:DecimalSI} cpu:{i:{value:7 scale:0} d:{Dec:<nil>} s:7 Format:DecimalSI} ephemeral-storage:{i:{value:188967217652 scale:0} d:{Dec:<nil>} s:188967217652 Format:DecimalSI} hugepages-1Gi:{i:{value:0 scale:0} d:{Dec:<nil>} s:0 Format:DecimalSI} hugepages-2Mi:{i:{value:0 scale:0} d:{Dec:<nil>} s:0 Format:DecimalSI} memory:{i:{value:64331677696 scale:0} d:{Dec:<nil>} s: Format:BinarySI} pods:{i:{value:58 scale:0} d:{Dec:<nil>} s:58 Format:DecimalSI}]
Nov 02 07:55:41 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:41.860071    4330 kubelet.go:2156] "SyncLoop (PLEG): event for pod" pod="kube-admin/collector-wnt57" event=&{ID:d2978f3b-836b-451a-ad25-baf6dcd98f70 Type:ContainerStarted Data:6bc832bd393af94de8f6a7455e60f333a390c286d5b1f474f16e559422a47c85}
Nov 02 07:55:41 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:41.860276    4330 kubelet_pods.go:897] "Unable to retrieve pull secret, the image pull may not succeed." pod="kube-admin/collector-wnt57" secret="" err="secret \"default-secret\" not found"
Nov 02 07:55:42 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:42.861650    4330 kubelet_pods.go:897] "Unable to retrieve pull secret, the image pull may not succeed." pod="kube-admin/collector-wnt57" secret="" err="secret \"default-secret\" not found"
Nov 02 07:55:42 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:42.978675    4330 state_mem.go:80] "Updated desired CPUSet" podUID="d2978f3b-836b-451a-ad25-baf6dcd98f70" containerName="collector" cpuSet="0-7"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.322831    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?resourceVersion=0&timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.327170    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.330937    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.334953    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.338930    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.338945    4330 kubelet_node_status.go:526] "Unable to update node status" err="update node status exceeds retry count"
Nov 02 07:55:47 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:47.403782    4330 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://123456.gr7.us-east-1.eks.amazonaws.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-17-182-225.ec2.internal?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
Nov 02 11:09:58 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 11:09:58.521341    4330 log.go:194] http: TLS handshake error from 172.29.73.185:36194: no serving certificate available for the kubelet
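
The x509 failures above indicate the kubelet ended up with a CA bundle that the API server's certificate does not chain to. One way to check on an affected node is to compare the CA that bootstrap.sh wrote locally against the CA the cluster actually advertises (a diagnostic sketch, assuming the AL2 AMI's default /etc/kubernetes/pki/ca.crt path):

# On the affected node: fingerprint the CA the kubelet was bootstrapped with
sudo md5sum /etc/kubernetes/pki/ca.crt

# From a workstation: fingerprint the cluster's real CA for comparison
aws eks describe-cluster --name dc-spark-ue1-prod \
  --query 'cluster.certificateAuthority.data' --output text | base64 --decode | md5sum

If the fingerprints differ, the node was bootstrapped with a stale or incorrect --b64-cluster-ca value.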

Expected Behavior: All nodes should become Ready; if NotReady nodes come up, Karpenter should recycle or disrupt them.

Reproduction Steps (Please include YAML):

Versions:

hitsub2 commented 10 months ago

After changing amiFamily from AL2 to Custom, there no longer seem to be any NotReady nodes. So my question: what is the behavior when providing kubelet config via user data? Is the user data executed twice, and could that have caused this bug?
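
For context on the AL2 vs. Custom difference: with amiFamily: AL2, Karpenter merges the supplied userData into a MIME multi-part document and appends its own bootstrap.sh section, so a userData script that already calls bootstrap.sh can end up bootstrapping twice; with amiFamily: Custom, the userData is passed through as-is. The merged document looks roughly like this (an illustrative sketch, not verbatim Karpenter output; the boundary, endpoint, and CA values are placeholders):

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# first bootstrap.sh call: the userData from the EC2NodeClass
set -o xtrace
/etc/eks/bootstrap.sh dc-spark-ue1-prod --kubelet-extra-args '...'

--BOUNDARY
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -xe
# second bootstrap.sh call: the section Karpenter generates for amiFamily: AL2
/etc/eks/bootstrap.sh 'dc-spark-ue1-prod' --apiserver-endpoint '<endpoint>' --b64-cluster-ca '<base64-ca>' ...

--BOUNDARY--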

engedaam commented 10 months ago

This seems like a duplicate of Node repair. Since most of the nodes (398/400) became ready, it seems like a transient error was the problem in this case.

hitsub2 commented 10 months ago

It is Karpenter's responsibility to do the node repair, but I am just wondering why this happens. Is it due to the user data being run twice?

engedaam commented 10 months ago

I suspect it's not due to userData, as most of the nodes are ready.

engedaam commented 10 months ago

Closing as a duplicate of https://github.com/aws/karpenter-core/issues/750

sidewinder12s commented 3 weeks ago

@hitsub2 Just wondering, were you following a guide or something else for working with these flags?

--enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice