Closed. infestonn closed this issue 2 years ago.
@infestonn I believe you are running into the same root cause as #628. Are there taints on the nodes that come up that prevent kube-scheduler from binding the pods, causing Karpenter to provision new capacity?
@tzneal No, we don't have any taints in the matching provisioner "default", as you can see. I must say this is not the only provisioner we have. There are three more provisioners which have taints, but they do not match the pods I described.
Sorry, I was referring to taints that Istio or something else may be putting on the node. Does kubectl get events show why the pod was evicted? Karpenter shouldn't schedule new capacity unless for some reason kube-scheduler evicts and marks the pod as unschedulable again.
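For example (pod, namespace, and node names below are placeholders), something like this should show both the pod's events and any taints on the node:

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
kubectl get node <node-name> -o jsonpath='{.spec.taints}'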
taints:
- effect: NoSchedule
  key: karpenter.sh/not-ready
- effect: NoSchedule
  key: node.kubernetes.io/not-ready
  timeAdded: "2022-04-19T07:50:11Z"
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  timeAdded: "2022-04-19T07:50:12Z"
These are the taints I see while the node is being provisioned; I believe they are the common taints for a node that is joining a cluster.
At the time the pod fails there are no taints at all.
To be accurate, the pods are not "Evicted": their status is Init:ContainerStatusUnknown.
What is the restart policy of your deployment? If it's set to Always, the init container should restart (see https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Regarding your suggested change, Karpenter currently binds pods to nodes before they become ready. This allows us to enforce our bin-packing decisions. However, we are actively investigating not doing this pre-binding.
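For context, "binding" here means the standard pods/binding subresource, i.e. setting the pod's nodeName before the node reports Ready; it is roughly the equivalent of creating an object like this (names are placeholders):

apiVersion: v1
kind: Binding
metadata:
  name: <pod-name>
  namespace: <namespace>
target:
  apiVersion: v1
  kind: Node
  name: <node-name>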
I already did some research regarding the restart policy. The sad part is that the deployment's restartPolicy is Always, which makes things more confusing.
All the nodes below are empty; the 8 pods on each node are just DaemonSet pods.
Pod requests: CPU=5200m, MEM=24960Mi
Any deployment/STS with an init container that completes with exit code > 0.
I was not quite right: running a deployment with an init container whose exit status is unsuccessful does result in the init container being restarted, as it should (https://kubernetes.io/docs/concepts/workloads/pods/init-containers/); see the minimal example at the end of this comment.
Whereas the istio-validation container fails with a slightly different status, Init:ContainerStatusUnknown, and the message "The container could not be located when the pod was terminated". In this case the restart policy does not kick in, which is why a new pod is created and Karpenter starts provisioning a new node.
The question of why the pod isn't scheduled on an empty node is still open.
@infestonn Is this a node that has not become ready yet, or an existing ready node in your cluster? If it's an existing ready node in your cluster, then kube-scheduler is responsible for scheduling those pods. You can look at the output of kubectl describe pod pod-name to see why it failed to schedule.
Existing ones. I agree with you that kube-scheduler is responsible for that. But I'm afraid I cannot get any useful info from describing a pod. Once the init container fails, the pod dies and the ReplicaSet controller creates a new one, which is picked up by Karpenter and instantly bound to a new node. I wonder if there is any mechanism in Karpenter which somehow breaks the regular k8s scheduling workflow? PS: we don't have any custom schedulers.
There shouldn't be, does describing the pod not show any events?
Here's an example of a pod's events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 89s (x2 over 91s) default-scheduler 0/59 nodes are available: 2 node(s) had taint {karpenter: true}, that the pod didn't tolerate, 3 node(s) didn't match pod topology spread constraints, 32 Insufficient cpu, 40 Insufficient memory, 5 node(s) had taint {prometheus-workload: true}, that the pod didn't tolerate, 6 node(s) had taint {infrastructure-workload: true}, that the pod didn't tolerate.
Normal SuccessfulAttachVolume 33s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-243d412c-970e-11eb-a6af-0251424bfb4b"
Warning NetworkNotReady 30s (x8 over 44s) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Warning FailedCreatePodSandBox 27s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e199716a4a246514e51f25a92e6303286fe690c0ec12cfe1e6194bbf190b3cf7" network for pod "apod-es-7": networkPlugin cni failed to set up pod "apod-es-7_apod" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Normal SandboxChanged 25s kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 24s kubelet Pulling image "111111111111123.dkr.ecr.us-west-1.amazonaws.com/istio/proxyv2:1.12.3"
Normal Pulled 19s kubelet Successfully pulled image "111111111111123.dkr.ecr.us-west-1.amazonaws.com/istio/proxyv2:1.12.3" in 5.207504782s
Normal Created 6s (x2 over 12s) kubelet Created container istio-validation
Normal Started 6s (x2 over 12s) kubelet Started container istio-validation
Normal Pulled 6s kubelet Container image "111111111111123.dkr.ecr.us-west-1.amazonaws.com/istio/proxyv2:1.12.3" already present on machine
Is there a helm chart I can use to try to replicate this? Those events make it look like the pod is going to run as the last message is about pulling the image.
The pod does run. The issue here is that it fails and the controller creates a new one. This is the status of the istio-validation container right after it exits with an error:
State:          Terminated
  Reason:       ContainerStatusUnknown
  Message:      The container could not be located when the pod was terminated
  Exit Code:    137
  Started:      Mon, 01 Jan 0001 00:00:00 +0000
  Finished:     Mon, 01 Jan 0001 00:00:00 +0000
Last State:     Terminated
  Reason:       Error
  Exit Code:    126
  Started:      Tue, 26 Apr 2022 16:16:28 +0300
  Finished:     Tue, 26 Apr 2022 16:16:33 +0300
"The container could not be located when the pod was terminated": I believe this is the root cause of why restartPolicy: Always doesn't work here.
I also tried to work around this by adding a delay to the calico-node pod, so that my newly joined node stayed in "NotReady" status (NetworkUnavailable=true) for ~120s. I hoped that would let istio-cni and the other DaemonSets finish bootstrapping. Unfortunately nothing changed. A container is created on the node before the CNI is initialized, which breaks its networking, and only recreating the container helps, not restarting it. But that is just my guess.
Is there a helm chart I can use to try to replicate this?
I will try to compose a minimum set of components to replicate this.
@tzneal I'm unable to fully reproduce the issue 😞 But the overall situation is clearer to me now. I know exactly what makes the pods fail: the race condition described here:
There is a time gap between a node becomes schedulable and the Istio CNI plugin becomes ready on that node. If an application pod starts up during this time, it is possible that traffic redirection is not properly set up and traffic would be able to bypass the Istio sidecar. This race condition is mitigated by a “detect and repair” method.
We have Istio installed with the CNI plugin:
cat <<EOF | istioctl install -y -f -
apiVersion: null
kind: IstioOperator
spec:
  hub: istio
  components:
    cni:
      enabled: true
      namespace: istio-system
    pilot:
      k8s:
        tolerations:
        - key: infrastructure-workload
          operator: Equal
          value: "true"
          effect: NoSchedule
  meshConfig:
    defaultConfig:
      holdApplicationUntilProxyStarts: true
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
  values:
    cni:
      excludeNamespaces:
      - istio-system
      - kube-system
      repair:
        deletePods: true
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 2000m
            memory: 2048Mi
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - curl -X POST localhost:15000/drain_listeners?inboundonly; while [ $(netstat -plunt | grep tcp | grep -v envoy | grep -v pilot-agent | wc -l | xargs) -ne 0 ]; do sleep 1; done
EOF
To mitigate the race between an application pod and the Istio CNI DaemonSet, an istio-validation init container is added as part of the sidecar injection, which detects if traffic redirection is set up correctly, and blocks the pod starting up if not. The CNI DaemonSet will detect and evict any pod stuck in such state. When the new pod starts up, it should have traffic redirection set up properly. This mitigation is enabled by default and can be turned off by setting values.cni.repair.enabled to false.
So the deletePods: true flag does exactly what it is supposed to do. Changing it to false does not help either: the istio-validation init container then restarts following the restart policy described above, but keeps failing over and over again because the container's network is broken.
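For completeness, this is roughly the overlay that controls that behaviour (an excerpt of the IstioOperator values above; the field names follow the istio-cni chart):

values:
  cni:
    repair:
      enabled: true      # the "detect and repair" mitigation; false disables it
      deletePods: true   # false stops the CNI DaemonSet from deleting stuck pods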
This is my test StatefulSet:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: inflate
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inflate
  serviceName: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: inflate
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
        resources:
          requests:
            cpu: 3
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: inflate
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: inflate-models
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem
EOF
Its spec is the same as that of our application in the production env. The pods of this STS fail in the same way, with the Init:ContainerStatusUnknown error, but the difference is that they are instantly rescheduled on the same node and the provisioning loop never happens 😞
@infestonn I think this is similar to a Cilium issue that we are looking at ways to solve. Cilium wants a taint to be placed on the node at startup, and then it removes the taint when the node networking is fully configured. Can Istio notify us by removing a taint? Or does it just use node readiness?
Related: #1727
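For illustration, the Cilium pattern mentioned above works roughly like this (the taint name is the one Cilium documents; the exact effect varies by version):

# taint applied when the node registers, e.g. via kubelet --register-with-taints
# or the node template used by the autoscaler
node.cilium.io/agent-not-ready=true:NoSchedule

# once the agent has configured networking on the node it removes the taint,
# roughly equivalent to:
kubectl taint nodes <node-name> node.cilium.io/agent-not-ready-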
To my knowledge it relies on node readiness. The main CNI (Calico, aws-cni, etc.) is in charge of the network configuration on a node and updates the node status when it is ready. Only after that does the Istio pod, as a chained CNI, start running, along with the rest of the pods on the node.
Note: The Istio CNI plugin operates as a chained CNI plugin, and it is designed to be used with another CNI plugin, such as PTP or Calico. See compatibility with other CNI plugins for details.
@infestonn Thanks for the info. We are actively investigating not binding pods to nodes. In this case, we would just launch the node and allow kube-scheduler to bind pods after the node has become ready. It should avoid the issue that you are seeing where the pod is bound before the node is ready and initialization fails.
Labeled for closure due to inactivity in 10 days.
Version
Karpenter: v0.8.2
Kubernetes: v1.21.5
Context
We are using the Calico CNI + Istio in our clusters. istio-validation is an init container injected into all our pods. When a pod is assigned to a node, the first run of istio-validation always fails with the error Init:ContainerStatusUnknown (because the init container starts executing before istio-cni-node-xxx is ready); this in turn causes the pod to be rescheduled on another node. I understand that the failing init container is our own fault, but there could be other circumstances that cause a pod to fail at the init stage.
Actual Behavior
Karpenter starts provisioning a node with the same capacity to assign this pod to, despite there being plenty of empty nodes (with available capacity) provisioned a few minutes ago (ttlSecondsAfterEmpty: 360).
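For context, the relevant part of the "default" provisioner (only ttlSecondsAfterEmpty above is our real value; the rest is a generic sketch of a v1alpha5 Provisioner):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # empty nodes should be reclaimed after 6 minutes
  ttlSecondsAfterEmpty: 360
  requirements:
  - key: karpenter.sh/capacity-type   # illustrative requirement only
    operator: In
    values: ["on-demand"]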
Expected Behavior
Karpenter does not start provisioning a new node and assigning the recreated pod to it, but instead lets the default scheduler move this pod onto other nodes with available capacity, OR allows the pod to be restarted/recreated on the same node.
Steps to Reproduce the Problem
Any deployment/STS with an init container that completes with exit code > 0.
Resource Specs and Logs
No noticeable messages found in the logs