Closed. infestonn closed this issue 2 years ago.
@infestonn I believe you are running into the same root cause as #628. Are there taints on the nodes that come up that prevent kube-scheduler from binding the pods, causing Karpenter to provision new capacity?
@tzneal No, we don't have any taints in the matching provisioner "default", as you can see. I must say this is not the only provisioner we have. There are three more provisioners which have taints, but they do not match the pods I described.
Sorry, I was referring to taints that Istio or something else may be putting on the node. Does kubectl get events show why the pod was evicted? Karpenter shouldn't schedule new capacity unless for some reason kube-scheduler evicts and marks the pod as unschedulable again.
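For example (pod, namespace, and node names below are placeholders), something like this should show both the pod's events and any taints on the node:

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
kubectl get node <node-name> -o jsonpath='{.spec.taints}'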
taints:
- effect: NoSchedule
  key: karpenter.sh/not-ready
- effect: NoSchedule
  key: node.kubernetes.io/not-ready
  timeAdded: "2022-04-19T07:50:11Z"
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  timeAdded: "2022-04-19T07:50:12Z"
These are the taints I see while the node is being provisioned; I believe they are the common taints for a node that is joining a cluster.
At the time the pod fails there are no taints at all.
To be accurate, the pods are not "Evicted": their status is Init:ContainerStatusUnknown.
What is the restart policy of your deployment? If it's set to Always, the init container should restart (see https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Regarding your suggested change, Karpenter currently binds pods to nodes before they become ready. This allows us to enforce our bin-packing decisions. However, we are actively investigating not doing this pre-binding.
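For context, "binding" here means the standard pods/binding subresource, i.e. setting the pod's nodeName before the node reports Ready; it is roughly the equivalent of creating an object like this (names are placeholders):

apiVersion: v1
kind: Binding
metadata:
  name: <pod-name>
  namespace: <namespace>
target:
  apiVersion: v1
  kind: Node
  name: <node-name>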
I already did some research regarding the restart policy. The sad part is that the deployment's restartPolicy is Always, which makes things more confusing.
All the nodes below are empty; the 8 pods on each node are just DaemonSet pods.
Pod requests: CPU=5200m, MEM=24960Mi
Any deployment/STS with an init container that completes with exit code > 0.
I was not quite right: running a deployment with an init container whose exit status is unsuccessful does result in the init container being restarted, as it should (https://kubernetes.io/docs/concepts/workloads/pods/init-containers/); see the minimal example at the end of this comment.
Whereas the istio-validation container fails with a slightly different status, Init:ContainerStatusUnknown, and the message "The container could not be located when the pod was terminated". In this case the restart policy does not kick in, which is why a new pod is created and Karpenter starts provisioning a new node.
The question of why the pod isn't scheduled on an empty node is still open.
@infestonn Is this a node that has not become ready yet, or an existing ready node in your cluster? If it's an existing ready node in your cluster, then kube-scheduler is responsible for scheduling those pods. You can look at the output of kubectl describe pod pod-name to see why it failed to schedule.
Existing ones. I agree with you that kube-scheduler is responsible for that. But I'm afraid I cannot get any useful info from describing a pod. Once the init container fails, the pod dies and the ReplicaSet controller creates a new one, which is picked up by Karpenter and instantly bound to a new node. I wonder if there is any mechanism in Karpenter which somehow breaks the regular k8s scheduling workflow? PS: we don't have any custom schedulers.
There shouldn't be, does describing the pod not show any events?
Here's an example of a pod's events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 89s (x2 over 91s) default-scheduler 0/59 nodes are available: 2 node(s) had taint {karpenter: true}, that the pod didn't tolerate, 3 node(s) didn't match pod topology spread constraints, 32 Insufficient cpu, 40 Insufficient memory, 5 node(s) had taint {prometheus-workload: true}, that the pod didn't tolerate, 6 node(s) had taint {infrastructure-workload: true}, that the pod didn't tolerate.
Normal SuccessfulAttachVolume 33s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-243d412c-970e-11eb-a6af-0251424bfb4b"
Warning NetworkNotReady 30s (x8 over 44s) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Warning FailedCreatePodSandBox 27s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e199716a4a246514e51f25a92e6303286fe690c0ec12cfe1e6194bbf190b3cf7" network for pod "apod-es-7": networkPlugin cni failed to set up pod "apod-es-7_apod" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Normal SandboxChanged 25s kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 24s kubelet Pulling image "111111111111123.dkr.ecr.us-west-1.amazonaws.com/istio/proxyv2:1.12.3"
Normal Pulled 19s kubelet Successfully pulled image "111111111111123.dkr.ecr.us-west-1.amazonaws.com/istio/proxyv2:1.12.3" in 5.207504782s
Normal Created 6s (x2 over 12s) kubelet Created container istio-validation
Normal Started 6s (x2 over 12s) kubelet Started container istio-validation
Normal Pulled 6s kubelet Container image "111111111111123.dkr.ecr.us-west-1.amazonaws.com/istio/proxyv2:1.12.3" already present on machine
Is there a helm chart I can use to try to replicate this? Those events make it look like the pod is going to run as the last message is about pulling the image.
The pod does run. The issue here is that it fails and the controller creates a new one. This is the status of the istio-validation container right after it exits with an error:
State:          Terminated
  Reason:       ContainerStatusUnknown
  Message:      The container could not be located when the pod was terminated
  Exit Code:    137
  Started:      Mon, 01 Jan 0001 00:00:00 +0000
  Finished:     Mon, 01 Jan 0001 00:00:00 +0000
Last State:     Terminated
  Reason:       Error
  Exit Code:    126
  Started:      Tue, 26 Apr 2022 16:16:28 +0300
  Finished:     Tue, 26 Apr 2022 16:16:33 +0300
"The container could not be located when the pod was terminated": I believe this is the root cause of why restartPolicy: Always doesn't work here.
I also tried to work around this by adding a delay to the calico-node pod, so that my newly joined node stayed in "NotReady" status (NetworkUnavailable=true) for ~120s. I hoped that would let istio-cni and the other DaemonSets finish bootstrapping. Unfortunately nothing changed. A container is created on the node before the CNI is initialized, which breaks its networking, and only recreating the container helps, not restarting it. But that is just my guess.
Is there a helm chart I can use to try to replicate this?
I will try to compose a minimum set of components to replicate this.
@tzneal I'm unable to fully reproduce the issue 😞 But the overall situation is clearer to me now. I know exactly what makes the pods fail: the race condition described here:
There is a time gap between a node becomes schedulable and the Istio CNI plugin becomes ready on that node. If an application pod starts up during this time, it is possible that traffic redirection is not properly set up and traffic would be able to bypass the Istio sidecar. This race condition is mitigated by a “detect and repair” method.
We have Istio installed with the CNI plugin:
cat <<EOF | istioctl install -y -f -
apiVersion: null
kind: IstioOperator
spec:
  hub: istio
  components:
    cni:
      enabled: true
      namespace: istio-system
    pilot:
      k8s:
        tolerations:
        - key: infrastructure-workload
          operator: Equal
          value: "true"
          effect: NoSchedule
  meshConfig:
    defaultConfig:
      holdApplicationUntilProxyStarts: true
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
  values:
    cni:
      excludeNamespaces:
      - istio-system
      - kube-system
      repair:
        deletePods: true
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 2000m
            memory: 2048Mi
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - curl -X POST localhost:15000/drain_listeners?inboundonly; while [ $(netstat -plunt | grep tcp | grep -v envoy | grep -v pilot-agent | wc -l | xargs) -ne 0 ]; do sleep 1; done
EOF
To mitigate the race between an application pod and the Istio CNI DaemonSet, an istio-validation init container is added as part of the sidecar injection, which detects if traffic redirection is set up correctly, and blocks the pod starting up if not. The CNI DaemonSet will detect and evict any pod stuck in such state. When the new pod starts up, it should have traffic redirection set up properly. This mitigation is enabled by default and can be turned off by setting values.cni.repair.enabled to false.
So the deletePods: true flag does exactly what it is supposed to do. Changing it to false does not help either: the istio-validation init container then restarts following the restart policy described above, but keeps failing over and over again because the container's network is broken.
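For completeness, this is roughly the overlay that controls that behaviour (an excerpt of the IstioOperator values above; the field names follow the istio-cni chart):

values:
  cni:
    repair:
      enabled: true      # the "detect and repair" mitigation; false disables it
      deletePods: true   # false stops the CNI DaemonSet from deleting stuck pods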
This is my test StatefulSet:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: inflate
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inflate
  serviceName: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: inflate
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
        resources:
          requests:
            cpu: 3
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: inflate
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: inflate-models
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem
EOF
Its spec is the same as that of our application in the production env. The pods of this STS fail in the same way, with the Init:ContainerStatusUnknown error, but the difference is that they are instantly rescheduled on the same node and the provisioning loop never happens 😞
@infestonn I think this is similar to a Cilium issue that we are looking at ways to solve. Cilium wants a taint to be placed on the node at startup, and then it removes the taint when the node networking is fully configured. Can Istio notify us by removing a taint? Or does it just use node readiness?
Related: #1727
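For illustration, the Cilium pattern mentioned above works roughly like this (the taint name is the one Cilium documents; the exact effect varies by version):

# taint applied when the node registers, e.g. via kubelet --register-with-taints
# or the node template used by the autoscaler
node.cilium.io/agent-not-ready=true:NoSchedule

# once the agent has configured networking on the node it removes the taint,
# roughly equivalent to:
kubectl taint nodes <node-name> node.cilium.io/agent-not-ready-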
To my knowledge it relies on node readiness. The main CNI (Calico, aws-cni, etc.) is in charge of the network configuration on a node and updates the node status when it is ready. Only after that does the Istio pod, as a chained CNI, start running, along with the rest of the pods on the node.
Note: The Istio CNI plugin operates as a chained CNI plugin, and it is designed to be used with another CNI plugin, such as PTP or Calico. See compatibility with other CNI plugins for details.
@infestonn Thanks for the info. We are actively investigating not binding pods to nodes. In this case, we would just launch the node and allow kube-scheduler to bind pods after the node has become ready. It should avoid the issue that you are seeing where the pod is bound before the node is ready and initialization fails.
Labeled for closure due to inactivity in 10 days.
Version
Karpenter: v0.8.2
Kubernetes: v1.21.5
Context
We are using the Calico CNI + Istio in our clusters. istio-validation is an init container injected into all our pods. When a pod is assigned to a node, the first run of istio-validation always fails with the error Init:ContainerStatusUnknown (because the init container starts executing before istio-cni-node-xxx is ready); this in turn causes the pod to be rescheduled on another node. I understand that the failing init container is our own fault, but there could be other circumstances that cause a pod to fail at the init stage.
Actual Behavior
Karpenter starts provisioning a node with the same capacity to assign this pod to, despite there being plenty of empty nodes (with available capacity) provisioned a few minutes ago (ttlSecondsAfterEmpty: 360).
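For context, the relevant part of the "default" provisioner (only ttlSecondsAfterEmpty above is our real value; the rest is a generic sketch of a v1alpha5 Provisioner):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # empty nodes should be reclaimed after 6 minutes
  ttlSecondsAfterEmpty: 360
  requirements:
  - key: karpenter.sh/capacity-type   # illustrative requirement only
    operator: In
    values: ["on-demand"]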
Expected Behavior
Karpenter does not start provisioning a new node and assigning the recreated pod to it, but instead lets the default scheduler move this pod onto other nodes with available capacity, OR allows the pod to be restarted/recreated on the same node.
Steps to Reproduce the Problem
Any deployment/STS with an init container that completes with exit code > 0.
Resource Specs and Logs
No noticeable messages found in the logs