Closed Hong-Chang closed 2 years ago
I tried the change and still see the issue. CNI logs have the same problem.
root@vinay-test-scaleup-master:~# kubectl get po -AT -owide
TENANT NAMESPACE NAME HASHKEY READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
system default mizar-daemon-d4wlb 6200696760787091141 1/1 Running 0 61m 10.40.0.3 vinay-test-scaleup-minion-group-clqb <none> <none>
system default mizar-daemon-svvjt 176257927353498585 1/1 Running 0 61m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system default mizar-operator-7c999fdc5d-fd48s 283161499659780697 1/1 Running 0 61m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system default netpod1 5756151989355618511 1/1 Running 0 49m 175.172.0.30 vinay-test-scaleup-minion-group-clqb <none> <none>
system default netpod2 7921576421427798107 0/1 Pending 0 49m <none> <none> <none> <none>
system kube-system arktos-network-controller-vinay-test-scaleup-master 3814323996954979909 1/1 Running 0 61m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system coredns-75c65c444f-t4vdc 4032077812886267571 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system coredns-default-689ccbb987-bvjpf 8522204212573833422 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system etcd-empty-dir-cleanup-vinay-test-scaleup-master 5209665295957676882 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system etcd-server-events-vinay-test-scaleup-master 2743859338809797379 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system etcd-server-vinay-test-scaleup-master 3707224456605576163 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system event-exporter-v0.2.5-868dff6494-brgdf 2852235485967061916 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system fluentd-gcp-scaler-74b46b8776-5gmnc 4347325301847180329 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system fluentd-gcp-v3.2.0-67jwq 1061354830373000902 1/1 Running 0 61m 10.40.0.3 vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system fluentd-gcp-v3.2.0-z42fr 1565803476226196832 1/1 Running 0 61m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system heapster-v1.6.0-beta.1-57874ccf9d-7phrz 6940853441142956705 0/2 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system kube-addon-manager-vinay-test-scaleup-master 5014754618061431440 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system kube-apiserver-vinay-test-scaleup-master 2973271291999151932 1/1 Running 0 61m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system kube-controller-manager-vinay-test-scaleup-master 3002757412526360181 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system kube-dns-autoscaler-748b78969c-7vfm7 286561218688347397 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system kube-proxy-vinay-test-scaleup-master 2919100408821379765 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system kube-proxy-vinay-test-scaleup-minion-group-clqb 2919100408821379765 1/1 Running 0 61m 10.40.0.3 vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system kube-scheduler-vinay-test-scaleup-master 8466941937980522938 1/1 Running 1 61m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system kubernetes-dashboard-848965699-ggfzm 5093790579082669371 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system l7-default-backend-6497bc5bf6-9vm9s 3892533580702677450 0/1 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
system kube-system l7-lb-controller-v1.2.3-vinay-test-scaleup-master 5013348416365870850 1/1 Running 0 60m 10.40.0.2 vinay-test-scaleup-master <none> <none>
system kube-system metrics-server-v0.3.3-5f994fcb77-6zjtd 5954466197702541044 0/2 ContainerCreating 0 61m <none> vinay-test-scaleup-minion-group-clqb <none> <none>
root@vinay-test-scaleup-master:~#
Yes, this is expected. The key issue to be resolved is that the produced interface was popped from the map too early. If anything fails after that point, the pod cannot reach the Running state. I will fix that in a separate PR. This PR has been repurposed: it does not solve the stuck-pod issue, but it resolves some problems we found during the investigation, for example handling panics and preserving logs in the CNI.
We are observing an issue where system-created pods may get stuck in ContainerCreating, while normal pods are not affected. The major difference between system pods and normal pods here is that system pods are created at a very early stage, when the system and mizar may not be ready yet.
When a pod is created, the following happens in mizar:
The ordering of steps 3 and 4 is the issue: the interface is removed from the map too early. If any exception occurs in step 4, the interface has already been removed, so there is no chance to retry.
When a pod is created at an early stage, mizar is not fully ready or is still transitioning to the ready state, so step 4 may fail and throw an exception.
The fix is to reorder steps 3 and 4: the interface is removed from the map only after step 4 has consumed it successfully.
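The difference between the two orderings can be illustrated with a minimal Python sketch. This is not the actual mizar code: `InterfaceMap`, `consume_early_removal`, `consume_late_removal`, and `make_flaky_consumer` are hypothetical names invented for illustration, and the "consume" step is simulated by a callback that fails on its first call, standing in for mizar not being ready yet.

```python
class InterfaceMap:
    """Hypothetical stand-in for the map of produced interfaces, keyed by pod name."""

    def __init__(self):
        self._pending = {}

    def produce(self, pod, interface):
        self._pending[pod] = interface

    def has_pending(self, pod):
        return pod in self._pending

    def consume_early_removal(self, pod, consume):
        # Problematic ordering: remove the interface first, then consume it.
        # If consume() raises, the interface is already gone and cannot be retried.
        interface = self._pending.pop(pod)
        consume(interface)

    def consume_late_removal(self, pod, consume):
        # Reordered: consume first, remove only after consume() succeeded.
        # If consume() raises, the interface stays in the map for a later retry.
        interface = self._pending[pod]
        consume(interface)
        del self._pending[pod]


def make_flaky_consumer():
    """Return a consumer that fails on its first call, simulating a not-ready system."""
    state = {"calls": 0}

    def consume(interface):
        state["calls"] += 1
        if state["calls"] == 1:
            raise RuntimeError("not ready yet")

    return consume
```

With the early-removal ordering, a single failure in the consume step leaves nothing in the map to retry, which matches the stuck ContainerCreating symptom described above. With the late-removal ordering, a caller can simply invoke the consume step again once the system is ready.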
I tried the fix and the issue no longer reproduces.
What type of PR is this? /kind bug