CentaurusInfra / mizar

Mizar – Experimental, High Scale and High Performance Cloud Network https://mizar.readthedocs.io
GNU General Public License v2.0

Fix pod stuck in ContainerCreating issue #626

Closed Hong-Chang closed 2 years ago

Hong-Chang commented 2 years ago

We have observed that system-created pods can get stuck in ContainerCreating, while normal pods do not show the issue. The major difference between the two is that system pods are created at a very early stage, when the system, and Mizar in particular, may not be ready yet.

When a pod is created, Mizar does the following:

  1. Produce the interface and put it into a map to indicate that the interface has been produced.
  2. Check the map to see whether the interface has been produced.
  3. If the interface is in the map, get it from the map and remove it from the map.
  4. Consume the interface obtained from the map.

The ordering of steps 3 and 4 is the issue: we remove the interface from the map too early. If an exception occurs in step 4, the interface has already been removed, so there is no chance to retry.

When a pod is created at an early stage, Mizar is not fully ready or is still transitioning to the ready state, so step 4 can fail and throw an exception.

The fix is to reorder steps 3 and 4: remove the interface from the map only after step 4 has consumed it successfully.

I tried the fix and the issue no longer reproduces.
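
To make the ordering concrete, here is a minimal Python sketch of the two variants. It is not Mizar's actual code: the `produced` map, the function names, and the `activate` callback are all hypothetical stand-ins for the produce and consume steps described above.

```python
# Hypothetical sketch; names are illustrative and not taken from Mizar.
produced = {}  # pod name -> interface recorded by the produce step


def produce(pod, interface):
    """Step 1: record that the interface for `pod` has been produced."""
    produced[pod] = interface


def consume_pop_first(pod, activate):
    """Original ordering: remove from the map (step 3), then consume (step 4)."""
    if pod not in produced:
        return False
    interface = produced.pop(pod)
    activate(interface)  # if this raises, the entry is already gone
    return True


def consume_then_remove(pod, activate):
    """Reordered: consume first, remove only after the consume succeeds."""
    if pod not in produced:
        return False
    activate(produced[pod])  # if this raises, the entry stays for a retry
    del produced[pod]
    return True
```

With `consume_pop_first`, a failure inside `activate` leaves nothing in the map, so a later attempt finds no interface. With `consume_then_remove`, the entry survives the failure and a retry can pick it up once the consumer is ready.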

What type of PR is this? /kind bug

vinaykul commented 2 years ago

I tried the change and still see the issue. CNI logs have the same problem.

root@vinay-test-scaleup-master:~# kubectl get po -AT -owide
TENANT   NAMESPACE     NAME                                                  HASHKEY               READY   STATUS              RESTARTS   AGE   IP             NODE                                   NOMINATED NODE   READINESS GATES
system   default       mizar-daemon-d4wlb                                    6200696760787091141   1/1     Running             0          61m   10.40.0.3      vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   default       mizar-daemon-svvjt                                    176257927353498585    1/1     Running             0          61m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   default       mizar-operator-7c999fdc5d-fd48s                       283161499659780697    1/1     Running             0          61m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   default       netpod1                                               5756151989355618511   1/1     Running             0          49m   175.172.0.30   vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   default       netpod2                                               7921576421427798107   0/1     Pending             0          49m   <none>         <none>                                 <none>           <none>
system   kube-system   arktos-network-controller-vinay-test-scaleup-master   3814323996954979909   1/1     Running             0          61m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   coredns-75c65c444f-t4vdc                              4032077812886267571   0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   coredns-default-689ccbb987-bvjpf                      8522204212573833422   0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   etcd-empty-dir-cleanup-vinay-test-scaleup-master      5209665295957676882   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   etcd-server-events-vinay-test-scaleup-master          2743859338809797379   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   etcd-server-vinay-test-scaleup-master                 3707224456605576163   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   event-exporter-v0.2.5-868dff6494-brgdf                2852235485967061916   0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   fluentd-gcp-scaler-74b46b8776-5gmnc                   4347325301847180329   0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-67jwq                              1061354830373000902   1/1     Running             0          61m   10.40.0.3      vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-z42fr                              1565803476226196832   1/1     Running             0          61m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   heapster-v1.6.0-beta.1-57874ccf9d-7phrz               6940853441142956705   0/2     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   kube-addon-manager-vinay-test-scaleup-master          5014754618061431440   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   kube-apiserver-vinay-test-scaleup-master              2973271291999151932   1/1     Running             0          61m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   kube-controller-manager-vinay-test-scaleup-master     3002757412526360181   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   kube-dns-autoscaler-748b78969c-7vfm7                  286561218688347397    0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   kube-proxy-vinay-test-scaleup-master                  2919100408821379765   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   kube-proxy-vinay-test-scaleup-minion-group-clqb       2919100408821379765   1/1     Running             0          61m   10.40.0.3      vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   kube-scheduler-vinay-test-scaleup-master              8466941937980522938   1/1     Running             1          61m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   kubernetes-dashboard-848965699-ggfzm                  5093790579082669371   0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   l7-default-backend-6497bc5bf6-9vm9s                   3892533580702677450   0/1     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
system   kube-system   l7-lb-controller-v1.2.3-vinay-test-scaleup-master     5013348416365870850   1/1     Running             0          60m   10.40.0.2      vinay-test-scaleup-master              <none>           <none>
system   kube-system   metrics-server-v0.3.3-5f994fcb77-6zjtd                5954466197702541044   0/2     ContainerCreating   0          61m   <none>         vinay-test-scaleup-minion-group-clqb   <none>           <none>
root@vinay-test-scaleup-master:~#
Hong-Chang commented 2 years ago

Yes, this is expected. The key issue to be resolved is that the produced interface is popped from the map too early. If anything fails after that, the pod cannot reach the Running state. I will use another PR to fix that issue. This PR is repurposed: it does not solve the stuck-pod issue, but it resolves some issues we found during the investigation, for example handling panics in the CNI and preserving logs.