CentaurusInfra / arktos

Arktos for large-scale cloud platform

[kube-up][scale out] fluentd-gcp-v3.2.0-* pods keep restarting #1406

Open · Sindica opened this issue 2 years ago

Sindica commented 2 years ago

What happened: In a kube-up scale-out 2x2 environment, the fluentd-gcp-v3.2.0-* pods keep restarting in both TPs (tenant partitions). The fluentd-gcp-scaler-* pods are stable in both TPs.

$ kubectl --kubeconfig cluster/kubeconfig.tp-1 get pods -owide -AT | grep fluentd
system   kube-system   fluentd-gcp-scaler-74b46b8776-82kmn                   8893177695529452993   1/1     Running            0          12h   11.0.0.16   ying-scaleout-rp-2-minion-group-dzg7   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-5cfxw                              5437283410036407252   1/1     Running            3          12h   10.40.0.7   ying-scaleout-rp-1-minion-group-7hlc   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-jz5kc                              325045786257080772    1/1     Running            19         12h   10.40.0.4   ying-scaleout-rp-1-master              <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-lnh4b                              5587298224783155515   1/1     Running            8          12h   10.40.0.6   ying-scaleout-rp-1-minion-group-js92   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-m4lk5                              6352210833544958076   1/1     Running            8          12h   10.40.0.9   ying-scaleout-rp-2-minion-group-j840   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-p62jv                              4348409786540398210   1/1     Running            10         12h   10.40.0.8   ying-scaleout-rp-2-minion-group-dzg7   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-zw7hz                              406961396072673367    1/1     Running            14         12h   10.40.0.5   ying-scaleout-rp-2-master              <none>           <none>

$ kubectl --kubeconfig cluster/kubeconfig.tp-2 get pods -owide -AT | grep fluentd
system   kube-system   fluentd-gcp-scaler-74b46b8776-2wj8v                   3546096773026681994   1/1     Running            0          12h   56.0.0.3    ying-scaleout-rp-1-minion-group-js92   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-5zzkc                              2829179266625120483   1/1     Running            9          12h   10.40.0.9   ying-scaleout-rp-2-minion-group-j840   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-79877                              7119621349156805034   1/1     Running            12         12h   10.40.0.4   ying-scaleout-rp-1-master              <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-cfsnb                              3541394236898540512   1/1     Running            8          12h   10.40.0.6   ying-scaleout-rp-1-minion-group-js92   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-dtjpk                              7533050460886488442   1/1     Running            3          12h   10.40.0.7   ying-scaleout-rp-1-minion-group-7hlc   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-n5qfh                              1142027455797718328   1/1     Running            9          12h   10.40.0.8   ying-scaleout-rp-2-minion-group-dzg7   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-p4cb9                              5841181164065941403   1/1     Running            8          12h   10.40.0.5   ying-scaleout-rp-2-master              <none>           <none>
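
To narrow down why the containers restart, a reasonable next step is to pull the previous container's logs and the last termination state. A minimal sketch using standard kubectl commands (the pod name fluentd-gcp-v3.2.0-jz5kc is taken from the listing above; depending on how this kubectl build addresses tenants, a flag for the system tenant may also be needed):

# Logs from the container instance that exited before the current one
$ kubectl --kubeconfig cluster/kubeconfig.tp-1 -n kube-system logs fluentd-gcp-v3.2.0-jz5kc --previous

# Last termination reason and exit code (OOMKilled, Error, liveness-probe kill, ...)
$ kubectl --kubeconfig cluster/kubeconfig.tp-1 -n kube-system get pod fluentd-gcp-v3.2.0-jz5kc \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'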

What you expected to happen: Find out why the fluentd-gcp-v3.2.0-* pods keep restarting, and whether this is normal or needs to be fixed.

How to reproduce it (as minimally and precisely as possible): I am using sonya's latest kube-up changes against master: https://github.com/CentaurusInfra/arktos/pull/1405

Sindica commented 2 years ago

This looks related to GCP log collection. It could be caused by the external-access restriction in the current Mizar integration. Postponing this until after the 130 release.
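
If the Mizar external-access theory holds, the restarts should correlate with fluentd being unable to flush its buffers to the GCP logging endpoint, which would explain restarts if the liveness probe watches for stale buffers (as the upstream fluentd-gcp manifest does). A quick check of outbound connectivity from inside the pod network, assuming curl is available in the fluentd-gcp image (it may not be):

# Any HTTP status code means outbound HTTPS works; a timeout points at egress being blocked
$ kubectl --kubeconfig cluster/kubeconfig.tp-1 -n kube-system exec fluentd-gcp-v3.2.0-jz5kc -- \
    curl -sS -m 5 -o /dev/null -w '%{http_code}\n' https://logging.googleapis.com/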