apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG] KubeBlocks crashes after stopping all nodes on GKE #3628

Closed ahjing99 closed 4 months ago

ahjing99 commented 1 year ago

➜  ~ kbcli version
Kubernetes: v1.25.8-gke.500
KubeBlocks: 0.6.0-alpha.13
kbcli: 0.6.0-alpha.13

After stopping all nodes of the GKE cluster, the KubeBlocks controller crashes:

➜  ~ k get pod -n kb-system | grep kubeblocks
kubeblocks-7f5fc565cd-2bpwx                             0/1     CrashLoopBackOff   7 (56s ago)     129m

➜  ~ k get pod
NAME                            READY   STATUS    RESTARTS   AGE
mongocluster-mongodb-0          3/3     Running   0          69m
mongocluster-mongodb-1          3/3     Running   0          69m
mongocluster-mongodb-2          3/3     Running   0          69m
mycluster-mysql-0               4/4     Running   0          69m
mycluster-mysql-1               4/4     Running   0          69m
mycluster-mysql-2               4/4     Running   0          23m
mycluster1-mysql-0              4/4     Running   0          22m
mycluster1-mysql-1              4/4     Running   0          22m
mycluster1-mysql-2              4/4     Running   0          22m
pgcluster-postgresql-0          5/5     Running   0          63m
rediscluster-redis-0            3/3     Running   0          63m
rediscluster-redis-1            3/3     Running   0          63m
rediscluster-redis-sentinel-0   1/1     Running   0          63m
rediscluster-redis-sentinel-1   1/1     Running   0          62m
rediscluster-redis-sentinel-2   1/1     Running   0          58m

➜  ~ k describe pod kubeblocks-7f5fc565cd-2bpwx -n kb-system
Name:         kubeblocks-7f5fc565cd-2bpwx
Namespace:    kb-system
Priority:     0
Node:         gke-yjtest-default-pool-c5641a32-2rt9/10.128.0.27
Start Time:   Wed, 07 Jun 2023 16:52:17 +0800
Labels:       app.kubernetes.io/instance=kubeblocks
              app.kubernetes.io/name=kubeblocks
              pod-template-hash=7f5fc565cd
Annotations:  <none>
Status:       Running
IP:           10.104.0.7
IPs:
  IP:           10.104.0.7
Controlled By:  ReplicaSet/kubeblocks-7f5fc565cd
Init Containers:
  tools:
    Container ID:  containerd://136bccd1e4d056c56d3a8b83c28adf3563f78c0ffbd20e574420426bce202eac
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools@sha256:0b83cdf43b998dc37f496e171cb789dfd0807d560884aeed96cda84fe2a5e557
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/true
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Jun 2023 18:44:55 +0800
      Finished:     Wed, 07 Jun 2023 18:44:55 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fnpp6 (ro)
Containers:
  manager:
    Container ID:  containerd://d976cfe170c5bcf559a547e8647fe314f13e302581bae2590f1a8266666e1149
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks@sha256:18f4df5ce5203ef9a5b2d300191120f86967d2e5faf29796635012d3a713212c
    Ports:         9443/TCP, 8081/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --health-probe-bind-address=:8081
      --metrics-bind-address=:8080
      --leader-elect
      --zap-devel=false
      --zap-time-encoding=iso8601
      --zap-encoder=console
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 07 Jun 2023 19:00:11 +0800
      Finished:     Wed, 07 Jun 2023 19:00:41 +0800
    Ready:          False
    Restart Count:  7
    Liveness:       http-get http://:health/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:      http-get http://:health/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CM_NAMESPACE:                    kb-system
      CM_AFFINITY:                     {"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"kb-controller","operator":"In","values":["true"]}]},"weight":100}]}}
      CM_TOLERATIONS:                  [{"effect":"NoSchedule","key":"kb-controller","operator":"Equal","value":"true"}]
      KUBEBLOCKS_IMAGE_PULL_POLICY:    IfNotPresent
      KUBEBLOCKS_TOOLS_IMAGE:          registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13
      KUBEBLOCKS_SERVICEACCOUNT_NAME:  kubeblocks
      VOLUMESNAPSHOT:                  true
      VOLUMESNAPSHOT_API_BETA:         true
      ADDON_JOB_TTL:
      ADDON_JOB_IMAGE_PULL_POLICY:     IfNotPresent
      KUBEBLOCKS_ADDON_SA_NAME:        kubeblocks-addon-installer
    Mounts:
      /etc/kubeblocks from manager-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fnpp6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  manager-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kubeblocks-manager-config
    Optional:  false
  kube-api-access-fnpp6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 kb-controller=true:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason           Age                   From             Message
  ----     ------           ----                  ----             -------
  Warning  NodeNotReady     99m                   node-controller  Node is not ready
  Warning  NetworkNotReady  97m                   kubelet          network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Normal   Pulling          96m                   kubelet          Pulling image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13"
  Normal   Pulled           96m                   kubelet          Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13" in 29.117747442s (47.618104594s including waiting)
  Normal   Created          96m                   kubelet          Created container tools
  Normal   Started          96m                   kubelet          Started container tools
  Normal   Pulling          96m                   kubelet          Pulling image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13"
  Normal   Pulled           92m                   kubelet          Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13" in 18.495264196s (3m9.33671681s including waiting)
  Normal   Started          61m (x4 over 92m)     kubelet          Started container manager
  Warning  Unhealthy        60m (x2 over 61m)     kubelet          Liveness probe failed: Get "http://10.104.0.7:8081/healthz": dial tcp 10.104.0.7:8081: connect: connection refused
  Warning  Unhealthy        60m (x6 over 61m)     kubelet          Readiness probe failed: Get "http://10.104.0.7:8081/readyz": dial tcp 10.104.0.7:8081: connect: connection refused
  Warning  BackOff          60m (x4 over 61m)     kubelet          Back-off restarting failed container
  Normal   Created          60m (x5 over 92m)     kubelet          Created container manager
  Normal   Pulled           60m (x4 over 83m)     kubelet          Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13" already present on machine
  Warning  NodeNotReady     20m                   node-controller  Node is not ready
  Warning  FailedMount      18m (x4 over 18m)     kubelet          MountVolume.SetUp failed for volume "manager-config" : object "kb-system"/"kubeblocks-manager-config" not registered
  Warning  NetworkNotReady  18m (x4 over 18m)     kubelet          network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedMount      18m (x4 over 18m)     kubelet          MountVolume.SetUp failed for volume "kube-api-access-fnpp6" : object "kb-system"/"kube-root-ca.crt" not registered
  Normal   Pulling          18m                   kubelet          Pulling image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13"
  Normal   Pulled           17m                   kubelet          Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13" in 31.053592479s (47.26951908s including waiting)
  Normal   Created          17m                   kubelet          Created container tools
  Normal   Started          17m                   kubelet          Started container tools
  Normal   Pulling          17m                   kubelet          Pulling image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13"
  Normal   Started          16m                   kubelet          Started container manager
  Normal   Pulled           16m                   kubelet          Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13" in 19.245281318s (1m37.651096461s including waiting)
  Warning  Unhealthy        15m (x2 over 15m)     kubelet          Readiness probe failed: Get "http://10.104.0.7:8081/readyz": dial tcp 10.104.0.7:8081: connect: connection refused
  Warning  Unhealthy        15m                   kubelet          Liveness probe failed: Get "http://10.104.0.7:8081/healthz": dial tcp 10.104.0.7:8081: connect: connection refused
  Normal   Created          15m (x2 over 16m)     kubelet          Created container manager
  Normal   Pulled           15m                   kubelet          Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks:0.6.0-alpha.13" already present on machine
  Warning  BackOff          3m32s (x45 over 14m)  kubelet          Back-off restarting failed container

➜  ~ k logs kubeblocks-7f5fc565cd-2bpwx  -n kb-system
Defaulted container "manager" out of: manager, tools (init)
Error from server: Get "https://10.128.0.27:10250/containerLogs/kb-system/kubeblocks-7f5fc565cd-2bpwx/manager": No agent available
ahjing99 commented 1 year ago

When all pods are scheduled onto a single node and that node is stopped, more logs are returned:

➜  ~ k logs kubeblocks-7f5fc565cd-zxvmk -n kb-system
Defaulted container "manager" out of: manager, tools (init)
2023-06-07T12:11:11.353Z    INFO    setup   config file: /etc/kubeblocks/config.yaml
2023-06-07T12:11:11.353Z    INFO    setup   config settings: map[alsologtostderr:false backup_pv_configmap_name: backup_pv_configmap_namespace: backup_pvc_create_policy: backup_pvc_init_capacity: backup_pvc_name: backup_pvc_storage_class: cert_dir:/tmp/k8s-webhook-server/serving-certs cm_namespace:kb-system cm_recon_retry_duration_ms:100 config_manager_grpc_port:9901 config_manager_log_level:info data_plane_affinity:{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"kb-data","operator":"In","values":["true"]}]},"weight":100}]}} data_plane_tolerations:[{"effect":"NoSchedule","key":"kb-data","operator":"Equal","value":"true"}] enable_debug_sysaccounts:false health_probe_bind_address::8081 kill_container_signal:SIGKILL kubeblocks_addon_helm_install_options:[--atomic --cleanup-on-fail --wait] kubeblocks_addon_helm_uninstall_options:[] kubeblocks_addon_sa_name:kubeblocks-addon-installer kubeblocks_serviceaccount_name:kubeblocks kubeblocks_tools_image:registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.6.0-alpha.13 kubeconfig: leader_elect:true log_backtrace_at::0 log_dir: logtostderr:false maxconcurrentreconciles_addon:8 maxconcurrentreconciles_clusterdef:8 maxconcurrentreconciles_clusterversion:8 maxconcurrentreconciles_dataprotection:8 metrics_bind_address::8080 pod_min_ready_seconds:10 probe_service_grpc_port:50001 probe_service_http_port:3501 probe_service_log_level:info stderrthreshold:2 v:0 vmodule: volumesnapshot:true volumesnapshot_api_beta:true zap_devel:false zap_encoder:console zap_log_level: zap_stacktrace_level: zap_time_encoding:iso8601]
2023-06-07T12:11:41.355Z    ERROR   Failed to get API Group-Resources   {"error": "Get \"https://10.116.0.1:443/api?timeout=32s\": dial tcp 10.116.0.1:443: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/cluster.New
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/cluster/cluster.go:161
sigs.k8s.io/controller-runtime/pkg/manager.New
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/manager/manager.go:359
main.main
    /src/cmd/manager/main.go:226
runtime.main
    /usr/local/go/src/runtime/proc.go:250
2023-06-07T12:11:41.356Z    ERROR   setup   unable to start manager {"error": "Get \"https://10.116.0.1:443/api?timeout=32s\": dial tcp 10.116.0.1:443: i/o timeout"}
main.main
    /src/cmd/manager/main.go:255
runtime.main
    /usr/local/go/src/runtime/proc.go:250
github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open for 30 days with no activity