apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG]kafka/starrocks restart ops failed #7578

Open ahjing99 opened 2 months ago

ahjing99 commented 2 months ago

➜ ~ kbcli version Kubernetes: v1.29.4-gke.1043002 KubeBlocks: 0.9.0-beta.34 kbcli: 0.9.0-beta.27

During the restart of a Kafka cluster in combined mode, the broker pod crashes for a few seconds before it starts running, and the restart ops fails before the cluster/pods turn to Running; perhaps the ops could wait longer.

➜  ~ kbcli cluster create  kafka kafka-lydwaa                 --mode='combined'                 --cpu=0.5                 --memory=0.5                 --storage=1                 --availability-policy=none --termination-policy=Delete --version=kafka-3.3.2  --storage-enable=true                 --meta-storage=1 --replicas=1
Cluster kafka-lydwaa created

➜  ~ kbcli cluster describe kafka-lydwaa
Name: kafka-lydwaa   Created Time: Jun 20,2024 15:53 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION       STATUS    TERMINATION-POLICY
default     kafka                kafka-3.3.2   Running   Delete

Endpoints:
COMPONENT   MODE        INTERNAL                                                    EXTERNAL
broker      ReadWrite   kafka-lydwaa-broker-broker.default.svc.cluster.local:9092   <none>

Topology:
COMPONENT     INSTANCE                     ROLE     STATUS    AZ              NODE                                                  CREATED-TIME
broker        kafka-lydwaa-broker-0        <none>   Running   us-central1-c   gke-yjtest-default-pool-36251504-xw1f/10.128.15.202   Jun 20,2024 15:53 UTC+0800
metrics-exp   kafka-lydwaa-metrics-exp-0   <none>   Running   us-central1-c   gke-yjtest-default-pool-36251504-z5l7/10.128.15.197   Jun 20,2024 15:53 UTC+0800

Resources Allocation:
COMPONENT     DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
broker        false       500m / 500m          512Mi / 512Mi           data:1Gi       kb-default-sc
                                                                       metadata:1Gi   kb-default-sc
metrics-exp   false       500m / 500m          512Mi / 512Mi           <none>         <none>

Images:
COMPONENT     TYPE             IMAGE
broker        kafka-server     docker.io/bitnami/kafka:3.3.2-debian-11-r54
metrics-exp   kafka-exporter   docker.io/bitnami/kafka-exporter:1.6.0-debian-11-r67

Show cluster events: kbcli cluster list-events -n default kafka-lydwaa

➜  ~ kbcli cluster restart kafka-lydwaa
Please type the name again(separate with white space when more than one): kafka-lydwaa
OpsRequest kafka-lydwaa-restart-dpmmd created successfully, you can view the progress:
    kbcli cluster describe-ops kafka-lydwaa-restart-dpmmd -n default

➜  ~ k get pod
NAME                           READY   STATUS             RESTARTS      AGE
kafka-lydwaa-broker-0          1/2     Running            0             16s
kafka-lydwaa-metrics-exp-0     0/1     CrashLoopBackOff   1 (12s ago)   18s

➜  ~ k logs kafka-lydwaa-metrics-exp-0 --previous
I0620 07:57:07.230134       1 kafka_exporter.go:792] Starting kafka_exporter (version=1.6.0, branch=non-git, revision=non-git)
F0620 07:57:07.991717       1 kafka_exporter.go:893] Error Init Kafka Client: kafka: client has run out of available brokers to talk to: dial tcp 10.124.1.31:9092: connect: connection refused

➜  ~ k describe ops kafka-lydwaa-restart-dpmmd
Name:         kafka-lydwaa-restart-dpmmd
Namespace:    default
Labels:       app.kubernetes.io/instance=kafka-lydwaa
              app.kubernetes.io/managed-by=kubeblocks
              ops.kubeblocks.io/ops-type=Restart
Annotations:  <none>
API Version:  apps.kubeblocks.io/v1alpha1
Kind:         OpsRequest
Metadata:
  Creation Timestamp:  2024-06-20T07:56:46Z
  Finalizers:
    opsrequest.kubeblocks.io/finalizer
  Generate Name:  kafka-lydwaa-restart-
  Generation:     2
  Managed Fields:
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:app.kubernetes.io/instance:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:clusterName:
        f:preConditionDeadlineSeconds:
        f:restart:
          .:
          k:{"componentName":"broker"}:
            .:
            f:componentName:
          k:{"componentName":"metrics-exp"}:
            .:
            f:componentName:
        f:type:
    Manager:      kbcli
    Operation:    Update
    Time:         2024-06-20T07:56:46Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"opsrequest.kubeblocks.io/finalizer":
        f:labels:
          f:ops.kubeblocks.io/ops-type:
        f:ownerReferences:
          .:
          k:{"uid":"9e232459-fe5e-42f1-b98a-8f7d93ca4692"}:
    Manager:      manager
    Operation:    Update
    Time:         2024-06-20T07:56:46Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:clusterGeneration:
        f:completionTimestamp:
        f:components:
          .:
          f:broker:
            .:
            f:phase:
            f:progressDetails:
          f:metrics-exp:
            .:
            f:phase:
            f:progressDetails:
        f:conditions:
          .:
          k:{"type":"Failed"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
          k:{"type":"Restarting"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
          k:{"type":"Validated"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
          k:{"type":"WaitForProgressing"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
        f:phase:
        f:progress:
        f:startTimestamp:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2024-06-20T07:57:23Z
  Owner References:
    API Version:     apps.kubeblocks.io/v1alpha1
    Kind:            Cluster
    Name:            kafka-lydwaa
    UID:             9e232459-fe5e-42f1-b98a-8f7d93ca4692
  Resource Version:  248776
  UID:               1a90835d-5585-42ca-bc12-85b7aabef315
Spec:
  Cluster Name:                    kafka-lydwaa
  Pre Condition Deadline Seconds:  0
  Restart:
    Component Name:  broker
    Component Name:  metrics-exp
  Type:              Restart
Status:
  Cluster Generation:    3
  Completion Timestamp:  2024-06-20T07:57:23Z
  Components:
    Broker:
      Phase:  Running
      Progress Details:
        End Time:    2024-06-20T07:57:22Z
        Message:     Successfully restart: Pod/kafka-lydwaa-broker-0 in Component: broker
        Object Key:  Pod/kafka-lydwaa-broker-0
        Start Time:  2024-06-20T07:56:46Z
        Status:      Succeed
    Metrics - Exp:
      Phase:  Failed
      Progress Details:
        End Time:    2024-06-20T07:56:51Z
        Message:     Failed to restart: Pod/kafka-lydwaa-metrics-exp-0 in Component: metrics-exp, message:
        Object Key:  Pod/kafka-lydwaa-metrics-exp-0
        Start Time:  2024-06-20T07:56:46Z
        Status:      Failed
  Conditions:
    Last Transition Time:  2024-06-20T07:56:46Z
    Message:               wait for the controller to process the OpsRequest: kafka-lydwaa-restart-dpmmd in Cluster: kafka-lydwaa
    Reason:                WaitForProgressing
    Status:                True
    Type:                  WaitForProgressing
    Last Transition Time:  2024-06-20T07:56:46Z
    Message:               OpsRequest: kafka-lydwaa-restart-dpmmd is validated
    Reason:                ValidateOpsRequestPassed
    Status:                True
    Type:                  Validated
    Last Transition Time:  2024-06-20T07:56:46Z
    Message:               Start to restart database in Cluster: kafka-lydwaa
    Reason:                RestartStarted
    Status:                True
    Type:                  Restarting
    Last Transition Time:  2024-06-20T07:57:23Z
    Message:               Failed to process OpsRequest: kafka-lydwaa-restart-dpmmd in cluster: kafka-lydwaa, more detailed informations in status.components
    Reason:                OpsRequestFailed
    Status:                False
    Type:                  Failed
  Phase:                   Failed
  Progress:                2/2
  Start Timestamp:         2024-06-20T07:56:46Z
Events:
  Type     Reason                    Age                    From                    Message
  ----     ------                    ----                   ----                    -------
  Normal   WaitForProgressing        5m23s (x2 over 5m23s)  ops-request-controller  wait for the controller to process the OpsRequest: kafka-lydwaa-restart-dpmmd in Cluster: kafka-lydwaa
  Normal   ValidateOpsRequestPassed  5m23s                  ops-request-controller  OpsRequest: kafka-lydwaa-restart-dpmmd is validated
  Normal   RestartStarted            5m23s                  ops-request-controller  Start to restart database in Cluster: kafka-lydwaa
  Normal   Processing                5m23s                  ops-request-controller  Start to restart: Pod/kafka-lydwaa-broker-0 in Component: broker
  Normal   Succeed                   5m18s                  ops-request-controller  Successfully restart: Pod/kafka-lydwaa-metrics-exp-0 in Component: metrics-exp
  Normal   Processing                5m17s (x3 over 5m23s)  ops-request-controller  Start to restart: Pod/kafka-lydwaa-metrics-exp-0 in Component: metrics-exp
  Warning  Failed                    5m1s                   ops-request-controller  Failed to restart: Pod/kafka-lydwaa-metrics-exp-0 in Component: metrics-exp, message:
  Normal   Succeed                   4m47s                  ops-request-controller  Successfully restart: Pod/kafka-lydwaa-broker-0 in Component: broker
  Warning  OpsRequestFailed          4m46s (x2 over 4m46s)  ops-request-controller  Failed to process OpsRequest: kafka-lydwaa-restart-dpmmd in cluster: kafka-lydwaa, more detailed informations in status.components
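For reference, the kbcli-generated OpsRequest above has `preConditionDeadlineSeconds: 0`. A minimal hand-written equivalent, reconstructed from the spec shown above, would look like the sketch below; whether raising `preConditionDeadlineSeconds` would actually cover the transient crash window is an assumption, not verified behavior:

```yaml
apiVersion: apps.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  generateName: kafka-lydwaa-restart-
  namespace: default
spec:
  clusterName: kafka-lydwaa
  type: Restart
  # kbcli generated 0 above; a larger value here is an assumption
  # about how to give the components more time, not confirmed.
  preConditionDeadlineSeconds: 300
  restart:
    - componentName: broker
    - componentName: metrics-exp
```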

The cluster eventually turns to Running:

➜  ~  kbcli cluster describe kafka-lydwaa
Name: kafka-lydwaa   Created Time: Jun 20,2024 15:53 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION       STATUS    TERMINATION-POLICY
default     kafka                kafka-3.3.2   Running   Delete

Endpoints:
COMPONENT   MODE        INTERNAL                                                    EXTERNAL
broker      ReadWrite   kafka-lydwaa-broker-broker.default.svc.cluster.local:9092   <none>

Topology:
COMPONENT     INSTANCE                     ROLE     STATUS    AZ              NODE                                                  CREATED-TIME
broker        kafka-lydwaa-broker-0        <none>   Running   us-central1-c   gke-yjtest-default-pool-36251504-xw1f/10.128.15.202   Jun 20,2024 15:56 UTC+0800
metrics-exp   kafka-lydwaa-metrics-exp-0   <none>   Running   us-central1-c   gke-yjtest-default-pool-36251504-z5l7/10.128.15.197   Jun 20,2024 15:56 UTC+0800

Resources Allocation:
COMPONENT     DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
broker        false       500m / 500m          512Mi / 512Mi           data:1Gi       kb-default-sc
                                                                       metadata:1Gi   kb-default-sc
metrics-exp   false       500m / 500m          512Mi / 512Mi           <none>         <none>

Images:
COMPONENT     TYPE             IMAGE
broker        kafka-server     docker.io/bitnami/kafka:3.3.2-debian-11-r54
metrics-exp   kafka-exporter   docker.io/bitnami/kafka-exporter:1.6.0-debian-11-r67

Show cluster events: kbcli cluster list-events -n default kafka-lydwaa
ahjing99 commented 2 months ago

StarRocks also hit the same error: the pod restarted twice during the cluster restart, causing the ops to fail, while the cluster eventually became Running.

https://github.com/apecloud/kubeblocks/actions/runs/9706734671/job/26791580145

wangyelei commented 2 months ago

Because the setup of one component depends on another component, and when that other component has a single replica, its pod can easily crash repeatedly before eventually succeeding. This scenario is also relatively rare, so let's not fix it for now.
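Since the cluster does eventually become Running, the "wait more time" idea from the report could be sketched as a generic retry loop. This is a hedged sketch only; `poll_until_running` is a hypothetical helper, and its phase-check argument stands in for repeatedly reading the cluster's `.status.phase` (e.g. via `kubectl get cluster kafka-lydwaa -o jsonpath='{.status.phase}'`):

```shell
# Hedged sketch: retry a phase check until it reports "Running" or the
# attempt budget is exhausted, instead of failing on the first crash.
# "$@" is any command that prints the current phase; against a real
# cluster it would be a kubectl/kbcli status query.
poll_until_running() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    phase=$("$@")
    if [ "$phase" = "Running" ]; then
      echo "Running"
      return 0
    fi
    # against a live cluster this would sleep between checks
    i=$((i + 1))
  done
  echo "Failed"
  return 1
}
```

A controller-side fix would presumably live in the ops-request reconciler rather than a shell loop, but the tolerance-for-transient-crashes logic would be the same.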