apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG] KB1.0 cluster vscale failed #8063

Open haowen159 opened 1 month ago

haowen159 commented 1 month ago

Describe the bug
After a vertical scaling request on an etcd cluster (KubeBlocks 1.0.0-alpha.5), the cluster ends up in Failed state: the VerticalScaling OpsRequest stalls at 1/3 and the updated pod etcd-ptyfua-etcd-2 goes into CrashLoopBackOff with an OCI runtime cgroup StartError. The rewritten container carries limits of cpu: 200m and memory: 600m (see the pod description below).

kbcli version
Kubernetes: v1.29.6-gke.1326000
KubeBlocks: 1.0.0-alpha.5
kbcli: 1.0.0-alpha.0

To Reproduce
Steps to reproduce the behavior:

  1. create etcd cluster yaml
    apiVersion: apps.kubeblocks.io/v1alpha1
    kind: Cluster
    metadata:
      name: etcd-ptyfua
      namespace: default
    spec:
      terminationPolicy: WipeOut
      componentSpecs:
        - name: etcd
          componentDef: etcd
          replicas: 3
          resources:
            requests:
              cpu: 100m
              memory: 0.5Gi
            limits:
              cpu: 100m
              memory: 0.5Gi
          volumeClaimTemplates:
            - name: data
              spec:
                storageClassName:
                accessModes:
                  - ReadWriteOnce
                resources:
                  requests:
                    storage: 1Gi
  2. cluster status
    kbcli cluster list  
    NAME          NAMESPACE   CLUSTER-DEFINITION   VERSION   TERMINATION-POLICY   STATUS    CREATED-TIME                 
    etcd-ptyfua   default                                    WipeOut              Running   Aug 30,2024 17:15 UTC+0800  

  3. vscale the cluster (a note on how these values were applied follows the pod description below)

    kbcli cluster vscale etcd-ptyfua --auto-approve --force=true --components etcd --cpu 0.2 --memory 0.6
  4. see error
    (base) kb@192 testinfra % k get pod
    NAME                 READY   STATUS             RESTARTS        AGE
    etcd-ptyfua-etcd-0   2/2     Running            0               10m
    etcd-ptyfua-etcd-1   2/2     Running            0               10m
    etcd-ptyfua-etcd-2   1/2     CrashLoopBackOff   5 (2m41s ago)   6m4s
    (base) kb@192 testinfra % k get cluster
    NAME          CLUSTER-DEFINITION   VERSION   TERMINATION-POLICY   STATUS   AGE
    etcd-ptyfua                                  WipeOut              Failed   10m
    (base) kb@192 testinfra % k get ops
    NAME                                TYPE              CLUSTER       STATUS   PROGRESS   AGE
    etcd-ptyfua-verticalscaling-47d2f   VerticalScaling   etcd-ptyfua   Failed   1/3        6m46s
  5. describe pod

    k describe pod etcd-ptyfua-etcd-2
    Name:             etcd-ptyfua-etcd-2
    Namespace:        default
    Priority:         0
    Service Account:  kb-etcd-ptyfua
    Node:             gke-dhtest-gke-dhtest-gke-05a50c4d-dzqd/10.128.0.36
    Start Time:       Fri, 30 Aug 2024 17:19:46 +0800
    Labels:           app.kubernetes.io/component=etcd
                  app.kubernetes.io/instance=etcd-ptyfua
                  app.kubernetes.io/managed-by=kubeblocks
                  app.kubernetes.io/name=etcd
                  app.kubernetes.io/version=etcd
                  apps.kubeblocks.io/cluster-uid=1a63a11f-eb94-42f8-a192-e0710d3243ee
                  apps.kubeblocks.io/component-name=etcd
                  apps.kubeblocks.io/pod-name=etcd-ptyfua-etcd-2
                  componentdefinition.kubeblocks.io/name=etcd
                  controller-revision-hash=58bc8954c9
                  workloads.kubeblocks.io/instance=etcd-ptyfua-etcd
                  workloads.kubeblocks.io/managed-by=InstanceSet
    Annotations:      apps.kubeblocks.io/component-replicas: 3
    Status:           Running
    IP:               10.0.6.99
    IPs:
    IP:           10.0.6.99
    Controlled By:  InstanceSet/etcd-ptyfua-etcd
    Init Containers:
    inject-shell:
    Container ID:  containerd://b15770ef403e72456d973b81e6800b045f6a12f8d8e31c7d3ad754b612a35ca8
    Image:         docker.io/busybox:1.35-musl
    Image ID:      docker.io/library/busybox@sha256:eaa51c8ca08bd769af7acc4e9748c01db3d0b8da22f35e55ce9199f980e8deda
    Port:          <none>
    Host Port:     <none>
    Command:
      bin/sh
      -c
      #!/bin/sh
    
      # inject shell if needed
    
      busyboxAction() {
        # copy sh to /shell in order to adapt distroless entrypoint
        cp /bin/sh /shell
      }
    
      distrolessAction() {
        echo "etcd image build with distroless, injecting brinaries in order to run scripts"
        cp /bin/* /shell
      }
    
      # versionCheck only check image type but not availability
      checkVersionAndInject() {
        local version=$1
        echo "$version" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
        if [ $? -ne 0 ]; then
          echo "Invalid version format, check vars ETCD_VERSION"
          exit 1
        fi
    
        versionParse=$(echo "$version" | sed 's/^v//')
        major=$(echo "$versionParse" | cut -d. -f1)
        minor=$(echo "$versionParse" | cut -d. -f2)
        patch=$(echo "$versionParse" | cut -d. -f3)
    
        # <=3.3 || <= 3.4.22 || <=3.5.6 all use busybox https://github.com/etcd-io/etcd/tree/main/CHANGELOG
        if [ $major -lt 3 ] || ([ $major -eq 3 ] && [ $minor -le 3 ]); then
          busyboxAction
        elif [ $major -eq 3 ] && [ $minor -eq 4 ] && [ $patch -le 22 ]; then
          busyboxAction
        elif [ $major -eq 3 ] && [ $minor -eq 5 ] && [ $patch -le 6 ]; then
          busyboxAction
        else
          distrolessAction
        fi
      }
    
      checkVersionAndInject $ETCD_VERSION
    
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 30 Aug 2024 17:19:50 +0800
      Finished:     Fri, 30 Aug 2024 17:19:51 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:     0
      memory:  0
    Environment Variables from:
      etcd-ptyfua-etcd-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:   etcd-ptyfua-etcd-2 (v1:metadata.name)
      KB_POD_UID:     (v1:metadata.uid)
      KB_NAMESPACE:  default (v1:metadata.namespace)
      KB_SA_NAME:     (v1:spec.serviceAccountName)
      KB_NODENAME:    (v1:spec.nodeName)
      KB_HOST_IP:     (v1:status.hostIP)
      KB_POD_IP:      (v1:status.podIP)
      KB_POD_IPS:     (v1:status.podIPs)
      KB_HOSTIP:      (v1:status.hostIP)
      KB_PODIP:       (v1:status.podIP)
      KB_PODIPS:      (v1:status.podIPs)
      KB_POD_FQDN:   $(KB_POD_NAME).etcd-ptyfua-etcd-headless.$(KB_NAMESPACE).svc
    Mounts:
      /shell from shell (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bmht2 (ro)
    init-kbagent:
    Container ID:  containerd://da6d171d4afedc7062f255be07a0ab989162e3cf7e40728fe456aee3c1bf1700
    Image:         docker.io/apecloud/kubeblocks-tools:1.0.0-alpha.5
    Image ID:      docker.io/apecloud/kubeblocks-tools@sha256:998b35a1fad892199d739d7d7bf52009089ef690897d60979b79e078ebacaecc
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
      -r
      /bin/kbagent
      /bin/curl
      /kubeblocks/
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 30 Aug 2024 17:19:54 +0800
      Finished:     Fri, 30 Aug 2024 17:19:54 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:     0
      memory:  0
    Environment Variables from:
      etcd-ptyfua-etcd-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:   etcd-ptyfua-etcd-2 (v1:metadata.name)
      KB_POD_UID:     (v1:metadata.uid)
      KB_NAMESPACE:  default (v1:metadata.namespace)
      KB_SA_NAME:     (v1:spec.serviceAccountName)
      KB_NODENAME:    (v1:spec.nodeName)
      KB_HOST_IP:     (v1:status.hostIP)
      KB_POD_IP:      (v1:status.podIP)
      KB_POD_IPS:     (v1:status.podIPs)
      KB_HOSTIP:      (v1:status.hostIP)
      KB_PODIP:       (v1:status.podIP)
      KB_PODIPS:      (v1:status.podIPs)
      KB_POD_FQDN:   $(KB_POD_NAME).etcd-ptyfua-etcd-headless.$(KB_NAMESPACE).svc
    Mounts:
      /kubeblocks from kubeblocks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bmht2 (ro)
    Containers:
    etcd:
    Container ID:  containerd://79dbf77ab0970656dca6a39d3462dbc0550e20137e61971d278283c6c38f22cf
    Image:         docker.io/apecloud/etcd:v3.5.15
    Image ID:      docker.io/apecloud/etcd@sha256:0934690612905554eb61ddefb9faaaecb47c2f6931dbb453e694358092ee8990
    Ports:         2379/TCP, 2380/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /shell/sh
      -c
      export PATH=$PATH:/shell
      # for convenient to use the same entrypoint
      if [ ! -e /bin/sh ]; then
        cp /shell/sh /bin
      fi
      /scripts/start.sh
    
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod053fc1c3_d474_460e_9f3c_88f7a763d2b9.slice/cri-containerd-79dbf77ab0970656dca6a39d3462dbc0550e20137e61971d278283c6c38f22cf.scope/cgroup.controllers: no such file or directory: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 08:00:00 +0800
      Finished:     Fri, 30 Aug 2024 17:21:38 +0800
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     200m
      memory:  600m
    Requests:
      cpu:     200m
      memory:  600m
    Environment Variables from:
      etcd-ptyfua-etcd-env      ConfigMap  Optional: false
      etcd-ptyfua-etcd-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:   etcd-ptyfua-etcd-2 (v1:metadata.name)
      KB_POD_UID:     (v1:metadata.uid)
      KB_NAMESPACE:  default (v1:metadata.namespace)
      KB_SA_NAME:     (v1:spec.serviceAccountName)
      KB_NODENAME:    (v1:spec.nodeName)
      KB_HOST_IP:     (v1:status.hostIP)
      KB_POD_IP:      (v1:status.podIP)
      KB_POD_IPS:     (v1:status.podIPs)
      KB_HOSTIP:      (v1:status.hostIP)
      KB_PODIP:       (v1:status.podIP)
      KB_PODIPS:      (v1:status.podIPs)
      KB_POD_FQDN:   $(KB_POD_NAME).etcd-ptyfua-etcd-headless.$(KB_NAMESPACE).svc
    Mounts:
      /etc/etcd from config (rw)
      /scripts from scripts (rw)
      /shell from shell (rw)
      /var/run/etcd from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bmht2 (ro)
    kbagent:
    Container ID:  containerd://d18767b34fde6bdee6fb0ef334ead00603c3aea8b78e39c2ced53e9c09d798ad
    Image:         docker.io/apecloud/etcd:v3.5.6
    Image ID:      docker.io/apecloud/etcd@sha256:28cb0630cb8536504f9bd547c3e63e608242c40dbffb1464c892d8d59fd3da44
    Port:          3501/TCP
    Host Port:     0/TCP
    Command:
      /kubeblocks/kbagent
    Args:
      --port
      3501
    State:          Running
      Started:      Fri, 30 Aug 2024 17:19:56 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:     0
      memory:  0
    Startup:   tcp-socket :3501 delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      etcd-ptyfua-etcd-env      ConfigMap  Optional: false
      etcd-ptyfua-etcd-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:      etcd-ptyfua-etcd-2 (v1:metadata.name)
      KB_POD_UID:        (v1:metadata.uid)
      KB_NAMESPACE:     default (v1:metadata.namespace)
      KB_SA_NAME:        (v1:spec.serviceAccountName)
      KB_NODENAME:       (v1:spec.nodeName)
      KB_HOST_IP:        (v1:status.hostIP)
      KB_POD_IP:         (v1:status.podIP)
      KB_POD_IPS:        (v1:status.podIPs)
      KB_HOSTIP:         (v1:status.hostIP)
      KB_PODIP:          (v1:status.podIP)
      KB_PODIPS:         (v1:status.podIPs)
      KB_POD_FQDN:      $(KB_POD_NAME).etcd-ptyfua-etcd-headless.$(KB_NAMESPACE).svc
      CLUSTER_DOMAIN:   .cluster.local
      KB_AGENT_ACTION:  [{"name":"switchover","exec":{"command":["/bin/sh","-c","set -ex\n  #!/bin/sh\n  \n  # config file used to bootstrap the etcd cluster\n  configFile=$TMP_CONFIG_PATH\n  \n  checkBackupFile() {\n    local backupFile=$1\n    output=$(etcdutl snapshot status ${backupFile})\n    # check if the command was successful\n    if [ $? -ne 0 ]; then\n      echo \"ERROR: Failed to check the backup file with etcdutl\"\n      exit 1\n    fi\n    # extract the total key from the output\n    totalKey=$(echo $output | awk -F', ' '{print $3}')\n    # check if total key is a number\n    case $totalKey in\n      *[!0-9]*)\n        echo \"ERROR: snapshot totalKey is not a valid number.\"\n        exit 1\n        ;;\n    esac\n  \n    # define a threshold to check if the total key count is too low\n    # consider increasing this value when dealing with production-grade etcd cluster\n    threshold=$BACKUP_KEY_THRESHOLD #[modifiable]\n    if [ \"$totalKey\" -lt $threshold ]; then\n      echo \"WARNING: snapshot totalKey is less than the threshold\"\n      exit 1\n    fi\n  }\n  \n  getClientProtocol() {\n    # check client tls if is enabled\n    line=$(grep 'advertise-client-urls' ${configFile})\n    if echo $line | grep -q 'https'; then\n      echo \"https\"\n    elif echo $line | grep -q 'http'; then\n      echo \"http\"\n    fi\n  }\n  \n  getPeerProtocol() {\n    # check peer tls if is enabled\n    line=$(grep 'initial-advertise-peer-urls' ${configFile})\n    if echo $line | grep -q 'https'; then\n      echo \"https\"\n    elif echo $line | grep -q 'http'; then\n      echo \"http\"\n    fi\n  }\n  \n  execEtcdctl() {\n    local endpoints=$1\n    shift\n    clientProtocol=$(getClientProtocol)\n    tlsDir=$TLS_MOUNT_PATH\n    # check if the clientProtocol is https and the tlsDir is not empty\n    if [ $clientProtocol = \"https\" ] \u0026\u0026 [ -d \"$tlsDir\" ] \u0026\u0026 [ -s \"${tlsDir}/ca.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.key\" ]; then\n      etcdctl --endpoints=${endpoints} --cacert=${tlsDir}/ca.crt --cert=${tlsDir}/tls.crt --key=${tlsDir}/tls.key \"$@\"\n    elif [ $clientProtocol = \"http\" ]; then\n      etcdctl --endpoints=${endpoints} \"$@\"\n    else\n      echo \"ERROR: bad etcdctl args: clientProtocol:${clientProtocol}, endpoints:${endpoints}, tlsDir:${tlsDir}, please check!\"\n      exit 1\n    fi\n    # check if the etcdctl command was successful\n    if [ $? -ne 0 ]; then\n      echo \"etcdctl command failed\"\n      exit 1\n    fi\n  }\n  \n  # this function will be deprecated in the future\n  execEtcdctlNoCheckTLS() {\n    local endpoints=$1\n    shift\n    etcdctl --endpoints=${endpoints} \"$@\"\n    # check if the etcdctl command was successful\n    if [ $? 
-ne 0 ]; then\n      echo \"etcdctl command failed\"\n      exit 1\n    fi\n  }\n  \n  updateLeaderIfNeeded() {\n    local retries=$1\n  \n    if [ $retries -le 0 ]; then\n      echo \"Maximum number of retries reached, leader is not ready\"\n      exit 1\n    fi\n  \n    status=$(execEtcdctlNoCheckTLS ${leaderEndpoint} endpoint status)\n    isLeader=$(echo $status | awk -F ', ' '{print $5}')\n    if [ \"$isLeader\" = \"false\" ]; then\n      echo \"leader out of status, try to redirect to new leader\"\n      peerEndpoints=$(execEtcdctlNoCheckTLS \"$leaderEndpoint\" member list | awk -F', ' '{print $5}' | tr '\\n' ',' | sed 's#,$##')\n      leaderEndpoint=$(execEtcdctlNoCheckTLS \"$peerEndpoints\" endpoint status | awk -F', ' '$5==\"true\" {print $1}')\n      if [ $leaderEndpoint = \"\" ]; then\n        echo \"leader is not ready, wait for 2s...\"\n        sleep 2\n        updateLeaderIfNeeded $(expr $retries - 1)\n      fi\n    fi\n  }\n  #!/bin/sh\n  \n  switchoverWithCandidate() {\n    leaderEndpoint=${LEADER_POD_FQDN}:2379\n    candidateEndpoint=${KB_SWITCHOVER_CANDIDATE_FQDN}:2379\n    \n    # see common.sh, this function may change leaderEndpoint\n    updateLeaderIfNeeded 3\n    \n    if [ \"$leaderEndpoint\" = \"$candidateEndpoint\" ]; then\n      echo \"leader is the same as candidate, no need to switch\"\n      exit 0\n    fi\n    \n    candidateID=$(execEtcdctlNoCheckTLS ${candidateEndpoint} endpoint status | awk -F', ' '{print $2}')\n    execEtcdctlNoCheckTLS ${leaderEndpoint} move-leader $candidateID\n    \n    status=$(execEtcdctlNoCheckTLS ${candidateEndpoint} endpoint status)\n    isLeader=$(echo ${status} | awk -F ', ' '{print $5}')\n    \n    if [ \"$isLeader\" = \"true\" ]; then\n      echo \"switchover successfully\"\n    else\n      echo \"switchover failed, please check!\"\n      exit 1\n    fi\n  }\n  \n  switchoverWithoutCandidate() {\n    leaderEndpoint=${LEADER_POD_FQDN}:2379\n    oldLeaderEndpoint=$leaderEndpoint\n    \n    # see common.sh, this function may change leaderEndpoint\n    updateLeaderIfNeeded 3\n    \n    if [ \"$oldLeaderEndpoint\" != \"$leaderEndpoint\" ]; then\n      echo \"leader already changed, no need to switch\"\n      exit 0\n    fi\n    \n    leaderID=$(execEtcdctlNoCheckTLS ${leaderEndpoint} endpoint status | awk -F', ' '{print $2}')\n    peerIDs=$(execEtcdctlNoCheckTLS ${leaderEndpoint} member list | awk -F', ' '{print $1}')\n    randomCandidateID=$(echo \"$peerIDs\" | grep -v \"$leaderID\" | awk 'NR==1')\n    \n    if [ -z \"$randomCandidateID\" ]; then\n      echo \"no candidate found\"\n      exit 1\n    fi\n    \n    execEtcdctlNoCheckTLS $leaderEndpoint move-leader $randomCandidateID\n    \n    status=$(execEtcdctlNoCheckTLS $leaderEndpoint endpoint status)\n    isLeader=$(echo $status | awk -F ', ' '{print $5}')\n    \n    if [ \"$isLeader\" = \"false\" ]; then\n      echo \"switchover successfully\"\n    else\n      echo \"switchover failed, please check!\"\n      exit 1\n    fi\n  }\n  \n\nif [ -z \"$KB_SWITCHOVER_CANDIDATE_FQDN\" ]; then\n    switchoverWithoutCandidate\nelse\n    switchoverWithCandidate\nfi\n"]}},{"name":"memberJoin","exec":{"command":["/bin/sh","-c","#!/bin/sh\n\n# config file used to bootstrap the etcd cluster\nconfigFile=$TMP_CONFIG_PATH\n\ncheckBackupFile() {\n  local backupFile=$1\n  output=$(etcdutl snapshot status ${backupFile})\n  # check if the command was successful\n  if [ $? 
-ne 0 ]; then\n    echo \"ERROR: Failed to check the backup file with etcdutl\"\n    exit 1\n  fi\n  # extract the total key from the output\n  totalKey=$(echo $output | awk -F', ' '{print $3}')\n  # check if total key is a number\n  case $totalKey in\n    *[!0-9]*)\n      echo \"ERROR: snapshot totalKey is not a valid number.\"\n      exit 1\n      ;;\n  esac\n\n  # define a threshold to check if the total key count is too low\n  # consider increasing this value when dealing with production-grade etcd cluster\n  threshold=$BACKUP_KEY_THRESHOLD #[modifiable]\n  if [ \"$totalKey\" -lt $threshold ]; then\n    echo \"WARNING: snapshot totalKey is less than the threshold\"\n    exit 1\n  fi\n}\n\ngetClientProtocol() {\n  # check client tls if is enabled\n  line=$(grep 'advertise-client-urls' ${configFile})\n  if echo $line | grep -q 'https'; then\n    echo \"https\"\n  elif echo $line | grep -q 'http'; then\n    echo \"http\"\n  fi\n}\n\ngetPeerProtocol() {\n  # check peer tls if is enabled\n  line=$(grep 'initial-advertise-peer-urls' ${configFile})\n  if echo $line | grep -q 'https'; then\n    echo \"https\"\n  elif echo $line | grep -q 'http'; then\n    echo \"http\"\n  fi\n}\n\nexecEtcdctl() {\n  local endpoints=$1\n  shift\n  clientProtocol=$(getClientProtocol)\n  tlsDir=$TLS_MOUNT_PATH\n  # check if the clientProtocol is https and the tlsDir is not empty\n  if [ $clientProtocol = \"https\" ] \u0026\u0026 [ -d \"$tlsDir\" ] \u0026\u0026 [ -s \"${tlsDir}/ca.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.key\" ]; then\n    etcdctl --endpoints=${endpoints} --cacert=${tlsDir}/ca.crt --cert=${tlsDir}/tls.crt --key=${tlsDir}/tls.key \"$@\"\n  elif [ $clientProtocol = \"http\" ]; then\n    etcdctl --endpoints=${endpoints} \"$@\"\n  else\n    echo \"ERROR: bad etcdctl args: clientProtocol:${clientProtocol}, endpoints:${endpoints}, tlsDir:${tlsDir}, please check!\"\n    exit 1\n  fi\n  # check if the etcdctl command was successful\n  if [ $? -ne 0 ]; then\n    echo \"etcdctl command failed\"\n    exit 1\n  fi\n}\n\n# this function will be deprecated in the future\nexecEtcdctlNoCheckTLS() {\n  local endpoints=$1\n  shift\n  etcdctl --endpoints=${endpoints} \"$@\"\n  # check if the etcdctl command was successful\n  if [ $? 
-ne 0 ]; then\n    echo \"etcdctl command failed\"\n    exit 1\n  fi\n}\n\nupdateLeaderIfNeeded() {\n  local retries=$1\n\n  if [ $retries -le 0 ]; then\n    echo \"Maximum number of retries reached, leader is not ready\"\n    exit 1\n  fi\n\n  status=$(execEtcdctlNoCheckTLS ${leaderEndpoint} endpoint status)\n  isLeader=$(echo $status | awk -F ', ' '{print $5}')\n  if [ \"$isLeader\" = \"false\" ]; then\n    echo \"leader out of status, try to redirect to new leader\"\n    peerEndpoints=$(execEtcdctlNoCheckTLS \"$leaderEndpoint\" member list | awk -F', ' '{print $5}' | tr '\\n' ',' | sed 's#,$##')\n    leaderEndpoint=$(execEtcdctlNoCheckTLS \"$peerEndpoints\" endpoint status | awk -F', ' '$5==\"true\" {print $1}')\n    if [ $leaderEndpoint = \"\" ]; then\n      echo \"leader is not ready, wait for 2s...\"\n      sleep 2\n      updateLeaderIfNeeded $(expr $retries - 1)\n    fi\n  fi\n}\n#!/bin/sh\n\nset -exo pipefail\necho \"etcd member join...\"\n# TODO\n"]}},{"name":"memberLeave","exec":{"command":["/bin/sh","-c","#!/bin/sh\n\n# config file used to bootstrap the etcd cluster\nconfigFile=$TMP_CONFIG_PATH\n\ncheckBackupFile() {\n  local backupFile=$1\n  output=$(etcdutl snapshot status ${backupFile})\n  # check if the command was successful\n  if [ $? -ne 0 ]; then\n    echo \"ERROR: Failed to check the backup file with etcdutl\"\n    exit 1\n  fi\n  # extract the total key from the output\n  totalKey=$(echo $output | awk -F', ' '{print $3}')\n  # check if total key is a number\n  case $totalKey in\n    *[!0-9]*)\n      echo \"ERROR: snapshot totalKey is not a valid number.\"\n      exit 1\n      ;;\n  esac\n\n  # define a threshold to check if the total key count is too low\n  # consider increasing this value when dealing with production-grade etcd cluster\n  threshold=$BACKUP_KEY_THRESHOLD #[modifiable]\n  if [ \"$totalKey\" -lt $threshold ]; then\n    echo \"WARNING: snapshot totalKey is less than the threshold\"\n    exit 1\n  fi\n}\n\ngetClientProtocol() {\n  # check client tls if is enabled\n  line=$(grep 'advertise-client-urls' ${configFile})\n  if echo $line | grep -q 'https'; then\n    echo \"https\"\n  elif echo $line | grep -q 'http'; then\n    echo \"http\"\n  fi\n}\n\ngetPeerProtocol() {\n  # check peer tls if is enabled\n  line=$(grep 'initial-advertise-peer-urls' ${configFile})\n  if echo $line | grep -q 'https'; then\n    echo \"https\"\n  elif echo $line | grep -q 'http'; then\n    echo \"http\"\n  fi\n}\n\nexecEtcdctl() {\n  local endpoints=$1\n  shift\n  clientProtocol=$(getClientProtocol)\n  tlsDir=$TLS_MOUNT_PATH\n  # check if the clientProtocol is https and the tlsDir is not empty\n  if [ $clientProtocol = \"https\" ] \u0026\u0026 [ -d \"$tlsDir\" ] \u0026\u0026 [ -s \"${tlsDir}/ca.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.key\" ]; then\n    etcdctl --endpoints=${endpoints} --cacert=${tlsDir}/ca.crt --cert=${tlsDir}/tls.crt --key=${tlsDir}/tls.key \"$@\"\n  elif [ $clientProtocol = \"http\" ]; then\n    etcdctl --endpoints=${endpoints} \"$@\"\n  else\n    echo \"ERROR: bad etcdctl args: clientProtocol:${clientProtocol}, endpoints:${endpoints}, tlsDir:${tlsDir}, please check!\"\n    exit 1\n  fi\n  # check if the etcdctl command was successful\n  if [ $? 
-ne 0 ]; then\n    echo \"etcdctl command failed\"\n    exit 1\n  fi\n}\n\n# this function will be deprecated in the future\nexecEtcdctlNoCheckTLS() {\n  local endpoints=$1\n  shift\n  etcdctl --endpoints=${endpoints} \"$@\"\n  # check if the etcdctl command was successful\n  if [ $? -ne 0 ]; then\n    echo \"etcdctl command failed\"\n    exit 1\n  fi\n}\n\nupdateLeaderIfNeeded() {\n  local retries=$1\n\n  if [ $retries -le 0 ]; then\n    echo \"Maximum number of retries reached, leader is not ready\"\n    exit 1\n  fi\n\n  status=$(execEtcdctlNoCheckTLS ${leaderEndpoint} endpoint status)\n  isLeader=$(echo $status | awk -F ', ' '{print $5}')\n  if [ \"$isLeader\" = \"false\" ]; then\n    echo \"leader out of status, try to redirect to new leader\"\n    peerEndpoints=$(execEtcdctlNoCheckTLS \"$leaderEndpoint\" member list | awk -F', ' '{print $5}' | tr '\\n' ',' | sed 's#,$##')\n    leaderEndpoint=$(execEtcdctlNoCheckTLS \"$peerEndpoints\" endpoint status | awk -F', ' '$5==\"true\" {print $1}')\n    if [ $leaderEndpoint = \"\" ]; then\n      echo \"leader is not ready, wait for 2s...\"\n      sleep 2\n      updateLeaderIfNeeded $(expr $retries - 1)\n    fi\n  fi\n}\n#!/bin/sh\nset -ex\nendpoints=$(echo $KB_MEMBER_ADDRESSES | tr ',' '\\n')\nleaverEndpoint=$(echo \"$endpoints\" | grep $KB_LEAVE_MEMBER_POD_NAME)\n\nif [ $leaverEndpoint = \"\" ]; then\n  echo \"ERROR: leave member pod name not found in member addresses\"\n  exit 1\nfi\n\nETCDID=$(execEtcdctl $leaverEndpoint endpoint status | awk -F', ' '{print $2}')\nexecEtcdctl $KB_MEMBER_ADDRESSES member remove $ETCDID\n"]}},{"name":"roleProbe","exec":{"command":["/bin/sh","-c","#!/bin/sh\n\n# config file used to bootstrap the etcd cluster\nconfigFile=$TMP_CONFIG_PATH\n\ncheckBackupFile() {\n  local backupFile=$1\n  output=$(etcdutl snapshot status ${backupFile})\n  # check if the command was successful\n  if [ $? 
-ne 0 ]; then\n    echo \"ERROR: Failed to check the backup file with etcdutl\"\n    exit 1\n  fi\n  # extract the total key from the output\n  totalKey=$(echo $output | awk -F', ' '{print $3}')\n  # check if total key is a number\n  case $totalKey in\n    *[!0-9]*)\n      echo \"ERROR: snapshot totalKey is not a valid number.\"\n      exit 1\n      ;;\n  esac\n\n  # define a threshold to check if the total key count is too low\n  # consider increasing this value when dealing with production-grade etcd cluster\n  threshold=$BACKUP_KEY_THRESHOLD #[modifiable]\n  if [ \"$totalKey\" -lt $threshold ]; then\n    echo \"WARNING: snapshot totalKey is less than the threshold\"\n    exit 1\n  fi\n}\n\ngetClientProtocol() {\n  # check client tls if is enabled\n  line=$(grep 'advertise-client-urls' ${configFile})\n  if echo $line | grep -q 'https'; then\n    echo \"https\"\n  elif echo $line | grep -q 'http'; then\n    echo \"http\"\n  fi\n}\n\ngetPeerProtocol() {\n  # check peer tls if is enabled\n  line=$(grep 'initial-advertise-peer-urls' ${configFile})\n  if echo $line | grep -q 'https'; then\n    echo \"https\"\n  elif echo $line | grep -q 'http'; then\n    echo \"http\"\n  fi\n}\n\nexecEtcdctl() {\n  local endpoints=$1\n  shift\n  clientProtocol=$(getClientProtocol)\n  tlsDir=$TLS_MOUNT_PATH\n  # check if the clientProtocol is https and the tlsDir is not empty\n  if [ $clientProtocol = \"https\" ] \u0026\u0026 [ -d \"$tlsDir\" ] \u0026\u0026 [ -s \"${tlsDir}/ca.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.crt\" ] \u0026\u0026 [ -s \"${tlsDir}/tls.key\" ]; then\n    etcdctl --endpoints=${endpoints} --cacert=${tlsDir}/ca.crt --cert=${tlsDir}/tls.crt --key=${tlsDir}/tls.key \"$@\"\n  elif [ $clientProtocol = \"http\" ]; then\n    etcdctl --endpoints=${endpoints} \"$@\"\n  else\n    echo \"ERROR: bad etcdctl args: clientProtocol:${clientProtocol}, endpoints:${endpoints}, tlsDir:${tlsDir}, please check!\"\n    exit 1\n  fi\n  # check if the etcdctl command was successful\n  if [ $? -ne 0 ]; then\n    echo \"etcdctl command failed\"\n    exit 1\n  fi\n}\n\n# this function will be deprecated in the future\nexecEtcdctlNoCheckTLS() {\n  local endpoints=$1\n  shift\n  etcdctl --endpoints=${endpoints} \"$@\"\n  # check if the etcdctl command was successful\n  if [ $? 
-ne 0 ]; then\n    echo \"etcdctl command failed\"\n    exit 1\n  fi\n}\n\nupdateLeaderIfNeeded() {\n  local retries=$1\n\n  if [ $retries -le 0 ]; then\n    echo \"Maximum number of retries reached, leader is not ready\"\n    exit 1\n  fi\n\n  status=$(execEtcdctlNoCheckTLS ${leaderEndpoint} endpoint status)\n  isLeader=$(echo $status | awk -F ', ' '{print $5}')\n  if [ \"$isLeader\" = \"false\" ]; then\n    echo \"leader out of status, try to redirect to new leader\"\n    peerEndpoints=$(execEtcdctlNoCheckTLS \"$leaderEndpoint\" member list | awk -F', ' '{print $5}' | tr '\\n' ',' | sed 's#,$##')\n    leaderEndpoint=$(execEtcdctlNoCheckTLS \"$peerEndpoints\" endpoint status | awk -F', ' '$5==\"true\" {print $1}')\n    if [ $leaderEndpoint = \"\" ]; then\n      echo \"leader is not ready, wait for 2s...\"\n      sleep 2\n      updateLeaderIfNeeded $(expr $retries - 1)\n    fi\n  fi\n}\n#!/bin/sh\n\nstatus=$(execEtcdctl 127.0.0.1:2379 endpoint status --command-timeout=300ms --dial-timeout=100m)\nIsLeader=$(echo $status | awk -F ', ' '{print $5}')\nIsLearner=$(echo $status | awk -F ', ' '{print $6}')\n\nif [ \"true\" = \"$IsLeader\" ]; then\n  echo -n \"leader\"\nelif [ \"true\" = \"$IsLearner\" ]; then\n  echo -n \"learner\"\nelif [ \"false\" = \"$IsLeader\" ] \u0026\u0026 [ \"false\" = \"$IsLearner\" ]; then\n  echo -n \"follower\"\nelse\n  echo -n \"bad role, please check!\"\n  exit 1\nfi\n"]}}]
      KB_AGENT_PROBE:   [{"action":"roleProbe"}]
    Mounts:
      /kubeblocks from kubeblocks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bmht2 (ro)
    Conditions:
    Type                        Status
    PodReadyToStartContainers   True 
    Initialized                 True 
    Ready                       False 
    ContainersReady             False 
    PodScheduled                True 
    Volumes:
    shell:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
    config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      etcd-ptyfua-etcd-etcd-configuration-tpl
    Optional:  false
    scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      etcd-ptyfua-etcd-etcd-scripts
    Optional:  false
    data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-etcd-ptyfua-etcd-2
    ReadOnly:   false
    kubeblocks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
    kube-api-access-bmht2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    QoS Class:                   Burstable
    Node-Selectors:              <none>
    Tolerations:                 kb-data=true:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
    Type     Reason     Age                   From               Message
    ----     ------     ----                  ----               -------
    Normal   Scheduled  2m27s                 default-scheduler  Successfully assigned default/etcd-ptyfua-etcd-2 to gke-dhtest-gke-dhtest-gke-05a50c4d-dzqd
    Normal   Pulled     2m23s                 kubelet            Container image "docker.io/busybox:1.35-musl" already present on machine
    Normal   Created    2m23s                 kubelet            Created container inject-shell
    Normal   Started    2m23s                 kubelet            Started container inject-shell
    Normal   Pulled     2m19s                 kubelet            Container image "docker.io/apecloud/kubeblocks-tools:1.0.0-alpha.5" already present on machine
    Normal   Created    2m19s                 kubelet            Created container init-kbagent
    Normal   Started    2m19s                 kubelet            Started container init-kbagent
    Normal   Started    2m17s                 kubelet            Started container kbagent
    Normal   Pulled     2m17s                 kubelet            Container image "docker.io/apecloud/etcd:v3.5.6" already present on machine
    Normal   Created    2m17s                 kubelet            Created container kbagent
    Warning  Failed     116s (x3 over 2m17s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod053fc1c3_d474_460e_9f3c_88f7a763d2b9.slice/cri-containerd-etcd.scope/cgroup.controllers: no such file or directory: unknown
    Warning  BackOff    102s (x5 over 2m15s)  kubelet            Back-off restarting failed container etcd in pod etcd-ptyfua-etcd-2_default(053fc1c3-d474-460e-9f3c-88f7a763d2b9)
    Normal   Created    91s (x4 over 2m18s)   kubelet            Created container etcd
    Normal   Pulled     91s (x4 over 2m18s)   kubelet            Container image "docker.io/apecloud/etcd:v3.5.15" already present on machine
    Normal   roleProbe  77s                   kbagent            {"probe":"roleProbe","code":-1,"message":"grep: /var/run/etcd/etcd.conf: No such file or directory\n/bin/sh: 59: [: =: unexpected operator\n/bin/sh: 61: [: =: unexpected operator\n: failed"}
    Normal   roleProbe  17s                   kbagent            {"probe":"roleProbe","code":-1,"message":"grep: /var/run/etcd/etcd.conf: No such file or directory\n/bin/sh: 59: [: =: unexpected operator\n/bin/sh: 61: [: =: unexpected operator\n: failed"}
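
Note on the resource values (referenced from step 3): the pod description above shows the scaling landed as requests/limits of cpu: 200m and memory: 600m. The unit-less `--cpu 0.2` maps to a valid 200m (0.2 cores), but the unit-less `--memory 0.6` becomes the quantity 600m, i.e. 0.6 bytes rather than 0.6 Gi, which may be why the OCI runtime fails to apply the cgroup configuration. A minimal sketch of the same command with explicit units, assuming kbcli accepts standard Kubernetes quantities for --cpu and --memory (not verified on this cluster):

    # the describe output above shows the applied limits:
    #   cpu:     200m
    #   memory:  600m    <- 600 milli-bytes (0.6 bytes), not 600 MiB or 0.6 GiB
    # sketch: spell out the units so the quantities are unambiguous
    kbcli cluster vscale etcd-ptyfua --auto-approve --force=true --components etcd --cpu 200m --memory 600Mi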

Expected behavior
The VerticalScaling OpsRequest completes and all three etcd pods return to Running with the requested cpu and memory.

github-actions[bot] commented 1 day ago

This issue has been marked as stale because it has been open for 30 days with no activity.