Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes
https://altinity.com
Apache License 2.0

A clickhouse-keeper node cannot come up after installation #1471

Open · liubo-it opened this issue 1 month ago

liubo-it commented 1 month ago

(screenshot attached as image)

liubo-it commented 1 month ago

@Slach Can you help me? I'm following the documentation

Slach commented 1 month ago

Please stop sharing text as images; this is mental degradation.

Which instruction did you follow exactly? Share the link.

liubo-it commented 1 month ago

> Please stop sharing text as images; this is mental degradation.
>
> Which instruction did you follow exactly? Share the link.

Sorry, I followed the document below to deploy clickhouse-keeper. I get an error when I start the clickhouse-keeper-02 pod.

error

2024.08.03 05:14:40.671867 [ 22 ] {} <Debug> KeeperSnapshotManagerS3: Shutting down KeeperSnapshotManagerS3
2024.08.03 05:14:40.671899 [ 22 ] {} <Information> KeeperSnapshotManagerS3: KeeperSnapshotManagerS3 shut down
2024.08.03 05:14:40.671911 [ 22 ] {} <Debug> KeeperDispatcher: Dispatcher shut down
2024.08.03 05:14:40.672404 [ 22 ] {} <Error> Application: Code: 568. DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>). (RAFT_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e42fdb in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<char const (&) [88]>(int, char const (&) [88]) @ 0x000000000086a740 in /usr/bin/clickhouse-keeper
2. DB::KeeperStateManager::parseServersConfiguration(Poco::Util::AbstractConfiguration const&, bool, bool) const @ 0x0000000000869595 in /usr/bin/clickhouse-keeper
3. DB::KeeperStateManager::KeeperStateManager(int, String const&, String const&, Poco::Util::AbstractConfiguration const&, std::shared_ptr<DB::CoordinationSettings> const&, std::shared_ptr<DB::KeeperContext>) @ 0x000000000086b08b in /usr/bin/clickhouse-keeper
4. DB::KeeperServer::KeeperServer(std::shared_ptr<DB::KeeperConfigurationAndSettings> const&, Poco::Util::AbstractConfiguration const&, ConcurrentBoundedQueue<DB::KeeperStorage::ResponseForSession>&, ConcurrentBoundedQueue<DB::CreateSnapshotTask>&, std::shared_ptr<DB::KeeperContext>, DB::KeeperSnapshotManagerS3&, std::function<void (unsigned long, DB::KeeperStorage::RequestForSession const&)>) @ 0x0000000000802bc1 in /usr/bin/clickhouse-keeper
5. DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool, bool, std::shared_ptr<DB::Macros const> const&) @ 0x00000000007e81c6 in /usr/bin/clickhouse-keeper
6. DB::Context::initializeKeeperDispatcher(bool) const @ 0x0000000000a5bb06 in /usr/bin/clickhouse-keeper
7. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000000b771e9 in /usr/bin/clickhouse-keeper
8. Poco::Util::Application::run() @ 0x0000000000ffbf26 in /usr/bin/clickhouse-keeper
9. DB::Keeper::run() @ 0x0000000000b73f7e in /usr/bin/clickhouse-keeper
10. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000001012d39 in /usr/bin/clickhouse-keeper
11. mainEntryClickHouseKeeper(int, char**) @ 0x0000000000b72ef8 in /usr/bin/clickhouse-keeper
12. main @ 0x0000000000b81b1d in /usr/bin/clickhouse-keeper
 (version 23.10.5.20 (official build))
2024.08.03 05:14:40.672441 [ 22 ] {} <Error> Application: DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>)
2024.08.03 05:14:40.672446 [ 22 ] {} <Information> Application: shutting down
2024.08.03 05:14:40.672449 [ 22 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem
2024.08.03 05:14:40.672565 [ 23 ] {} <Trace> BaseDaemon: Received signal -2
2024.08.03 05:14:40.672601 [ 23 ] {} <Information> BaseDaemon: Stop SignalListener thread

reference file https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml
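
For context on the RAFT_ERROR above: clickhouse-keeper refuses to start when every <server> entry in the raft_configuration it is given carries <start_as_follower>true</start_as_follower>, because then no node is allowed to bootstrap as leader. As a rough illustration only (hostnames are assumed from the StatefulSet and headless Service in the manifests below, not taken from the failing pod), a generated /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml that keeper accepts contains at least one entry without that flag:

<yandex><keeper_server>
    <server_id>2</server_id>
    <raft_configuration>
        <!-- at least one server must be allowed to become leader,
             i.e. must NOT have <start_as_follower>true</start_as_follower> -->
        <server><id>1</id><hostname>wukong-clickhouse-keeper-0.wukong-clickhouse-keeper-hs.wukong-application.svc.cluster.local</hostname><port>9234</port><priority>1</priority></server>
        <!-- a node joining an already running ensemble may be added as a follower -->
        <server><id>2</id><hostname>wukong-clickhouse-keeper-1.wukong-clickhouse-keeper-hs.wukong-application.svc.cluster.local</hostname><port>9234</port><priority>1</priority><start_as_follower>true</start_as_follower></server>
    </raft_configuration>
</keeper_server></yandex>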

describe output

(screenshot attached as image)

k8s resource file

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: wukong-clickhouse-keeper-local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-keeper-local-pv-0
  namespace:  wukong-application
  labels:
    name: clickhouse-keeper-local-pv-0
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: wukong-clickhouse-keeper-local-storage
  hostPath:
    path: /data/tingyun/wukong/tingyun/common/clickhouse-keeper/data0
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 10.128.9.10
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-keeper-local-pv-1
  namespace:  wukong-application
  labels:
    name: clickhouse-keeper-local-pv-1
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: wukong-clickhouse-keeper-local-storage
  hostPath:
    path: /data/tingyun/wukong/tingyun/common/clickhouse-keeper/data1
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 10.128.9.10
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-keeper-local-pv-2
  namespace:  wukong-application
  labels:
    name: clickhouse-keeper-local-pv-2
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: wukong-clickhouse-keeper-local-storage
  hostPath:
    path: /data/tingyun/wukong/tingyun/common/clickhouse-keeper/data2
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 10.128.9.10
---
apiVersion: v1
kind: Service
metadata:
  name: wukong-clickhouse-keeper-hs
  namespace: wukong-application
  labels:
    app: wukong-clickhouse-keeper
spec:
  ports:
  - port:  9234
    name: raft
  clusterIP: None
  selector:
    app: wukong-clickhouse-keeper
---
apiVersion: v1
kind: Service
metadata:
  name: wukong-clickhouse-keeper
  namespace: wukong-application
  labels:
    app: wukong-clickhouse-keeper
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
    prometheus.io/port: "9363"
    prometheus.io/scrape: "true"
spec:
  ports:
  - port: 2181
    name: client
  - port: 9363
    name: prometheus
  selector:
    app: wukong-clickhouse-keeper
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: wukong-clickhouse-keeper
  namespace:  wukong-application
  labels:
    app: wukong-clickhouse-keeper
data:
  keeper_config.xml: |
    <clickhouse>
        <include_from>/tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml</include_from>
        <logger>
            <level>trace</level>
            <console>true</console>
        </logger>
        <listen_host>::</listen_host>
        <keeper_server incl="keeper_server">
            <enable_reconfiguration>true</enable_reconfiguration>
            <path>/var/lib/clickhouse-keeper</path>
            <tcp_port>2181</tcp_port>
            <four_letter_word_white_list>*</four_letter_word_white_list>
            <coordination_settings>
                <!-- <raft_logs_level>trace</raft_logs_level> -->
                <raft_logs_level>information</raft_logs_level>
            </coordination_settings>
        </keeper_server>
        <prometheus>
            <endpoint>/metrics</endpoint>
            <port>9363</port>
            <metrics>true</metrics>
            <events>true</events>
            <asynchronous_metrics>true</asynchronous_metrics>
            <status_info>true</status_info>
        </prometheus>
    </clickhouse>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: wukong-clickhouse-keeper-scripts
  namespace:  wukong-application
  labels:
    app: wukong-clickhouse-keeper-scripts
data:
  env.sh: |
    #!/usr/bin/env bash
    export DOMAIN=`hostname -d`
    export CLIENT_HOST=clickhouse-keeper
    export CLIENT_PORT=2181
    export RAFT_PORT=9234
  keeperFunctions.sh: |
    #!/usr/bin/env bash
    set -ex
    function keeperConfig() {
      echo "$HOST.$DOMAIN:$RAFT_PORT;$ROLE;$WEIGHT"
    }
    function keeperConnectionString() {
      # If the client service address is not yet available, then return localhost
      set +e
      getent hosts "${CLIENT_HOST}" 2>/dev/null 1>/dev/null
      if [[ $? -ne 0 ]]; then
        set -e
        echo "-h localhost -p ${CLIENT_PORT}"
      else
        set -e
        echo "-h ${CLIENT_HOST} -p ${CLIENT_PORT}"
      fi
    }

  keeperStart.sh: |
    #!/usr/bin/env bash
    set -ex
    source /conf/env.sh
    source /conf/keeperFunctions.sh

    HOST=`hostname -s`
    if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
      NAME=${BASH_REMATCH[1]}
      ORD=${BASH_REMATCH[2]}
    else
      echo Failed to parse name and ordinal of Pod
      exit 1
    fi
    export MY_ID=$((ORD+1))
    set +e
    getent hosts $DOMAIN
    if [[ $? -eq 0 ]]; then
      ACTIVE_ENSEMBLE=true
    else
      ACTIVE_ENSEMBLE=false
    fi
    set -e
    mkdir -p /tmp/clickhouse-keeper/config.d/
    if [[ "true" == "${ACTIVE_ENSEMBLE}" ]]; then
      # get current config from clickhouse-keeper
      CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h ${CLIENT_HOST} -p ${CLIENT_PORT} -q "get /keeper/config" || true)
      # generate dynamic config, add current server to xml
      {
        echo "<yandex><keeper_server>"
        echo "<server_id>${MY_ID}</server_id>"
        echo "<raft_configuration>"
        if [[ "0" == $(echo "${CURRENT_KEEPER_CONFIG}" | grep -c "${HOST}.${DOMAIN}") ]]; then
          echo "<server><id>${MY_ID}</id><hostname>${HOST}.${DOMAIN}</hostname><port>${RAFT_PORT}</port><priority>1</priority><start_as_follower>true</start_as_follower></server>"
        fi
        while IFS= read -r line; do
          id=$(echo "$line" | cut -d '=' -f 1 | cut -d '.' -f 2)
          if [[ "" != "${id}" ]]; then
            hostname=$(echo "$line" | cut -d '=' -f 2 | cut -d ';' -f 1 | cut -d ':' -f 1)
            port=$(echo "$line" | cut -d '=' -f 2 | cut -d ';' -f 1 | cut -d ':' -f 2)
            priority=$(echo "$line" | cut -d ';' -f 3)
            priority=${priority:-1}
            port=${port:-$RAFT_PORT}
            echo "<server><id>$id</id><hostname>$hostname</hostname><port>$port</port><priority>$priority</priority></server>"
          fi
        done <<< "$CURRENT_KEEPER_CONFIG"
        echo "</raft_configuration>"
        echo "</keeper_server></yandex>"
      } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
    else
      # generate dynamic config, add current server to xml
      {
        echo "<yandex><keeper_server>"
        echo "<server_id>${MY_ID}</server_id>"
        echo "<raft_configuration>"
        echo "<server><id>${MY_ID}</id><hostname>${HOST}.${DOMAIN}</hostname><port>${RAFT_PORT}</port><priority>1</priority></server>"
        echo "</raft_configuration>"
        echo "</keeper_server></yandex>"
      } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
    fi

    # run clickhouse-keeper
    cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
    rm -rfv /var/lib/clickhouse-keeper/terminated
    clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml

  keeperTeardown.sh: |
    #!/usr/bin/env bash
    set -ex
    exec > /proc/1/fd/1
    exec 2> /proc/1/fd/2
    source /conf/env.sh
    source /conf/keeperFunctions.sh
    set +e
    KEEPER_URL=$(keeperConnectionString)
    set -e
    HOST=`hostname -s`
    if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
        NAME=${BASH_REMATCH[1]}
        ORD=${BASH_REMATCH[2]}
    else
        echo Failed to parse name and ordinal of Pod
        exit 1
    fi
    export MY_ID=$((ORD+1))

    CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h localhost -p ${CLIENT_PORT} -q "get /keeper/config")
    CLUSTER_SIZE=$(echo -e "${CURRENT_KEEPER_CONFIG}" | grep -c -E '^server\.[0-9]+=')
    echo "CLUSTER_SIZE=$CLUSTER_SIZE, MyId=$MY_ID"
    # If CLUSTER_SIZE > 1, this server is being permanently removed from raft_configuration.
    if [[ "$CLUSTER_SIZE" -gt "1" ]]; then
      clickhouse-keeper-client --history-file=/dev/null -q "reconfig remove $MY_ID" ${KEEPER_URL}
    fi

    # Wait to remove $MY_ID from quorum
    # for (( i = 0; i < 6; i++ )); do
    #    CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h localhost -p ${CLIENT_PORT} -q "get /keeper/config")
    #    if [[ "0" == $(echo -e "${CURRENT_KEEPER_CONFIG}" | grep -c -E "^server.${MY_ID}=$HOST.+participant;[0-1]$") ]]; then
    #      echo "$MY_ID removed from quorum"
    #      break
    #    else
    #      echo "$MY_ID still present in quorum"
    #    fi
    #    sleep 1
    # done

    # Wait for client connections to drain. Kubernetes will wait until the configured
    # "terminationGracePeriodSeconds" before forcibly killing the container
    for (( i = 0; i < 3; i++ )); do
      CONN_COUNT=`echo $(exec 3<>/dev/tcp/127.0.0.1/2181 ; printf "cons" >&3 ; IFS=; tee <&3; exec 3<&- ;) | grep -v "^$" | grep -v "127.0.0.1" | wc -l`
      if [[ "$CONN_COUNT" -gt "0" ]]; then
        echo "$CONN_COUNT non-local connections still connected."
        sleep 1
      else
        echo "$CONN_COUNT non-local connections"
        break
      fi
    done

    touch /var/lib/clickhouse-keeper/terminated
    # Kill the primary process ourselves to circumvent the terminationGracePeriodSeconds
    ps -ef | grep clickhouse-keeper | grep -v grep | awk '{print $1}' | xargs kill

  keeperLive.sh: |
    #!/usr/bin/env bash
    set -ex
    source /conf/env.sh
    OK=$(exec 3<>/dev/tcp/127.0.0.1/${CLIENT_PORT} ; printf "ruok" >&3 ; IFS=; tee <&3; exec 3<&- ;)
    # Check to see if keeper service answers
    if [[ "$OK" == "imok" ]]; then
      exit 0
    else
      exit 1
    fi

  keeperReady.sh: |
    #!/usr/bin/env bash
    set -ex
    exec > /proc/1/fd/1
    exec 2> /proc/1/fd/2
    source /conf/env.sh
    source /conf/keeperFunctions.sh

    HOST=`hostname -s`

    # Check to see if clickhouse-keeper service answers
    set +e
    getent hosts $DOMAIN
    if [[ $? -ne 0 ]]; then
      echo "no active DNS records in service, first running pod"
      exit 0
    elif [[ -f /var/lib/clickhouse-keeper/terminated ]]; then
      echo "termination in progress"
      exit 0
    else
      set -e
      # An ensemble exists, check to see if this node is already a member.
      # Extract resource name and this members' ordinal value from pod hostname
      if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
        NAME=${BASH_REMATCH[1]}
        ORD=${BASH_REMATCH[2]}
      else
        echo "Failed to parse name and ordinal of Pod"
        exit 1
      fi
      MY_ID=$((ORD+1))

      CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h ${CLIENT_HOST} -p ${CLIENT_PORT} -q "get /keeper/config" || exit 0)
      # Check to see if clickhouse-keeper for this node is a participant in raft cluster
      if [[ "1" == $(echo -e "${CURRENT_KEEPER_CONFIG}" | grep -c -E "^server.${MY_ID}=${HOST}.+participant;1$") ]]; then
        echo "clickhouse-keeper instance is available and an active participant"
        exit 0
      else
        echo "clickhouse-keeper instance is ready to add as participant with 1 weight."

        ROLE=participant
        WEIGHT=1
        KEEPER_URL=$(keeperConnectionString)
        NEW_KEEPER_CONFIG=$(keeperConfig)
        clickhouse-keeper-client --history-file=/dev/null -q "reconfig add 'server.$MY_ID=$NEW_KEEPER_CONFIG'" ${KEEPER_URL}
        exit 0
      fi
    fi
---
# Setup ClickHouse Keeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # nodes would be named as clickhouse-keeper-0, clickhouse-keeper-1, clickhouse-keeper-2
  name: wukong-clickhouse-keeper
  namespace:  wukong-application
  labels:
    app: wukong-clickhouse-keeper
spec:
  selector:
    matchLabels:
      app: wukong-clickhouse-keeper
  serviceName:  wukong-clickhouse-keeper-hs
  replicas: 3
  template:
    metadata:
      labels:
        app: wukong-clickhouse-keeper
      annotations:
        prometheus.io/port: '9363'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: wukong-clickhouse-keeper-settings
          configMap:
            name: wukong-clickhouse-keeper
            items:
              - key: keeper_config.xml
                path: keeper_config.xml
        - name: wukong-clickhouse-keeper-scripts
          configMap:
            name: wukong-clickhouse-keeper-scripts
            defaultMode: 0755
      containers:
        - name: wukong-clickhouse-keeper
          imagePullPolicy: IfNotPresent
          image: "ccr.ccs.tencentyun.com/wukong-common/clickhouse-keeper:23.10.5.20"
          resources:
            requests:
              memory: "256M"
              cpu: "100m"
            limits:
              memory: "4Gi"
              cpu: "1000m"
          volumeMounts:
            - name: wukong-clickhouse-keeper-settings
              mountPath: /etc/clickhouse-keeper/
            - name: wukong-clickhouse-keeper-scripts
              mountPath: /conf/
            - name: data
              mountPath: /var/lib/clickhouse-keeper
          command:
            - /conf/keeperStart.sh
          lifecycle:
            preStop:
              exec:
                command:
                  - /conf/keeperTeardown.sh
          livenessProbe:
            exec:
              command:
                - /conf/keeperLive.sh
            initialDelaySeconds: 60
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
                - /conf/keeperReady.sh
            initialDelaySeconds: 60
            timeoutSeconds: 10
          ports:
            - containerPort: 2181
              name: client
              protocol: TCP
            - containerPort: 9234
              name: quorum
              protocol: TCP
            - containerPort: 9363
              name: metrics
              protocol: TCP
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName:  wukong-clickhouse-keeper-local-storage
      resources:
        requests:
          storage: 50Gi
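
A hedged observation from the manifests as posted (not confirmed anywhere in this thread): env.sh sets CLIENT_HOST=clickhouse-keeper, while the client Service is actually named wukong-clickhouse-keeper. If that hostname does not resolve, keeperStart.sh gets an empty /keeper/config back on any non-first pod, and the dynamic config it writes then contains only that pod itself, marked <start_as_follower>true</start_as_follower>, which is exactly the condition the RAFT_ERROR complains about. The generated file can be checked like this (pod and namespace names are taken from the StatefulSet above; adjust them if yours differ):

# keeperStart.sh cats the generated dynamic config before launching keeper,
# so it shows up near the top of the container log even while the pod crash-loops
# (add --previous if the container has just restarted)
kubectl logs -n wukong-application wukong-clickhouse-keeper-1 | grep -A 10 '<raft_configuration>'

# on a pod that stays up, the file can be read directly
kubectl exec -n wukong-application wukong-clickhouse-keeper-0 -- \
  cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
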
liubo-it commented 1 month ago

This problem also exists when I use the Helm chart.

command: helm install clickhouse-keeper --generate-name

link: https://artifacthub.io/packages/helm/duyet/clickhouse-keeper?modal=install

(screenshots attached as images)

Slach commented 1 month ago

This is not an official Helm chart.

Did you run kubectl apply -n <namespace> -f https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml just once, or did you do something else?

Application: Code: 568. DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>)

Try executing this on the live pods:

clickhouse-keeper client -q "get /keeper/config"
grep -C 10 start_as_follower -r /etc/clickhouse-keeper/
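
To run those checks across all three pods in one go, a minimal sketch, assuming the wukong-clickhouse-keeper StatefulSet and namespace from the manifests above (the dynamic part of the config lives under /tmp/clickhouse-keeper/config.d/, per the include_from setting, so it is worth grepping too):

for i in 0 1 2; do
  pod=wukong-clickhouse-keeper-$i
  echo "=== $pod ==="
  # dump the current raft membership as keeper sees it
  kubectl exec -n wukong-application "$pod" -- \
    clickhouse-keeper-client --history-file=/dev/null -h localhost -p 2181 -q "get /keeper/config" || true
  # look for start_as_follower in both the static and the generated config
  kubectl exec -n wukong-application "$pod" -- \
    grep -C 10 start_as_follower -r /etc/clickhouse-keeper/ /tmp/clickhouse-keeper/ || true
done
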
Slach commented 1 month ago

@liubo-it could you check kubectl apply -n <namespace> -f https://github.com/Altinity/clickhouse-operator/blob/0.24.0/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml ?
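
One practical note, assuming the file is applied straight from GitHub: kubectl apply needs raw YAML, and the blob URL returns an HTML page, so the raw.githubusercontent.com form of the same path is usually the one that works:

kubectl apply -n <namespace> -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.24.0/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml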

Slach commented 4 days ago

@liubo-it any news from your side?

liubo-it commented 3 days ago

> any news from your side?

It's still the same, so I went a different way.