ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Clickhouse-Keeper on Kubernetes - Node/Pod Restart Issues #55219

Closed · tman5 closed this 12 months ago

tman5 commented 1 year ago

Upon rebooting an underlying Kubernetes node or re-creating the StatefulSet for clickhouse-keeper in k8s, the pods will sometimes come back in a CrashLoopBackOff state with errors such as:

2023.10.03 18:13:18.061959 [ 30 ] {} <Error> bool DB::KeeperStateMachine::preprocess(const KeeperStorage::RequestForSession &): Failed to preprocess stored log, aborting to avoid inconsistent state: Code: 49. DB::Exception: Got new ZXID (52749) smaller or equal to current ZXID (52750). It's a bug. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e3247b in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<long&, long&>(int, FormatStringHelperImpl<std::type_identity<long&>::type, std::type_identity<long&>::type>, long&, long&) @ 0x0000000000871562 in /usr/bin/clickhouse-keeper
2. DB::KeeperStorage::preprocessRequest(std::shared_ptr<Coordination::ZooKeeperRequest> const&, long, long, long, bool, std::optional<DB::KeeperStorage::Digest>, long) @ 0x00000000008748c6 in /usr/bin/clickhouse-keeper
3. DB::KeeperStateMachine::preprocess(DB::KeeperStorage::RequestForSession const&) @ 0x000000000084ffa1 in /usr/bin/clickhouse-keeper
4. DB::KeeperStateMachine::pre_commit(unsigned long, nuraft::buffer&) @ 0x000000000084e762 in /usr/bin/clickhouse-keeper
5. nuraft::raft_server::handle_append_entries(nuraft::req_msg&) @ 0x0000000000c40371 in /usr/bin/clickhouse-keeper
6. nuraft::raft_server::process_req(nuraft::req_msg&, nuraft::raft_server::req_ext_params const&) @ 0x0000000000c0706e in /usr/bin/clickhouse-keeper
7. nuraft::rpc_session::read_complete(std::shared_ptr<nuraft::buffer>, std::shared_ptr<nuraft::buffer>) @ 0x0000000000bec4a8 in /usr/bin/clickhouse-keeper
8. nuraft::rpc_session::read_log_data(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long) @ 0x0000000000bed14a in /usr/bin/clickhouse-keeper
9. boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::__bind<void (nuraft::rpc_session::*)(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long), std::shared_ptr<nuraft::rpc_session> const&, std::shared_ptr<nuraft::buffer>&, std::placeholders::__ph<1> const&, std::placeholders::__ph<2> const&>>::operator()(boost::system::error_code, unsigned long, int) @ 0x0000000000bf8e13 in /usr/bin/clickhouse-keeper
10. boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::__bind<void (nuraft::rpc_session::*)(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long), std::shared_ptr<nuraft::rpc_session> const&, std::shared_ptr<nuraft::buffer>&, std::placeholders::__ph<1> const&, std::placeholders::__ph<2> const&>>, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) @ 0x0000000000bf9739 in /usr/bin/clickhouse-keeper
11. boost::asio::detail::scheduler::run(boost::system::error_code&) @ 0x0000000000ba9eb9 in /usr/bin/clickhouse-keeper
12. nuraft::asio_service_impl::worker_entry() @ 0x0000000000ba2db2 in /usr/bin/clickhouse-keeper
13. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, std::__bind<void (nuraft::asio_service_impl::*)(), nuraft::asio_service_impl*>>>(void*) @ 0x0000000000bafa1a in /usr/bin/clickhouse-keeper
 (version 23.10.1.134 (official build))
2023.10.03 18:13:18.063310 [ 12 ] {} <Trace> BaseDaemon: Received signal 6
2023.10.03 18:13:18.063510 [ 34 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2023.10.03 18:13:18.063531 [ 34 ] {} <Fatal> BaseDaemon: (version 23.10.1.134 (official build), build id: 8062681F29E88365723319EBF0E96CB2D91AA0D5, git hash: 7b6548157c9216eb52867ba10c8c9bd30a3492dd) (from thread 30) Received signal 6
2023.10.03 18:13:18.063538 [ 34 ] {} <Fatal> BaseDaemon: Signal description: Aborted
2023.10.03 18:13:18.063542 [ 34 ] {} <Fatal> BaseDaemon: 
2023.10.03 18:13:18.063548 [ 34 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a3ff30
2023.10.03 18:13:18.063551 [ 34 ] {} <Fatal> BaseDaemon: ########################################
2023.10.03 18:13:18.063559 [ 34 ] {} <Fatal> BaseDaemon: (version 23.10.1.134 (official build), build id: 8062681F29E88365723319EBF0E96CB2D91AA0D5, git hash: 7b6548157c9216eb52867ba10c8c9bd30a3492dd) (from thread 30) (no query) Received signal Aborted (6)
2023.10.03 18:13:18.063563 [ 34 ] {} <Fatal> BaseDaemon: 
2023.10.03 18:13:18.063566 [ 34 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a3ff30
2023.10.03 18:13:18.063619 [ 34 ] {} <Fatal> BaseDaemon: 0. signalHandler(int, siginfo_t*, void*) @ 0x0000000000a3ff30 in /usr/bin/clickhouse-keeper
2023.10.03 18:13:18.063629 [ 34 ] {} <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
2023.10.03 18:13:18.063636 [ 34 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues
bash: line 27:    11 Aborted                 (core dumped) clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml

clickhouse-keeper version 23.9.1

This issue appears to be similar to https://github.com/ClickHouse/ClickHouse/issues/42668, however this one is on K8s using a StatefulSet. This is the manifest we are using: https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-3-nodes.yaml

It seems like it's an order of operations/race condition issue. I can't reproduce it reliably. Sometimes node reboots work fine. Other times the clickhouse-keeper pods will come up in this crashloop state.

A "fix" is to delete the pod and PVC and let it re-create. That will bring it back but it's not a long term solution.

antonio2368 commented 1 year ago

It's hard to say what happened without the logs before shutdown and during the first failed start to see where the error happened. It's possible something was incorrectly applied.
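
For a pod stuck in CrashLoopBackOff, the output of the previous (failed) container attempt can usually still be pulled from the kubelet, for example:

# Logs from the last terminated container of the crash-looping replica (pod name is an example).
kubectl logs clickhouse-keeper-0 --previous
# Restart history and recent events for the same pod.
kubectl describe pod clickhouse-keeper-0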

tman5 commented 1 year ago

Logs attached keeper_logs.zip

Manifests here:

---
# Setup Service to provide access to ClickHouse keeper for clients
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper.namespace.svc
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 2181
      name: client
    - port: 7000
      name: prometheus
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup Headless Service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper-0.clickhouse-keepers.namespace.svc
  name: clickhouse-keepers
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 9444
      name: raft
  clusterIP: None
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup max number of unavailable pods in StatefulSet
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-keeper-pod-disruption-budget
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  maxUnavailable: 1
---
# Setup ClickHouse Keeper settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-keeper-settings
data:
  keeper_config.xml: |
    <clickhouse>
        <include_from>/tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml</include_from>
        <logger>
            <level>trace</level>
            <console>true</console>
        </logger>
        <listen_host>0.0.0.0</listen_host>
        <keeper_server incl="keeper_server">
            <path>/var/lib/clickhouse-keeper</path>
            <tcp_port>2181</tcp_port>
            <coordination_settings>
                <!-- <raft_logs_level>trace</raft_logs_level> -->
                <raft_logs_level>information</raft_logs_level>
            </coordination_settings>
        </keeper_server>
        <prometheus>
            <endpoint>/metrics</endpoint>
            <port>7000</port>
            <metrics>true</metrics>
            <events>true</events>
            <asynchronous_metrics>true</asynchronous_metrics>
            <!-- https://github.com/ClickHouse/ClickHouse/issues/46136 -->
            <status_info>false</status_info>
        </prometheus>
    </clickhouse>

---
# Setup ClickHouse Keeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # nodes would be named as clickhouse-keeper-0, clickhouse-keeper-1, clickhouse-keeper-2
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  serviceName: clickhouse-keepers
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: clickhouse-keeper
        what: node
      annotations:
        prometheus.io/port: '7000'
        prometheus.io/scrape: 'true'
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - clickhouse-keeper
              topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: clickhouse-keeper-settings
          configMap:
            name: clickhouse-keeper-settings
            items:
              - key: keeper_config.xml
                path: keeper_config.xml
      containers:
        - name: clickhouse-keeper
          imagePullPolicy: IfNotPresent
          image: "hub.docker.io/dockerhub/clickhouse/clickhouse-keeper:head-alpine"
          resources:
            requests:
              memory: "256M"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          volumeMounts:
            - name: clickhouse-keeper-settings
              mountPath: /etc/clickhouse-keeper/
            - name: clickhouse-keeper-datadir-volume
              mountPath: /var/lib/clickhouse-keeper
          env:
            - name: SERVERS
              value: "3"
            - name: RAFT_PORT
              value: "9444"
          command:
            - bash
            - -x
            - -c
            - |
              HOST=`hostname -s` &&
              DOMAIN=`hostname -d` &&
              if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
                  NAME=${BASH_REMATCH[1]}
                  ORD=${BASH_REMATCH[2]}
              else
                  echo "Failed to parse name and ordinal of Pod"
                  exit 1
              fi &&
              export MY_ID=$((ORD+1)) &&
              mkdir -p /tmp/clickhouse-keeper/config.d/ &&
              {
                echo "<yandex><keeper_server>"
                echo "<server_id>${MY_ID}</server_id>"
                echo "<raft_configuration>"
                for (( i=1; i<=$SERVERS; i++ )); do
                    echo "<server><id>${i}</id><hostname>$NAME-$((i-1)).${DOMAIN}</hostname><port>${RAFT_PORT}</port></server>"
                done
                echo "</raft_configuration>"
                echo "</keeper_server></yandex>"
              } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              if [[ "1" == "$MY_ID" ]]; then
                clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
              else
                clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
              fi
          livenessProbe:
            exec:
              command:
                - bash
                - -xc
                - 'date && OK=$(exec 3<>/dev/tcp/127.0.0.1/2181 ; printf "ruok" >&3 ; IFS=; tee <&3; exec 3<&- ;); if [[ "$OK" == "imok" ]]; then exit 0; else exit 1; fi'
            initialDelaySeconds: 20
            timeoutSeconds: 15
          ports:
            - containerPort: 7000
              name: prometheus
  volumeClaimTemplates:
    - metadata:
        name: clickhouse-keeper-datadir-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi

alexey-milovidov commented 1 year ago

This is the manifest we are using: https://github.com/Altinity...

I don't trust this manifest (it's from a third-party company). Does the issue reproduce if you run Keeper without Kubernetes?
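
For a quick comparison outside Kubernetes, a single Keeper instance can be run locally with the same image and a standalone config. This is only a sketch; it assumes a self-contained keeper_config.xml with the raft_configuration inlined rather than pulled in via include_from, and a local ./keeper-data directory:

# Run one clickhouse-keeper instance outside Kubernetes for comparison.
docker run --rm -it \
  -v "$(pwd)/keeper_config.xml:/etc/clickhouse-keeper/keeper_config.xml" \
  -v "$(pwd)/keeper-data:/var/lib/clickhouse-keeper" \
  clickhouse/clickhouse-keeper:head-alpine \
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml

Reproducing the original report would still need three such instances and repeated restarts, but this at least takes the StatefulSet machinery out of the picture.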

alexey-milovidov commented 1 year ago

The manifest looks hairy; I advise throwing it away and writing your own from scratch.

jaigopinathmalempati commented 1 year ago

Do we have a Helm chart for clickhouse-keeper?

tman5 commented 1 year ago

@alexey-milovidov we have not tried running Keeper outside of K8s. We weren't planning to entertain that unless absolutely necessary, since at the moment our install is a single application that will be using Keeper.

Is there a better example of running clickhouse-keeper in Kubernetes? The only part of the config that appears to be very specific is the StatefulSet config, which includes a block that writes out the clickhouse-keeper config. Are there any examples of that I could base it on?

alexey-milovidov commented 1 year ago

if [[ "1" == "$MY_ID" ]]; then
    clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery

This looks suspicious, not sure if it is correct.

alexey-milovidov commented 1 year ago

Let's ask @antonio2368 for the details.

antonio2368 commented 12 months ago

--force-recovery should NOT be used in such a way; it's a last-resort option for when you have lost enough nodes that quorum can no longer be achieved.
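
For reference, a rough sketch of what a genuine last-resort use could look like (this is not the normal startup path, and the exact recovery steps should be checked against the Keeper documentation):

# Last-resort recovery sketch, only after quorum is permanently lost:
# 1. Edit the surviving node's raft_configuration to list only the nodes being kept.
# 2. Start that single node with the flag so it can form a new, smaller cluster:
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
# 3. Bring the remaining/replacement nodes up afterwards with the updated configuration.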

antonio2368 commented 12 months ago

@tman5 When including logs for Keeper, it would be helpful to set keeper_server.coordination_settings.raft_logs_level to trace in the config. There will be much more information about the replication process itself.

tman5 commented 12 months ago

So why would that if block be in there? If we remove --force-recovery, then the if statement wouldn't even be needed. It looks like it's checking for the 1st server in the cluster?

if [[ "1" == "$MY_ID" ]]; then
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
else
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
fi
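
Presumably the simplified version would just launch every replica the same way, something like:

# No special-casing of server 1, no --force-recovery; each replica starts identically.
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml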

Slach commented 12 months ago

@tman5 look at https://github.com/Altinity/clickhouse-operator/pull/1234. Currently, if you want to scale clickhouse-keeper up or down in Kubernetes, you need to wait until https://github.com/ClickHouse/ClickHouse/pull/53481 is merged.

tman5 commented 12 months ago

So will your updated manifests work? Or do we also need to wait for that PR to merge?

alexey-milovidov commented 12 months ago

@tman5, these manifests are not part of the official ClickHouse product, and we don't support them. "Altinity/clickhouse-operator" is a third-party repository.

We have noticed at least one mistake in these manifests, so they cannot be used. You can carefully review every line of code of these manifests, remove every line that you don't understand, and then it might be ok.

tman5 commented 11 months ago

@alexey-milovidov are there any plans to release an official Helm chart for clickhouse-keeper?

alexey-milovidov commented 11 months ago

Currently, there are no plans, but we are considering it for the future.

Note: it is hard to operate Keeper, ZooKeeper, or any other distributed consensus system in Kubernetes. If you have frequent pod restarts and combine them with either a misconfiguration (as in the example above) or corrupted data on a single node, it can lead to a rollback of the Keeper's state, leading to "intersecting parts" errors and data loss.
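
As a rough way to spot-check that kind of divergence after restarts, the role and Zxid reported by each replica can be compared via the ZooKeeper-style four-letter-word commands Keeper serves on its client port (2181 in the manifest above). A sketch that reuses the same /dev/tcp trick as the liveness probe, assuming srvr is allowed by four_letter_word_white_list (it is by default):

# Print role (leader/follower) and current Zxid for each replica of the StatefulSet.
for i in 0 1 2; do
  echo "--- clickhouse-keeper-$i ---"
  kubectl exec "clickhouse-keeper-$i" -- bash -c \
    'exec 3<>/dev/tcp/127.0.0.1/2181; printf "srvr" >&3; cat <&3; exec 3<&-'
done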