Closed: tman5 closed this issue 12 months ago.
It's hard to say what happened without logs from before the shutdown and during the first failed start, to see where the error occurred. It's possible something was applied incorrectly.
Logs attached: keeper_logs.zip
Manifests here:
---
# Setup Service to provide access to ClickHouse keeper for clients
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper.namespace.svc
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 2181
      name: client
    - port: 7000
      name: prometheus
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup Headless Service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper-0.clickhouse-keepers.namespace.svc
  name: clickhouse-keepers
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 9444
      name: raft
  clusterIP: None
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup max number of unavailable pods in StatefulSet
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-keeper-pod-disruption-budget
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  maxUnavailable: 1
---
# Setup ClickHouse Keeper settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-keeper-settings
data:
  keeper_config.xml: |
    <clickhouse>
        <include_from>/tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml</include_from>
        <logger>
            <level>trace</level>
            <console>true</console>
        </logger>
        <listen_host>0.0.0.0</listen_host>
        <keeper_server incl="keeper_server">
            <path>/var/lib/clickhouse-keeper</path>
            <tcp_port>2181</tcp_port>
            <coordination_settings>
                <!-- <raft_logs_level>trace</raft_logs_level> -->
                <raft_logs_level>information</raft_logs_level>
            </coordination_settings>
        </keeper_server>
        <prometheus>
            <endpoint>/metrics</endpoint>
            <port>7000</port>
            <metrics>true</metrics>
            <events>true</events>
            <asynchronous_metrics>true</asynchronous_metrics>
            <!-- https://github.com/ClickHouse/ClickHouse/issues/46136 -->
            <status_info>false</status_info>
        </prometheus>
    </clickhouse>
---
# Setup ClickHouse Keeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # nodes would be named as clickhouse-keeper-0, clickhouse-keeper-1, clickhouse-keeper-2
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  serviceName: clickhouse-keepers
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: clickhouse-keeper
        what: node
      annotations:
        prometheus.io/port: '7000'
        prometheus.io/scrape: 'true'
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - clickhouse-keeper
              topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: clickhouse-keeper-settings
          configMap:
            name: clickhouse-keeper-settings
            items:
              - key: keeper_config.xml
                path: keeper_config.xml
      containers:
        - name: clickhouse-keeper
          imagePullPolicy: IfNotPresent
          image: "hub.docker.io/dockerhub/clickhouse/clickhouse-keeper:head-alpine"
          resources:
            requests:
              memory: "256M"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          volumeMounts:
            - name: clickhouse-keeper-settings
              mountPath: /etc/clickhouse-keeper/
            - name: clickhouse-keeper-datadir-volume
              mountPath: /var/lib/clickhouse-keeper
          env:
            - name: SERVERS
              value: "3"
            - name: RAFT_PORT
              value: "9444"
          command:
            - bash
            - -x
            - -c
            - |
              HOST=`hostname -s` &&
              DOMAIN=`hostname -d` &&
              if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
                  NAME=${BASH_REMATCH[1]}
                  ORD=${BASH_REMATCH[2]}
              else
                  echo "Failed to parse name and ordinal of Pod"
                  exit 1
              fi &&
              export MY_ID=$((ORD+1)) &&
              mkdir -p /tmp/clickhouse-keeper/config.d/ &&
              {
                  echo "<yandex><keeper_server>"
                  echo "<server_id>${MY_ID}</server_id>"
                  echo "<raft_configuration>"
                  for (( i=1; i<=$SERVERS; i++ )); do
                      echo "<server><id>${i}</id><hostname>$NAME-$((i-1)).${DOMAIN}</hostname><port>${RAFT_PORT}</port></server>"
                  done
                  echo "</raft_configuration>"
                  echo "</keeper_server></yandex>"
              } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              if [[ "1" == "$MY_ID" ]]; then
                  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
              else
                  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
              fi
          livenessProbe:
            exec:
              command:
                - bash
                - -xc
                - 'date && OK=$(exec 3<>/dev/tcp/127.0.0.1/2181 ; printf "ruok" >&3 ; IFS=; tee <&3; exec 3<&- ;); if [[ "$OK" == "imok" ]]; then exit 0; else exit 1; fi'
            initialDelaySeconds: 20
            timeoutSeconds: 15
          ports:
            - containerPort: 7000
              name: prometheus
  volumeClaimTemplates:
    - metadata:
        name: clickhouse-keeper-datadir-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
This is the manifest we are using: https://github.com/Altinity...
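For reference, with SERVERS=3 and the headless Service above, the generation script on pod clickhouse-keeper-0 would write a file roughly like the following (a sketch: "namespace" and the .svc.cluster.local suffix are placeholders that depend on where the StatefulSet is deployed and on cluster DNS):

    <yandex><keeper_server>
    <server_id>1</server_id>
    <raft_configuration>
    <server><id>1</id><hostname>clickhouse-keeper-0.clickhouse-keepers.namespace.svc.cluster.local</hostname><port>9444</port></server>
    <server><id>2</id><hostname>clickhouse-keeper-1.clickhouse-keepers.namespace.svc.cluster.local</hostname><port>9444</port></server>
    <server><id>3</id><hostname>clickhouse-keeper-2.clickhouse-keepers.namespace.svc.cluster.local</hostname><port>9444</port></server>
    </raft_configuration>
    </keeper_server></yandex>

This generated file is pulled in via the <include_from> element of keeper_config.xml, which is why the keeper_server section there carries incl="keeper_server".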
I don't trust this manifest (it's from a third-party company). Does the issue reproduce if you run Keeper without Kubernetes?
The manifest looks hairy; I advise throwing it away and writing your own from scratch.
Do we have a Helm chart for clickhouse-keeper?
@alexey-milovidov we have not tried running Keeper outside of K8s. We weren't planning to entertain that unless absolutely necessary, since at the moment our install is a single application that will be using Keeper.
Is there a better example of running clickhouse-keeper in Kubernetes? The only part of the config that appears to be very specific is the StatefulSet config, which includes a block that writes the clickhouse-keeper config. Are there any examples of that I could base it on?
if [[ "1" == "$MY_ID" ]]; then
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
This looks suspicious; I'm not sure it is correct.
Let's ask @antonio2368 for the details.
--force-recovery is an option that should NOT be used this way; it's a last resort for when you have lost enough nodes that quorum can no longer be achieved.
@tman5 When attaching Keeper logs, it would be helpful to set keeper_server.coordination_settings.raft_logs_level to trace in the config. That will include much more information about the replication process itself.
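For example, the coordination_settings block in the ConfigMap above would become the following (only the log level differs from the manifest; trace is the value that is already present there as a commented-out line):

    <coordination_settings>
        <!-- trace produces detailed Raft replication logs; the manifest above sets this to information -->
        <raft_logs_level>trace</raft_logs_level>
    </coordination_settings>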
So why would that if block be in there? If we remove --force-recovery, then the if statement wouldn't even be needed. It looks like it's singling out the 1st server in the cluster?
if [[ "1" == "$MY_ID" ]]; then
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
else
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
fi
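Given that advice, the safe form is to drop the conditional entirely and start every replica identically; a minimal sketch of the corrected tail of the startup script (everything before it unchanged):

    # Start every replica the same way. --force-recovery is a manual, last-resort
    # action for when quorum is permanently lost; it must not run on routine restarts.
    clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml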
@tman5 see https://github.com/Altinity/clickhouse-operator/pull/1234. Currently, if you want to scale clickhouse-keeper up or down in Kubernetes, you need to wait until https://github.com/ClickHouse/ClickHouse/pull/53481 is merged.
So will your updated manifests work? Or do we also need to wait for that PR to merge?
@tman5, these manifests are not part of the official ClickHouse product, and we don't support them. "Altinity/clickhouse-operator" is a third-party repository.
We have noticed at least one mistake in these manifests, so they cannot be used as-is. You can carefully review every line of these manifests, remove every line that you don't understand, and then it might be OK.
@alexey-milovidov are there any plans to release an official Helm chart for clickhouse-keeper?
Currently, there are no plans, but we are considering it for the future.
Note: it is hard to operate Keeper, ZooKeeper, or any other distributed consensus system in Kubernetes. If you have frequent pod restarts and combine them either with a misconfiguration (as in the example above) or with corrupted data on a single node, it can lead to a rollback of Keeper's state, leading to "intersecting parts" errors and data loss.
Upon rebooting an underlying Kubernetes node or re-creating the StatefulSet for clickhouse-keeper in k8s, the pods will sometimes come back in a CrashLoop state with errors such as:
clickhouse-keeper version 23.9.1
This issue appears to be similar to https://github.com/ClickHouse/ClickHouse/issues/42668, but this one is on K8s using a StatefulSet. This is the manifest we are using: https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-3-nodes.yaml
It seems like an order-of-operations/race-condition issue. I can't reproduce it reliably: sometimes node reboots work fine, other times the clickhouse-keeper pods come up in this crash-loop state.
A "fix" is to delete the pod and PVC and let them re-create. That brings the node back, but it's not a long-term solution.