Are all replicas in your shard stuck in the CrashLoopBackOff state, or only one?
I have 3 replicas; only one is running.
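(A quick way to confirm which replicas are crash-looping and what error they die with, assuming the clickhouse namespace from the spec below; the pod name is a placeholder following the operator's usual chi-<chi>-<cluster>-<shard>-<replica> naming:)

```sh
# List pods in the clickhouse namespace; CrashLoopBackOff replicas show up in STATUS
kubectl -n clickhouse get pods

# Read the last log lines of a crash-looping replica (placeholder pod name)
kubectl -n clickhouse logs chi-clickhouse-cluster-0-1-0 --previous --tail=50
```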
Is it a production cluster or just test/staging?
Do you use volumeClaimTemplates in your chi?
I'm trying to figure out the root cause of why your data parts broke.
This is my YAML setting. It is a production cluster.
Name: clickhouse
Namespace: clickhouse
Labels: app.kubernetes.io/managed-by=pulumi
Annotations: <none>
API Version: clickhouse.altinity.com/v1
Kind: ClickHouseInstallation
Metadata:
Creation Timestamp: 2022-02-04T06:28:04Z
Finalizers:
finalizer.clickhouseinstallation.altinity.com
Generation: 1
Managed Fields:
API Version: clickhouse.altinity.com/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"finalizer.clickhouseinstallation.altinity.com":
Manager: clickhouse-operator
Operation: Update
Time: 2022-02-04T06:28:04Z
API Version: clickhouse.altinity.com/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:labels:
.:
f:app.kubernetes.io/managed-by:
f:spec:
.:
f:configuration:
.:
f:clusters:
f:files:
f:profiles:
.:
f:default/enable_positional_arguments:
f:default/log_formatted_queries:
f:default/log_queries:
f:default/log_query_views:
f:settings:
.:
f:logger/count:
f:logger/size:
f:mysql_port:
f:storage_configuration/disks/archive/path:
f:storage_configuration/disks/cold/path:
f:storage_configuration/policies/default/volumes/default/archive/disk:
f:storage_configuration/policies/default/volumes/default/cold/disk:
f:storage_configuration/policies/default/volumes/default/disk:
f:storage_configuration/policies/multi/volumes/disk1/disk:
f:storage_configuration/policies/multi/volumes/disk2/disk:
f:storage_configuration/policies/multi/volumes/disk3/disk:
f:users:
.:
f:clickhouse_admin/access_management:
f:clickhouse_admin/http_connection_timeout:
f:clickhouse_admin/log_queries:
f:clickhouse_admin/networks/ip:
f:clickhouse_admin/password_double_sha1_hex:
f:clickhouse_admin/profile:
f:clickhouse_admin/quota:
f:clickhouse_admin/skip_unavailable_shards:
f:zookeeper:
.:
f:nodes:
f:defaults:
.:
f:templates:
.:
f:logVolumeClaimTemplate:
f:podTemplate:
f:templates:
.:
f:podTemplates:
f:volumeClaimTemplates:
Manager: pulumi-resource-kubernetes
Operation: Update
Time: 2022-02-04T06:28:04Z
Resource Version: 326494835
UID: 23ba9b2b-a8cb-4a9d-afb1-9c0639b31556
Spec:
Configuration:
Clusters:
Layout:
Replicas Count: 3
Shards Count: 1
Name: cluster
Files:
Profiles:
default/enable_positional_arguments: 1
default/log_formatted_queries: 1
default/log_queries: 1
default/log_query_views: 1
Settings:
logger/count: 10
logger/size: 1000M
mysql_port: 9004
storage_configuration/disks/archive/path: /var/lib/clickhouse-archive/
storage_configuration/disks/cold/path: /var/lib/clickhouse-cold/
storage_configuration/policies/default/volumes/default/archive/disk: clickhouse-archive-volume
storage_configuration/policies/default/volumes/default/cold/disk: clickhouse-cold-volume
storage_configuration/policies/default/volumes/default/disk: default
storage_configuration/policies/multi/volumes/disk1/disk: default
storage_configuration/policies/multi/volumes/disk2/disk: cold
storage_configuration/policies/multi/volumes/disk3/disk: archive
Users:
clickhouse_admin/access_management: 1
clickhouse_admin/http_connection_timeout: 10
clickhouse_admin/log_queries: 1
clickhouse_admin/networks/ip:
127.0.0.1
0.0.0.0/0
::/0
clickhouse_admin/password_double_sha1_hex: 1e52ed3390576ce6f28e944fbfdc6f4003510e5e
clickhouse_admin/profile: default
clickhouse_admin/quota: default
clickhouse_admin/skip_unavailable_shards: 1
Zookeeper:
Nodes:
Host: zookeeper.clickhouse-zookeeper
Defaults:
Templates:
Log Volume Claim Template: clickhouse-log-volume
Pod Template: clickhouse-pod
Templates:
Pod Templates:
Name: clickhouse-pod
Pod Distribution:
Type: ReplicaAntiAffinity
Type: ShardAntiAffinity
Spec:
Affinity:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: name
Operator: In
Values:
clickhouse-pool
Containers:
Image: docker.io/yandex/clickhouse-server:22.1
Liveness Probe:
Failure Threshold: 10
Http Get:
Path: /ping
Port: http
Scheme: HTTP
Initial Delay Seconds: 60
Period Seconds: 3
Success Threshold: 1
Timeout Seconds: 1
Name: clickhouse
Ports:
Container Port: 9000
Name: tcp
Protocol: TCP
Container Port: 8123
Name: http
Protocol: TCP
Container Port: 9009
Name: interserver
Protocol: TCP
Container Port: 9004
Name: mysql
Protocol: TCP
Readiness Probe:
Failure Threshold: 3
Http Get:
Path: /ping
Port: http
Scheme: HTTP
Initial Delay Seconds: 60
Period Seconds: 3
Success Threshold: 1
Timeout Seconds: 1
Resources:
Limits:
Cpu: 1.5
Memory: 10Gi
Requests:
Cpu: 1.5
Memory: 10Gi
Volume Mounts:
Mount Path: /var/lib/clickhouse-archive
Mount Propagation: HostToContainer
Name: clickhouse-archive-volume
Mount Path: /var/lib/clickhouse-cold
Name: clickhouse-cold-volume
Mount Path: /var/lib/clickhouse
Name: clickhouse-hot-volume
Init Containers:
Command:
sh
-c
chown 101 /var/lib/clickhouse-cold && chown 101 /var/lib/clickhouse-archive && chown 101 /var/lib/clickhouse && echo "prepared"
Image: busybox:1.28
Name: init-clickhouse
Volume Mounts:
Mount Path: /var/lib/clickhouse-archive
Name: clickhouse-archive-volume
Mount Path: /var/lib/clickhouse-cold
Name: clickhouse-cold-volume
Mount Path: /var/lib/clickhouse
Name: clickhouse-hot-volume
Tolerations:
Effect: NoSchedule
Key: clickhouse-pool
Operator: Equal
Value: true
Volume Claim Templates:
Name: clickhouse-hot-volume
Reclaim Policy: Retain
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 50Gi
Storage Class Name: standard-rwo
Name: clickhouse-cold-volume
Reclaim Policy: Retain
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 200Gi
Storage Class Name: standard-hdd
Name: clickhouse-archive-volume
Reclaim Policy: Retain
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 11Ti
Storage Class Name: juicefs-sc
Name: clickhouse-log-volume
Reclaim Policy: Retain
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 5Gi
Storage Class Name: standard-rwo
Events: <none>
Add touch /var/lib/clickhouse/flags/force_restore_data manually to spec.template.spec.initContainers[0].command in the broken StatefulSet, e.g. clickhouse-cluster-0-0-0. After that, the StatefulSet will re-create the pod and the pod will start. Then change the spec.taskID field of the chi (kind: ClickHouseInstallation) named clickhouse to trigger an operator reconcile cycle and return the original command value to the initContainers.
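A minimal sketch of those two steps as kubectl commands; the StatefulSet name is taken from the message above and the taskID value is an arbitrary placeholder, so adjust both to your cluster:

```sh
# 1. Patch the broken StatefulSet so its init container also creates the restore flag.
#    The command below is the original chown chain from the chi, with the touch appended.
kubectl -n clickhouse patch statefulset clickhouse-cluster-0-0-0 --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/0/command",
   "value": ["sh", "-c",
     "chown 101 /var/lib/clickhouse-cold && chown 101 /var/lib/clickhouse-archive && chown 101 /var/lib/clickhouse && touch /var/lib/clickhouse/flags/force_restore_data && echo prepared"]}
]'

# 2. Delete the crash-looping pod so the StatefulSet re-creates it with the patched command.
kubectl -n clickhouse delete pod <broken-pod-name>

# 3. Once the replica recovers, bump spec.taskID in the chi to trigger a reconcile cycle,
#    which restores the original initContainers command from the chi template.
kubectl -n clickhouse patch chi clickhouse --type='merge' -p '{"spec":{"taskID":"restore-2022-02-04"}}'
```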
Unfortunately, you can't just set spec.troubleshoot: yes inside the chi, because only 2 out of 3 replicas are broken and it is a production cluster (troubleshoot mode would apply to every replica, including the healthy one).
Moreover, we don't have experience with JuiceFS; it stores its metadata in Redis. Are you sure your JuiceFS is OK?
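(As a quick Kubernetes-side sanity check, sketched here with placeholder names, you can at least confirm that the PVCs on the juicefs-sc storage class are Bound and that the JuiceFS CSI pods are running:)

```sh
# Confirm the archive PVCs (storage class juicefs-sc) are Bound and look at their events
kubectl -n clickhouse get pvc
kubectl -n clickhouse describe pvc <archive-pvc-name>   # placeholder; copy the name from the list above

# Check the JuiceFS CSI driver pods; the namespace depends on how it was installed
kubectl get pods --all-namespaces | grep -i juicefs
```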
I need to create the force_restore_data flag, but my cluster won't start and keeps crashing due to the corrupted parts.
Any ideas on how to apply the flag from the cluster configuration (the chi itself)?
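One way to do that from the chi itself would be to append the touch to the init container command in the podTemplate, so the operator rolls it out to the broken replicas. This is only a sketch of the relevant fragment (edit these lines inside the existing chi rather than applying the fragment standalone), and the flag should be removed again once the replicas recover, since the init container would otherwise re-create it on every pod restart:

```yaml
# Fragment of the existing chi (kind: ClickHouseInstallation) -- only the init
# container command changes; everything else in the podTemplate stays as-is.
spec:
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          initContainers:
            - name: init-clickhouse
              image: busybox:1.28
              command:
                - sh
                - -c
                # original chown chain plus the one-shot force_restore_data flag
                - >-
                  chown 101 /var/lib/clickhouse-cold &&
                  chown 101 /var/lib/clickhouse-archive &&
                  chown 101 /var/lib/clickhouse &&
                  touch /var/lib/clickhouse/flags/force_restore_data &&
                  echo prepared
```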