Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes
https://altinity.com
Apache License 2.0

How to set flags? #972

Closed · rsaphala closed 2 years ago

rsaphala commented 2 years ago

I need to run this command:

sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data

However, my cluster won't start and keeps crashing due to corrupted parts:

2022.07.05 15:17:03.037900 [ 246 ] {} <Error> prod_src_events.diaenne_convlogs (66c00f33-6846-4718-85e7-37666ed867ee): Detaching broken part /var/lib/clickhouse/store/66c/66c00f33-6846-4718-85e7-37666ed867ee/202206_228681_228685_1 (size: 0.00 B). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually
2022.07.05 15:17:03.039666 [ 246 ] {} <Error> auto DB::MergeTreeData::loadDataPartsFromDisk(DB::MergeTreeData::DataPartsVector &, DB::MergeTreeData::DataPartsVector &, ThreadPool &, size_t, std::queue<std::vector<std::pair<String, DiskPtr>>> &, bool, const DB::MergeTreeSettingsPtr &)::(anonymous class)::operator()(const DB::String &, const DB::DiskPtr &) const: Code: 27. DB::ParsingException: Cannot parse input: expected 'columns format version: 1\n' at end of stream. (CANNOT_PARSE_INPUT_ASSERTION_FAILED), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa82d07a in /usr/bin/clickhouse

Any ideas on how to apply the flag from the cluster configuration?
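
(For context: because the pods are crash-looping, the flag can't simply be created at runtime; something along these lines only works while the container is up. The pod name here is hypothetical, following the operator's naming convention:)

kubectl -n clickhouse exec chi-clickhouse-cluster-0-0-0 -c clickhouse -- \
  touch /var/lib/clickhouse/flags/force_restore_data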

Slach commented 2 years ago

Are all the replicas in the shard stuck in CrashLoopBackOff, or only one?

rsaphala commented 2 years ago

I have 3 replicas, only one is running

Slach commented 2 years ago

Is it a production cluster or just test/staging?

Do you use volumeClaimTemplates: in your chi? I'm trying to figure out the root cause of why your data parts broke.

rsaphala commented 2 years ago

This is my YAML configuration, and this is a production cluster:


Name:         clickhouse
Namespace:    clickhouse
Labels:       app.kubernetes.io/managed-by=pulumi
Annotations:  <none>
API Version:  clickhouse.altinity.com/v1
Kind:         ClickHouseInstallation
Metadata:
  Creation Timestamp:  2022-02-04T06:28:04Z
  Finalizers:
    finalizer.clickhouseinstallation.altinity.com
  Generation:  1
  Managed Fields:
    API Version:  clickhouse.altinity.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.clickhouseinstallation.altinity.com":
    Manager:      clickhouse-operator
    Operation:    Update
    Time:         2022-02-04T06:28:04Z
    API Version:  clickhouse.altinity.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:configuration:
          .:
          f:clusters:
          f:files:
          f:profiles:
            .:
            f:default/enable_positional_arguments:
            f:default/log_formatted_queries:
            f:default/log_queries:
            f:default/log_query_views:
          f:settings:
            .:
            f:logger/count:
            f:logger/size:
            f:mysql_port:
            f:storage_configuration/disks/archive/path:
            f:storage_configuration/disks/cold/path:
            f:storage_configuration/policies/default/volumes/default/archive/disk:
            f:storage_configuration/policies/default/volumes/default/cold/disk:
            f:storage_configuration/policies/default/volumes/default/disk:
            f:storage_configuration/policies/multi/volumes/disk1/disk:
            f:storage_configuration/policies/multi/volumes/disk2/disk:
            f:storage_configuration/policies/multi/volumes/disk3/disk:
          f:users:
            .:
            f:clickhouse_admin/access_management:
            f:clickhouse_admin/http_connection_timeout:
            f:clickhouse_admin/log_queries:
            f:clickhouse_admin/networks/ip:
            f:clickhouse_admin/password_double_sha1_hex:
            f:clickhouse_admin/profile:
            f:clickhouse_admin/quota:
            f:clickhouse_admin/skip_unavailable_shards:
          f:zookeeper:
            .:
            f:nodes:
        f:defaults:
          .:
          f:templates:
            .:
            f:logVolumeClaimTemplate:
            f:podTemplate:
        f:templates:
          .:
          f:podTemplates:
          f:volumeClaimTemplates:
    Manager:         pulumi-resource-kubernetes
    Operation:       Update
    Time:            2022-02-04T06:28:04Z
  Resource Version:  326494835
  UID:               23ba9b2b-a8cb-4a9d-afb1-9c0639b31556
Spec:
  Configuration:
    Clusters:
      Layout:
        Replicas Count:  3
        Shards Count:    1
      Name:              cluster
    Files:
    Profiles:
      default/enable_positional_arguments:  1
      default/log_formatted_queries:        1
      default/log_queries:                  1
      default/log_query_views:              1
    Settings:
      logger/count:                                                         10
      logger/size:                                                          1000M
      mysql_port:                                                           9004
      storage_configuration/disks/archive/path:                             /var/lib/clickhouse-archive/
      storage_configuration/disks/cold/path:                                /var/lib/clickhouse-cold/
      storage_configuration/policies/default/volumes/default/archive/disk:  clickhouse-archive-volume
      storage_configuration/policies/default/volumes/default/cold/disk:     clickhouse-cold-volume
      storage_configuration/policies/default/volumes/default/disk:          default
      storage_configuration/policies/multi/volumes/disk1/disk:              default
      storage_configuration/policies/multi/volumes/disk2/disk:              cold
      storage_configuration/policies/multi/volumes/disk3/disk:              archive
    Users:
      clickhouse_admin/access_management:        1
      clickhouse_admin/http_connection_timeout:  10
      clickhouse_admin/log_queries:              1
      clickhouse_admin/networks/ip:
        127.0.0.1
        0.0.0.0/0
        ::/0
      clickhouse_admin/password_double_sha1_hex:  1e52ed3390576ce6f28e944fbfdc6f4003510e5e
      clickhouse_admin/profile:                   default
      clickhouse_admin/quota:                     default
      clickhouse_admin/skip_unavailable_shards:   1
    Zookeeper:
      Nodes:
        Host:  zookeeper.clickhouse-zookeeper
  Defaults:
    Templates:
      Log Volume Claim Template:  clickhouse-log-volume
      Pod Template:               clickhouse-pod
  Templates:
    Pod Templates:
      Name:  clickhouse-pod
      Pod Distribution:
        Type:  ReplicaAntiAffinity
        Type:  ShardAntiAffinity
      Spec:
        Affinity:
          Node Affinity:
            Required During Scheduling Ignored During Execution:
              Node Selector Terms:
                Match Expressions:
                  Key:       name
                  Operator:  In
                  Values:
                    clickhouse-pool
        Containers:
          Image:  docker.io/yandex/clickhouse-server:22.1
          Liveness Probe:
            Failure Threshold:  10
            Http Get:
              Path:                 /ping
              Port:                 http
              Scheme:               HTTP
            Initial Delay Seconds:  60
            Period Seconds:         3
            Success Threshold:      1
            Timeout Seconds:        1
          Name:                     clickhouse
          Ports:
            Container Port:  9000
            Name:            tcp
            Protocol:        TCP
            Container Port:  8123
            Name:            http
            Protocol:        TCP
            Container Port:  9009
            Name:            interserver
            Protocol:        TCP
            Container Port:  9004
            Name:            mysql
            Protocol:        TCP
          Readiness Probe:
            Failure Threshold:  3
            Http Get:
              Path:                 /ping
              Port:                 http
              Scheme:               HTTP
            Initial Delay Seconds:  60
            Period Seconds:         3
            Success Threshold:      1
            Timeout Seconds:        1
          Resources:
            Limits:
              Cpu:     1.5
              Memory:  10Gi
            Requests:
              Cpu:     1.5
              Memory:  10Gi
          Volume Mounts:
            Mount Path:         /var/lib/clickhouse-archive
            Mount Propagation:  HostToContainer
            Name:               clickhouse-archive-volume
            Mount Path:         /var/lib/clickhouse-cold
            Name:               clickhouse-cold-volume
            Mount Path:         /var/lib/clickhouse
            Name:               clickhouse-hot-volume
        Init Containers:
          Command:
            sh
            -c
            chown 101 /var/lib/clickhouse-cold && chown 101 /var/lib/clickhouse-archive && chown 101 /var/lib/clickhouse && echo "prepared"
          Image:  busybox:1.28
          Name:   init-clickhouse
          Volume Mounts:
            Mount Path:  /var/lib/clickhouse-archive
            Name:        clickhouse-archive-volume
            Mount Path:  /var/lib/clickhouse-cold
            Name:        clickhouse-cold-volume
            Mount Path:  /var/lib/clickhouse
            Name:        clickhouse-hot-volume
        Tolerations:
          Effect:    NoSchedule
          Key:       clickhouse-pool
          Operator:  Equal
          Value:     true
    Volume Claim Templates:
      Name:            clickhouse-hot-volume
      Reclaim Policy:  Retain
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         50Gi
        Storage Class Name:  standard-rwo
      Name:                  clickhouse-cold-volume
      Reclaim Policy:        Retain
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         200Gi
        Storage Class Name:  standard-hdd
      Name:                  clickhouse-archive-volume
      Reclaim Policy:        Retain
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         11Ti
        Storage Class Name:  juicefs-sc
      Name:                  clickhouse-log-volume
      Reclaim Policy:        Retain
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         5Gi
        Storage Class Name:  standard-rwo
Events:                      <none>

Slach commented 2 years ago

Manually add touch /var/lib/clickhouse/flags/force_restore_data to spec.template.spec.initContainers[0].command in the broken statefulset (e.g. clickhouse-cluster-0-0-0). The statefulset will then re-create the pod, and the pod will start. Afterwards, change the spec.taskID field in your chi (kind: ClickHouseInstallation, name: clickhouse) to trigger an operator reconcile cycle, which restores the original initContainers command. A sketch follows below.
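
A minimal sketch of that procedure, assuming the operator's usual chi-<chi>-<cluster>-<shard>-<replica> naming for the statefulset (verify with kubectl get sts -n clickhouse; the taskID value is arbitrary):

# 1. Edit the broken replica's statefulset:
kubectl -n clickhouse edit statefulset chi-clickhouse-cluster-0-0

# 2. In spec.template.spec.initContainers[0].command, extend the existing
#    sh -c string so the init container also creates the flag:
#      ... && chown 101 /var/lib/clickhouse &&
#      mkdir -p /var/lib/clickhouse/flags &&
#      touch /var/lib/clickhouse/flags/force_restore_data &&
#      chown -R 101 /var/lib/clickhouse/flags && echo "prepared"

# 3. The statefulset re-creates the pod; on startup ClickHouse sees the flag
#    and restores the replicated tables from the healthy replica via ZooKeeper.

# 4. Once recovered, bump spec.taskID to trigger a reconcile that restores
#    the original initContainers command:
kubectl -n clickhouse patch chi clickhouse --type merge -p '{"spec":{"taskID":"force-restore-rollback"}}'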

Unfortunately, you can't just set this inside the chi:

spec:
  troubleshoot: yes

because only 2 of 3 replicas are broken and it is a production cluster; troubleshoot mode applies to the whole installation, so it would take down the healthy replica as well.

Moreover, we don't have experience with JuiceFS; it stores its metadata in Redis. Are you sure your JuiceFS is OK?
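
(If you want to check, JuiceFS ships its own consistency tooling; the Redis metadata URL below is a placeholder for whatever your juicefs-sc StorageClass actually uses:)

juicefs status redis://<redis-host>:6379/1
juicefs fsck redis://<redis-host>:6379/1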