Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse® clusters running on Kubernetes
https://altinity.com
Apache License 2.0

Unable to run distributed DDL #856

Open ragsarang opened 2 years ago


ragsarang commented 2 years ago

Describe the unexpected behaviour: When I execute a statement with ON CLUSTER, it should run on all the shards and replicas; however, it is timing out.

How to reproduce

  • Which ClickHouse server version to use: 21.12.3.32
  • Which interface to use, if matters: Clickhouse-operator
  • CREATE TABLE statements for all tables involved:
CREATE TABLE events_local ON CLUSTER '{cluster}' (
    event_date  Date,
    event_type  Int32,
    article_id  Int32,
    title       String
) ENGINE = ReplicatedMergeTree('/clickhouse/{installation}/{cluster}/tables/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, article_id);
  • Queries to run that lead to unexpected result: the statement above, and also the create database statement: CREATE DATABASE test ON CLUSTER '{cluster}';

Expected behavior: The statement should be executed and the database/table should be created successfully.

Error message and/or stacktrace

Query id: 98c72107-9eab-40be-b56e-11dcefbf4e59

0 rows in set. Elapsed: 180.672 sec.

Received exception from server (version 21.12.3):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Watching task /clickhouse/repl-1s1r/task_queue/ddl/query-0000000003 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 1 unfinished hosts (0 of them are currently active), they are going to execute the query in background. (TIMEOUT_EXCEEDED)

Additional context ClickHouse/ClickHouse#33019
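
For reference, the 180-second limit in the error above is ClickHouse's distributed_ddl_task_timeout setting. Below is a minimal sketch of raising it through the CHI manifest, assuming spec.configuration.profiles is supported by this operator version and that the query runs under the default profile; a longer wait only hides the symptom and does not fix hosts that never pick up the task.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "repl-1s1r"
spec:
  configuration:
    profiles:
      # sketch only: let ON CLUSTER DDL wait up to 600s instead of the default 180s
      default/distributed_ddl_task_timeout: "600"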

I updated the ClickHouse installation spec with replicasUseFQDN: "yes". Now the hostname is rendered as shown below, but the host address is not rendered at all and I am unable to ping the rendered hostname. How can we modify the hostname parameter so that it renders the correct address as per /etc/hosts?

SELECT *
FROM system.clusters

Query id: 28ce0733-8f8c-4782-bb66-02495c32d732

┌─cluster──────────────────────────────────────┬─shard_num─┬─shard_weight─┬─replica_num─┬─host_name───────────────────────────────────────────┬─host_address─┬─port─┬─is_local─┬─user────┬─default_database─┬─errors_count─┬─slowdowns_count─┬─estimated_recovery_time─┐
│ all-replicated                               │         1 │            1 │           1 │ chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local │              │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ all-sharded                                  │         1 │            1 │           1 │ chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local │              │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ replcluster                                  │         1 │            1 │           1 │ chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local │              │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards                      │         1 │            1 │           1 │ 127.0.0.1                                           │ 127.0.0.1    │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards                      │         2 │            1 │           1 │ 127.0.0.2                                           │ 127.0.0.2    │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_internal_replication │         1 │            1 │           1 │ 127.0.0.1                                           │ 127.0.0.1    │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_internal_replication │         2 │            1 │           1 │ 127.0.0.2                                           │ 127.0.0.2    │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_localhost            │         1 │            1 │           1 │ localhost                                           │ ::1          │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_localhost            │         2 │            1 │           1 │ localhost                                           │ ::1          │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_shard_localhost                         │         1 │            1 │           1 │ localhost                                           │ ::1          │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_shard_localhost_secure                  │         1 │            1 │           1 │ localhost                                           │ ::1          │ 9440 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_unavailable_shard                       │         1 │            1 │           1 │ localhost                                           │ ::1          │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_unavailable_shard                       │         2 │            1 │           1 │ localhost                                           │ ::1          │    1 │        0 │ default │                  │            0 │               0 │                       0 │
└──────────────────────────────────────────────┴───────────┴──────────────┴─────────────┴─────────────────────────────────────────────────────┴──────────────┴──────┴──────────┴─────────┴──────────────────┴──────────────┴─────────────────┴─────────────────────────┘

13 rows in set. Elapsed: 0.021 sec.
Slach commented 2 years ago

could you share

kubectl get svc -n ch1 -o wide

and

kubectl get chi -n ch1 repl-1s1r -o yaml
ragsarang commented 2 years ago

Please find the details below:

kubectl get svc -n ch1 -o wide

NAME                            TYPE           CLUSTER-IP                         EXTERNAL-IP   PORT(S)                         AGE   SELECTOR
chi-repl-1s1r-replcluster-0-0   ClusterIP      None                               <none>        8123/TCP,9000/TCP,9009/TCP      25h   clickhouse.altinity.com/app=chop,clickhouse.altinity.com/chi=repl-1s1r,clickhouse.altinity.com/cluster=replcluster,clickhouse.altinity.com/namespace=ch1,clickhouse.altinity.com/replica=0,clickhouse.altinity.com/shard=0
clickhouse-operator-metrics     ClusterIP      xxxx:xxxx:xxx:xxxx:xxxx:xx:0:a8c0   <none>        8888/TCP                        15d   app=clickhouse-operator
clickhouse-repl-1s1r            LoadBalancer   xxxx:xxxx:xxx:xxxx:xxxx:xx:0:e2ed   <pending>     8123:32187/TCP,9000:30449/TCP   28h   clickhouse.altinity.com/app=chop,clickhouse.altinity.com/chi=repl-1s1r,clickhouse.altinity.com/namespace=ch1,clickhouse.altinity.com/ready=yes

kubectl get chi -n ch1 repl-1s1r -o yaml

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"clickhouse.altinity.com/v1","kind":"ClickHouseInstallation","metadata":{"annotations":{},"name":"repl-1s1r","namespace":"ch1"},"spec":{"configuration":{"clusters":[{"layout":{"replicasCount":1,"shardsCount":1},"name":"replcluster","templates":{"podTemplate":"clickhouse-with-volume-template"}}],"zookeeper":{"nodes":[{"host":"zookeeper-0.zookeepers.ch2.svc.uhxxxxxx7.local","port":2181},{"host":"zookeeper-1.zookeepers.ch2.svc.uhxxxxxx7.local","port":2181},{"host":"zookeeper-2.zookeepers.ch2.svc.uhxxxxxx7.local","port":2181}]}},"defaults":{"distributedDDL":{"profile":"default"},"replicasUseFQDN":"yes"},"templates":{"podTemplates":[{"name":"clickhouse-with-volume-template","spec":{"containers":[{"image":"private-docker-registry/clickhouse-server:21.12.3.32.ipv6","name":"clickhouse-pod","resources":{"requests":{"cpu":4,"memory":"32G"}},"volumeMounts":[{"mountPath":"/var/lib/clickhouse","name":"clickhouse-storage-template"}]}],"nodeSelector":{"robin.io/rnodetype":"robin-worker-node"},"tolerations":[{"effect":"NoSchedule","key":"k8s.sssssss.com/worker","operator":"Exists"}]}}],"volumeClaimTemplates":[{"name":"clickhouse-storage-template","spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"50Gi"}}}}]}}}
  creationTimestamp: "2022-01-04T04:38:10Z"
  finalizers:
  - finalizer.clickhouseinstallation.altinity.com
  generation: 11
  managedFields:
  - apiVersion: clickhouse.altinity.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:clusters: {}
          f:zookeeper:
            .: {}
            f:nodes: {}
        f:defaults:
          .: {}
          f:distributedDDL:
            .: {}
            f:profile: {}
          f:replicasUseFQDN: {}
        f:templates: {}
    manager: kubectl
    operation: Update
    time: "2022-01-04T18:48:24Z"
  - apiVersion: clickhouse.altinity.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"finalizer.clickhouseinstallation.altinity.com": {}
      f:spec:
        f:templates:
          f:podTemplates: {}
          f:volumeClaimTemplates: {}
      f:status:
        .: {}
        f:actions: {}
        f:added: {}
        f:clusters: {}
        f:endpoint: {}
        f:error: {}
        f:errors: {}
        f:fqdns: {}
        f:generation: {}
        f:hosts: {}
        f:normalized:
          .: {}
          f:apiVersion: {}
          f:kind: {}
          f:metadata: {}
          f:spec: {}
          f:status: {}
        f:pods: {}
        f:replicas: {}
        f:shards: {}
        f:status: {}
        f:taskID: {}
        f:taskIDsCompleted: {}
        f:taskIDsStarted: {}
        f:version: {}
    manager: clickhouse-operator
    operation: Update
    time: "2022-01-04T18:50:24Z"
  name: repl-1s1r
  namespace: ch1
  resourceVersion: "334779133"
  uid: 5432de60-d78b-4bc3-ac71-909ef8ae899b
spec:
  configuration:
    clusters:
    - layout:
        replicasCount: 1
        shardsCount: 1
      name: replcluster
      templates:
        podTemplate: clickhouse-with-volume-template
    zookeeper:
      nodes:
      - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxx7.local
        port: 2181
      - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxx7.local
        port: 2181
      - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxx7.local
        port: 2181
  defaults:
    distributedDDL:
      profile: default
    replicasUseFQDN: "yes"
  templates:
    podTemplates:
    - name: clickhouse-with-volume-template
      spec:
        containers:
        - image: private-docker-registry/clickhouse-server:21.12.3.32.ipv6
          name: clickhouse-pod
          resources:
            requests:
              cpu: 4
              memory: 32G
          volumeMounts:
          - mountPath: /var/lib/clickhouse
            name: clickhouse-storage-template
        nodeSelector:
          robin.io/rnodetype: robin-worker-node
        tolerations:
        - effect: NoSchedule
          key: k8s.sssssss.com/worker
          operator: Exists
    volumeClaimTemplates:
    - name: clickhouse-storage-template
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
status:
  actions:
  - reconcile completed
  - add CHI to monitoring
  - remove items scheduled for deletion
  - remove items scheduled for deletion
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Reconcile Host 0-0 completed
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Adding tables on shard/host:0/0 cluster:replcluster
  - Update Service ch1/chi-repl-1s1r-replcluster-0-0
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - completed
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - error ignored
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - error ignored
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - failed with error StatefulSet.apps
    "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "Never": supported values: "Always"'
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - failed with error StatefulSet.apps
    "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "OnFailure": supported values: "Always"'
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - failed with error StatefulSet.apps
    "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "OnFailure": supported values: "Always"'
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - error ignored
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - started
  - |-
    Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - failed with error
    ---
    onStatefulSetCreateFailed - stop
    --
    Continue with recreate
  - Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - started
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - reconcile completed
  - add CHI to monitoring
  - remove items scheduled for deletion
  - remove items scheduled for deletion
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Reconcile Host 0-0 completed
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Create Service ch1/chi-repl-1s1r-replcluster-0-0
  - Update ConfigMap ch1/chi-repl-1s1r-deploy-confd-replcluster-0-0
  - Reconcile Host 0-0 started
  - Update ConfigMap ch1/chi-repl-1s1r-common-usersd
  - Update ConfigMap ch1/chi-repl-1s1r-common-configd
  - Update Service ch1/clickhouse-repl-1s1r
  - reconcile started
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - error ignored
  added: 1
  clusters: 1
  endpoint: clickhouse-repl-1s1r.ch1.svc.cluster.local
  error: 'FAILED update: onStatefulSetCreateFailed - ignore'
  errors:
  - 'FAILED update: onStatefulSetCreateFailed - ignore'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'FAILED update: onStatefulSetCreateFailed - ignore'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'FAILED update: StatefulSet.apps "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "Never": supported values: "Always"'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - failed with error StatefulSet.apps
    "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "Never": supported values: "Always"'
  - 'FAILED update: StatefulSet.apps "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "OnFailure": supported values: "Always"'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - failed with error StatefulSet.apps
    "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "OnFailure": supported values: "Always"'
  - 'FAILED update: StatefulSet.apps "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "OnFailure": supported values: "Always"'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'Create StatefulSet ch1/chi-repl-1s1r-replcluster-0-0 - failed with error StatefulSet.apps
    "chi-repl-1s1r-replcluster-0-0" is invalid: spec.template.spec.restartPolicy:
    Unsupported value: "OnFailure": supported values: "Always"'
  - 'FAILED update: onStatefulSetCreateFailed - ignore'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - |-
    Update StatefulSet(ch1/chi-repl-1s1r-replcluster-0-0) - failed with error
    ---
    onStatefulSetCreateFailed - stop
    --
    Continue with recreate
  - 'FAILED to drop replica on host 1-0 with error FAILED connect(http://***:***@chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local:8123/)
    for SQL: SYSTEM DROP REPLICA ''chi-repl-1s1r-replcluster-1-0.ch1.svc.cluster.local'''
  - 'FAILED update: onStatefulSetCreateFailed - ignore'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  - 'FAILED update: onStatefulSetCreateFailed - ignore'
  - 'FAILED to reconcile StatefulSet: chi-repl-1s1r-replcluster-0-0 CHI: repl-1s1r '
  fqdns:
  - chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local
  generation: 11
  hosts: 1
  normalized:
    apiVersion: clickhouse.altinity.com/v1
    kind: ClickHouseInstallation
    metadata:
      creationTimestamp: "2022-01-04T04:38:10Z"
      finalizers:
      - finalizer.clickhouseinstallation.altinity.com
      generation: 11
      managedFields:
      - apiVersion: clickhouse.altinity.com/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"finalizer.clickhouseinstallation.altinity.com": {}
          f:spec:
            f:templates:
              f:volumeClaimTemplates: {}
          f:status:
            .: {}
            f:action: {}
            f:actions: {}
            f:added: {}
            f:clusters: {}
            f:endpoint: {}
            f:error: {}
            f:errors: {}
            f:fqdns: {}
            f:generation: {}
            f:hosts: {}
            f:normalized:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:metadata: {}
              f:spec: {}
              f:status: {}
            f:pods: {}
            f:replicas: {}
            f:shards: {}
            f:status: {}
            f:taskID: {}
            f:taskIDsCompleted: {}
            f:taskIDsStarted: {}
            f:version: {}
        manager: clickhouse-operator
        operation: Update
        time: "2022-01-04T16:39:09Z"
      - apiVersion: clickhouse.altinity.com/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:spec:
            .: {}
            f:configuration:
              .: {}
              f:clusters: {}
              f:zookeeper:
                .: {}
                f:nodes: {}
            f:defaults:
              .: {}
              f:distributedDDL:
                .: {}
                f:profile: {}
              f:replicasUseFQDN: {}
            f:templates:
              .: {}
              f:podTemplates: {}
              f:volumeClaimTemplates: {}
        manager: kubectl
        operation: Update
        time: "2022-01-04T18:48:24Z"
      name: repl-1s1r
      namespace: ch1
      resourceVersion: "334777111"
      uid: 5432de60-d78b-4bc3-ac71-909ef8ae899b
    spec:
      configuration:
        clusters:
        - layout:
            replicas:
            - name: "0"
              shards:
              - httpPort: 8123
                interserverHTTPPort: 9009
                name: 0-0
                tcpPort: 9000
                templates:
                  podTemplate: clickhouse-with-volume-template
              shardsCount: 1
              templates:
                podTemplate: clickhouse-with-volume-template
            replicasCount: 1
            shards:
            - internalReplication: "false"
              name: "0"
              replicas:
              - httpPort: 8123
                interserverHTTPPort: 9009
                name: 0-0
                tcpPort: 9000
                templates:
                  podTemplate: clickhouse-with-volume-template
              replicasCount: 1
              templates:
                podTemplate: clickhouse-with-volume-template
            shardsCount: 1
          name: replcluster
          templates:
            podTemplate: clickhouse-with-volume-template
          zookeeper:
            nodes:
            - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxx7.local
              port: 2181
            - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxx7.local
              port: 2181
            - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxx7.local
              port: 2181
        users:
          default/networks/host_regexp: (chi-repl-1s1r-[^.]+\d+-\d+|clickhouse\-repl-1s1r)\.ch1\.svc\.cluster\.local$
          default/networks/ip:
          - ::1
          - 127.0.0.1
          default/profile: default
          default/quota: default
        zookeeper:
          nodes:
          - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxx7.local
            port: 2181
          - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxx7.local
            port: 2181
          - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxx7.local
            port: 2181
      defaults:
        distributedDDL:
          profile: default
        replicasUseFQDN: "true"
      reconciling:
        cleanup:
          reconcileFailedObjects:
            configMap: Retain
            pvc: Retain
            service: Retain
            statefulSet: Retain
          unknownObjects:
            configMap: Delete
            pvc: Delete
            service: Delete
            statefulSet: Delete
        configMapPropagationTimeout: 60
        policy: unspecified
      stop: "false"
      taskID: 9f4049b4-dab5-4168-9198-3b8612a7fc79
      templates:
        PodTemplatesIndex: {}
        VolumeClaimTemplatesIndex: {}
        podTemplates:
        - metadata:
            creationTimestamp: null
          name: clickhouse-with-volume-template
          spec:
            containers:
            - image: private-docker-registry/clickhouse-server:21.12.3.32.ipv6
              name: clickhouse-pod
              resources:
                requests:
                  cpu: "4"
                  memory: 32G
              volumeMounts:
              - mountPath: /var/lib/clickhouse
                name: clickhouse-storage-template
            nodeSelector:
              robin.io/rnodetype: robin-worker-node
            tolerations:
            - effect: NoSchedule
              key: k8s.sssssss.com/worker
              operator: Exists
          zone: {}
        volumeClaimTemplates:
        - metadata:
            creationTimestamp: null
          name: clickhouse-storage-template
          reclaimPolicy: Delete
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
      templating:
        policy: manual
      troubleshoot: "false"
    status:
      clusters: 0
      hosts: 0
      replicas: 0
      shards: 0
      status: ""
  pods:
  - chi-repl-1s1r-replcluster-0-0-0
  replicas: 0
  shards: 1
  status: Completed
  taskID: 9f4049b4-dab5-4168-9198-3b8612a7fc79
  taskIDsCompleted:
  - 9f4049b4-dab5-4168-9198-3b8612a7fc79
  - 2ddcd52d-fbb3-4ab6-9d05-b17d4a9b688d
  taskIDsStarted:
  - 9f4049b4-dab5-4168-9198-3b8612a7fc79
  - 49353693-ef6f-45ae-b24f-3c82f9ca779e
  - 567a480c-9668-449d-b144-a26fc850a37d
  - 5c9adbee-97b1-43d9-a721-8b4104e4d9d1
  - e6039ff5-74c2-415d-b71e-320d62ff158c
  - 79008ff8-fbe7-4e04-809a-bb61f29853cf
  - b44df750-344a-426a-8e6e-f750044ae19c
  - cbe3cb8b-7681-4c9e-a935-2aa1ab02eda7
  - 2ddcd52d-fbb3-4ab6-9d05-b17d4a9b688d
  - c691d200-6e7a-4140-88ae-a031404b6eca
  - b9f98bce-8eb8-41c9-a8c1-34383e08b771
  version: 0.18.0

obfuscated some values due to confidential info

Slach commented 2 years ago

chi-repl-1s1r-replcluster-0-0 ClusterIP None

This does not look good; the other ClusterIP-type services have an IPv6 address.

could you share

kubectl get endpoints -n ch1 

and

kubectl get svc -n ch1 chi-repl-1s1r-replcluster-0-0 -o yaml

and

kubectl describe svc -n ch1 chi-repl-1s1r-replcluster-0-0 
ragsarang commented 2 years ago

Regarding "the other ClusterIP-type services have an IPv6 address": yes, our Kubernetes cluster has only the IPv6 protocol.

kubectl get endpoints -n ch1

NAME                            ENDPOINTS                                                                                                                 AGE
chi-repl-1s1r-replcluster-0-0   [xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:8123,[xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:9009,[xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:9000   30h
clickhouse-operator-metrics     [xxxx:xxxx:xxx:xxxx:xxxx:xx:0:47a1]:8888                                                                                   16d
clickhouse-repl-1s1r            [xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:8123,[xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:9000                                           33h
kubectl get svc -n ch1 chi-repl-1s1r-replcluster-0-0 -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-01-04T07:49:20Z"
  labels:
    clickhouse.altinity.com/Service: host
    clickhouse.altinity.com/app: chop
    clickhouse.altinity.com/chi: repl-1s1r
    clickhouse.altinity.com/cluster: replcluster
    clickhouse.altinity.com/namespace: ch1
    clickhouse.altinity.com/object-version: c770326e6ee81e25b1a7b91bb9c9100c00bd7d41
    clickhouse.altinity.com/replica: "0"
    clickhouse.altinity.com/shard: "0"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:clickhouse.altinity.com/Service: {}
          f:clickhouse.altinity.com/app: {}
          f:clickhouse.altinity.com/chi: {}
          f:clickhouse.altinity.com/cluster: {}
          f:clickhouse.altinity.com/namespace: {}
          f:clickhouse.altinity.com/object-version: {}
          f:clickhouse.altinity.com/replica: {}
          f:clickhouse.altinity.com/shard: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"5432de60-d78b-4bc3-ac71-909ef8ae899b"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:clusterIP: {}
        f:ports:
          .: {}
          k:{"port":8123,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
          k:{"port":9000,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
          k:{"port":9009,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:publishNotReadyAddresses: {}
        f:selector:
          .: {}
          f:clickhouse.altinity.com/app: {}
          f:clickhouse.altinity.com/chi: {}
          f:clickhouse.altinity.com/cluster: {}
          f:clickhouse.altinity.com/namespace: {}
          f:clickhouse.altinity.com/replica: {}
          f:clickhouse.altinity.com/shard: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: clickhouse-operator
    operation: Update
    time: "2022-01-04T07:49:20Z"
  name: chi-repl-1s1r-replcluster-0-0
  namespace: ch1
  ownerReferences:
  - apiVersion: clickhouse.altinity.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClickHouseInstallation
    name: repl-1s1r
    uid: 5432de60-d78b-4bc3-ac71-909ef8ae899b
  resourceVersion: "334136070"
  uid: cdb0130e-03c8-449e-ad97-f2504c47efd4
spec:
  clusterIP: None
  clusterIPs:
  - None
  ports:
  - name: http
    port: 8123
    protocol: TCP
    targetPort: 8123
  - name: tcp
    port: 9000
    protocol: TCP
    targetPort: 9000
  - name: interserver
    port: 9009
    protocol: TCP
    targetPort: 9009
  publishNotReadyAddresses: true
  selector:
    clickhouse.altinity.com/app: chop
    clickhouse.altinity.com/chi: repl-1s1r
    clickhouse.altinity.com/cluster: replcluster
    clickhouse.altinity.com/namespace: ch1
    clickhouse.altinity.com/replica: "0"
    clickhouse.altinity.com/shard: "0"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}


kubectl describe svc -n ch1 chi-repl-1s1r-replcluster-0-0 
Name:              chi-repl-1s1r-replcluster-0-0
Namespace:         ch1
Labels:            clickhouse.altinity.com/Service=host
                   clickhouse.altinity.com/app=chop
                   clickhouse.altinity.com/chi=repl-1s1r
                   clickhouse.altinity.com/cluster=replcluster
                   clickhouse.altinity.com/namespace=ch1
                   clickhouse.altinity.com/object-version=c770326e6ee81e25b1a7b91bb9c9100c00bd7d41
                   clickhouse.altinity.com/replica=0
                   clickhouse.altinity.com/shard=0
Annotations:       <none>
Selector:          clickhouse.altinity.com/app=chop,clickhouse.altinity.com/chi=repl-1s1r,clickhouse.altinity.com/cluster=replcluster,clickhouse.altinity.com/namespace=ch1,clickhouse.altinity.com/replica=0,clickhouse.altinity.com/shard=0
Type:              ClusterIP
IP:                None
Port:              http  8123/TCP
TargetPort:        8123/TCP
Endpoints:         [xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:8123
Port:              tcp  9000/TCP
TargetPort:        9000/TCP
Endpoints:         [xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:9000
Port:              interserver  9009/TCP
TargetPort:        9009/TCP
Endpoints:         [xxxx:xxxx:xxx:xxxx:xxxx:xx:0:4e88]:9009
Session Affinity:  None
Events:            <none>
ragsarang commented 2 years ago

Additional info: I updated the clickhouse-operator YAML parameter chConfigNetworksHostRegexpTemplate:

chConfigNetworksHostRegexpTemplate: "(chi-{chi}-[^.]+\\d+-\\d+|clickhouse\\-{chi})\\.{namespace}\\.svc\\.uhxxxxxx\\.local$"

This is because our cluster has a specific clusterDomain, "uhxxxxxx.local". However, the common-configd ConfigMap still uses cluster.local in the chop-generated-remote_servers.xml section.

As a result, the hostnames in system.clusters still contain cluster.local, and none of them can be resolved.
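
For context, this parameter lives in the operator's own configuration file (config.yaml), not in the CHI. A minimal sketch of the override is shown below, assuming the ConfigMap name and namespace used by the standard clickhouse-operator install bundle (etc-clickhouse-operator-files in kube-system); the names may differ in a customized install.

apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-clickhouse-operator-files   # assumed name from the standard install bundle
  namespace: kube-system                # namespace where clickhouse-operator runs
data:
  config.yaml: |
    # template used for the generated default/networks/host_regexp users setting
    chConfigNetworksHostRegexpTemplate: "(chi-{chi}-[^.]+\\d+-\\d+|clickhouse\\-{chi})\\.{namespace}\\.svc\\.uhxxxxxx\\.local$"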

Slach commented 2 years ago

After updating chConfigNetworksHostRegexpTemplate you should restart the clickhouse-operator deployment manually; this lets us control when configuration changes to clickhouse-operator take effect.

Then re-apply the kind: ClickHouseInstallation manifest (change something in the manifest).

ragsarang commented 2 years ago

Yes, I tried that. I created a new operator installation with the chConfigNetworksHostRegexpTemplate change, but the endpoints still use cluster.local.

I even added this before deploying: default/networks/host_regexp: (chi-repl-1s1r-[^.]+\d+-\d+|clickhouse-repl-1s1r).ch1.svc.uhxxxxxx.local$

$ kubectl get chi repl-1s1r -n ch -oyaml | grep local
      {"apiVersion":"clickhouse.altinity.com/v1","kind":"ClickHouseInstallation","metadata":{"annotations":{},"name":"repl-1s1r","namespace":"ch"},"spec":{"configuration":{"clusters":[{"layout":{"replicasCount":2,"shardsCount":2},"name":"replcluster","templates":{"podTemplate":"clickhouse-with-volume-template"}}],"users":{"default/networks/host_regexp":"(chi-repl-1s1r-[^.]+\\d+-\\d+|clickhouse\\-repl-1s1r)\\.ch1\\.svc\\.uhxxxxxxx\\.local$"},"zookeeper":{"nodes":[{"host":"zookeeper-0.zookeepers.ch2.svc.uhxxxxxxx.local","port":2181},{"host":"zookeeper-1.zookeepers.ch2.svc.uhxxxxxxx.local","port":2181},{"host":"zookeeper-2.zookeepers.ch2.svc.uhxxxxxxx.local","port":2181}]}},"defaults":{"distributedDDL":{"profile":"default"},"replicasUseFQDN":"yes"},"templates":{"podTemplates":[{"name":"clickhouse-with-volume-template","spec":{"containers":[{"image":"private-docker-registry/clickhouse-server:21.12.3.32.ipv6","name":"clickhouse-pod","resources":{"requests":{"cpu":2,"memory":"32G"}},"volumeMounts":[{"mountPath":"/var/lib/clickhouse","name":"clickhouse-storage-template"}]}],"nodeSelector":{"robin.io/rnodetype":"robin-worker-node"},"tolerations":[{"effect":"NoSchedule","key":"k8s.ssssssss.com/worker","operator":"Exists"}]}}],"volumeClaimTemplates":[{"name":"clickhouse-storage-template","spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"50Gi"}}}}]}}}
      default/networks/host_regexp: (chi-repl-1s1r-[^.]+\d+-\d+|clickhouse\-repl-1s1r)\.ch1\.svc\.uhxxxxxxx\.local$
      - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxxx.local
      - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxxx.local
      - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxxx.local
        - image: private-docker-registry/clickhouse-server:21.12.3.32.ipv6
  endpoint: clickhouse-repl-1s1r.ch.svc.cluster.local
  - chi-repl-1s1r-replcluster-0-0.ch.svc.cluster.local
  - chi-repl-1s1r-replcluster-0-1.ch.svc.cluster.local
  - chi-repl-1s1r-replcluster-1-0.ch.svc.cluster.local
  - chi-repl-1s1r-replcluster-1-1.ch.svc.cluster.local
            - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxxx.local
            - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxxx.local
            - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxxx.local
          default/networks/host_regexp: (chi-repl-1s1r-[^.]+\d+-\d+|clickhouse\-repl-1s1r)\.ch1\.svc\.uhxxxxxxx\.local$
          - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxxx.local
          - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxxx.local
          - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxxx.local
            - image: private-docker-registry/clickhouse-server:21.12.3.32.ipv6

This is the CHI file used for the deployment:

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "repl-1s1r"
spec:
  defaults:
    replicasUseFQDN: "yes"
    distributedDDL:
      profile: default
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper-0.zookeepers.ch2.svc.uhxxxxxxx.local
          port: 2181
        - host: zookeeper-1.zookeepers.ch2.svc.uhxxxxxxx.local
          port: 2181
        - host: zookeeper-2.zookeepers.ch2.svc.uhxxxxxxx.local
          port: 2181
    clusters:
      - name: replcluster
        templates:
          podTemplate: clickhouse-with-volume-template
        layout:
          shardsCount: 2
          replicasCount: 2
    users:
      default/networks/host_regexp: (chi-repl-1s1r-[^.]+\d+-\d+|clickhouse\-repl-1s1r)\.ch1\.svc\.uhxxxxxxx\.local$
  templates:
    podTemplates:
      - name: clickhouse-with-volume-template
        spec:
          containers:
            - name: clickhouse-pod
              image: private-docker-registry/clickhouse-server:21.12.3.32.ipv6
              volumeMounts:
                - name: clickhouse-storage-template
                  mountPath: /var/lib/clickhouse
              resources:
                requests:
                  cpu: 2
                  memory: 32G
          #restartPolicy: Always
          nodeSelector:
            robin.io/rnodetype: "robin-worker-node"
          #  robin.io/rnodetype: "robin-master-node"
          tolerations:
          - key: "k8s.ssssssss.com/worker"
            operator: "Exists"
            effect: "NoSchedule"

    volumeClaimTemplates:
      - name: clickhouse-storage-template
        spec:
          # no storageClassName - means use default storageClassName
          #storageClassName: default
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
Slach commented 2 years ago

Change the kind: ClickHouseInstallation manifest and add:

spec:
  namespaceDomainPattern: "%s.svc.uhxxxxxxx.local"
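
For clarity on where that field goes: namespaceDomainPattern sits at the top level of spec, and the %s placeholder is substituted with the CHI's namespace, so the generated FQDNs use the custom cluster domain instead of cluster.local. A minimal sketch of how it merges into the manifest above, keeping only the fields relevant here:

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "repl-1s1r"
spec:
  # %s is replaced with the namespace, e.g. namespace ch1 yields ch1.svc.uhxxxxxxx.local
  namespaceDomainPattern: "%s.svc.uhxxxxxxx.local"
  defaults:
    replicasUseFQDN: "yes"
    distributedDDL:
      profile: default
  # configuration and templates sections stay unchanged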
ragsarang commented 2 years ago

Change the kind: ClickHouseInstallation manifest and add:

spec:
  namespaceDomainPattern: "%s.svc.uhxxxxxxx.local"

This might be the exact configuration I was looking for to set the cluster domain. I will try it out and update you.

ragsarang commented 2 years ago

I added namespaceDomainPattern to the CHI YAML and recreated the entire setup, from the operator to the CHI. This parameter has fixed the hostname rendering in system.clusters. However, the base problem still exists: distributed DDL times out (error log below). Based on the logs, it seems it is still getting cluster.local from somewhere.

SELECT *
FROM system.clusters

Query id: 1f7a32c6-2e1a-4589-8acb-74b5a522ea76

┌─cluster──────────────────────────────────────┬─shard_num─┬─shard_weight─┬─replica_num─┬─host_name─────────────────────────────────────────────┬─host_address─────────────────────┬─port─┬─is_local─┬─user────┬─default_database─┬─errors_count─┬─slowdowns_count─┬─estimated_recovery_time─┐
│ all-replicated                               │         1 │            1 │           1 │ chi-repl-1s1r-replcluster-0-0.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:484d │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ all-replicated                               │         1 │            1 │           2 │ chi-repl-1s1r-replcluster-0-1.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4e0f │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ all-replicated                               │         1 │            1 │           3 │ chi-repl-1s1r-replcluster-1-0.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4949 │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ all-replicated                               │         1 │            1 │           4 │ chi-repl-1s1r-replcluster-1-1.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4883 │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ all-sharded                                  │         1 │            1 │           1 │ chi-repl-1s1r-replcluster-0-0.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:484d │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ all-sharded                                  │         2 │            1 │           1 │ chi-repl-1s1r-replcluster-0-1.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4e0f │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ all-sharded                                  │         3 │            1 │           1 │ chi-repl-1s1r-replcluster-1-0.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4949 │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ all-sharded                                  │         4 │            1 │           1 │ chi-repl-1s1r-replcluster-1-1.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4883 │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ replcluster                                  │         1 │            1 │           1 │ chi-repl-1s1r-replcluster-0-0.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:484d │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ replcluster                                  │         1 │            1 │           2 │ chi-repl-1s1r-replcluster-0-1.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4e0f │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ replcluster                                  │         2 │            1 │           1 │ chi-repl-1s1r-replcluster-1-0.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4949 │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ replcluster                                  │         2 │            1 │           2 │ chi-repl-1s1r-replcluster-1-1.ch1.svc.uhxxxxxxx.local │ 240b:c0e0:104:544d:b464:2:0:4883 │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards                      │         1 │            1 │           1 │ 127.0.0.1                                             │ 127.0.0.1                        │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards                      │         2 │            1 │           1 │ 127.0.0.2                                             │ 127.0.0.2                        │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_internal_replication │         1 │            1 │           1 │ 127.0.0.1                                             │ 127.0.0.1                        │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_internal_replication │         2 │            1 │           1 │ 127.0.0.2                                             │ 127.0.0.2                        │ 9000 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_localhost            │         1 │            1 │           1 │ localhost                                             │ ::1                              │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_cluster_two_shards_localhost            │         2 │            1 │           1 │ localhost                                             │ ::1                              │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_shard_localhost                         │         1 │            1 │           1 │ localhost                                             │ ::1                              │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_shard_localhost_secure                  │         1 │            1 │           1 │ localhost                                             │ ::1                              │ 9440 │        0 │ default │                  │            0 │               0 │                       0 │
│ test_unavailable_shard                       │         1 │            1 │           1 │ localhost                                             │ ::1                              │ 9000 │        1 │ default │                  │            0 │               0 │                       0 │
│ test_unavailable_shard                       │         2 │            1 │           1 │ localhost                                             │ ::1                              │    1 │        0 │ default │                  │            0 │               0 │                       0 │
└──────────────────────────────────────────────┴───────────┴──────────────┴─────────────┴───────────────────────────────────────────────────────┴──────────────────────────────────┴──────┴──────────┴─────────┴──────────────────┴──────────────┴─────────────────┴─────────────────────────┘

22 rows in set. Elapsed: 0.002 sec.
$ kubectl logs pod/chi-repl-1s1r-replcluster-0-0-0 -n ch1 | tail -100 | grep DNS
2022.01.06 19:39:51.969738 [ 246 ] {} <Error> DNSResolver: Cannot resolve host (chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local), error 0: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local.
2022.01.06 19:39:51.969962 [ 246 ] {} <Error> DDLWorker: Unexpected error, will try to restart main thread:: Code: 198. DB::Exception: Not found address of host: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local. (DNS_ERROR), Stack trace (when copying this message, always include the lines below):
3. DB::DNSResolver::resolveAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned short) @ 0xa2c8ea3 in /usr/bin/clickhouse
2022.01.06 19:39:56.986005 [ 246 ] {} <Error> DNSResolver: Cannot resolve host (chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local), error 0: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local.
2022.01.06 19:39:56.986224 [ 246 ] {} <Error> DDLWorker: Unexpected error, will try to restart main thread:: Code: 198. DB::Exception: Not found address of host: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local. (DNS_ERROR), Stack trace (when copying this message, always include the lines below):
3. DB::DNSResolver::resolveAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned short) @ 0xa2c8ea3 in /usr/bin/clickhouse
(version 21.12.3.32 (official build))
2022.01.06 19:59:21.319037 [ 172 ] {} <Debug> DNSResolver: Updated DNS cache
2022.01.06 19:59:21.661099 [ 108 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:21.690797 [ 111 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:22.253401 [ 110 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:22.936695 [ 85 ] {} <Debug> system.session_log (75cebf14-fe4e-4992-b5ce-bf14fe4e4992): Removing part from filesystem 202201_1_105_21
2022.01.06 19:59:22.937126 [ 85 ] {} <Debug> system.session_log (75cebf14-fe4e-4992-b5ce-bf14fe4e4992): Removing part from filesystem 202201_106_106_0
2022.01.06 19:59:22.937385 [ 85 ] {} <Debug> system.session_log (75cebf14-fe4e-4992-b5ce-bf14fe4e4992): Removing part from filesystem 202201_107_107_0
2022.01.06 19:59:22.937632 [ 85 ] {} <Debug> system.session_log (75cebf14-fe4e-4992-b5ce-bf14fe4e4992): Removing part from filesystem 202201_108_108_0
2022.01.06 19:59:22.937854 [ 85 ] {} <Debug> system.session_log (75cebf14-fe4e-4992-b5ce-bf14fe4e4992): Removing part from filesystem 202201_109_109_0
2022.01.06 19:59:22.938103 [ 85 ] {} <Debug> system.session_log (75cebf14-fe4e-4992-b5ce-bf14fe4e4992): Removing part from filesystem 202201_110_110_0
2022.01.06 19:59:24.572079 [ 106 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:26.237852 [ 99 ] {02d4001d-75e8-46a7-99a7-cba58f5ba31e} <Error> executeQuery: Code: 159. DB::Exception: Watching task /clickhouse/repl-1s1r/task_queue/ddl/query-0000000013 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 4 unfinished hosts (0 of them are currently active), they are going to execute the query in background. (TIMEOUT_EXCEEDED) (version 21.12.3.32 (official build)) (from [::1]:35892) (in query: CREATE DATABASE test ON CLUSTER '{cluster}';), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa21959a in /usr/bin/clickhouse
1. DB::Exception::Exception<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&) @ 0x13526564 in /usr/bin/clickhouse
2. ? @ 0x13522da5 in /usr/bin/clickhouse
3. DB::DDLQueryStatusSource::generate() @ 0x1352121e in /usr/bin/clickhouse
4. DB::ISource::tryGenerate() @ 0x14024515 in /usr/bin/clickhouse
5. DB::ISource::work() @ 0x140240da in /usr/bin/clickhouse
6. DB::SourceWithProgress::work() @ 0x1423d742 in /usr/bin/clickhouse
7. DB::ExecutionThreadContext::executeTask() @ 0x14043ae3 in /usr/bin/clickhouse
8. DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*) @ 0x1403835e in /usr/bin/clickhouse
9. DB::PipelineExecutor::executeImpl(unsigned long) @ 0x140371a9 in /usr/bin/clickhouse
10. DB::PipelineExecutor::execute(unsigned long) @ 0x14036eb8 in /usr/bin/clickhouse
11. ? @ 0x14047607 in /usr/bin/clickhouse
12. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xa25a3b7 in /usr/bin/clickhouse
13. ? @ 0xa25ddbd in /usr/bin/clickhouse
14. ? @ 0x7f273cf8e609 in ?
15. clone @ 0x7f273ceb5293 in ?

2022.01.06 19:59:26.237956 [ 99 ] {02d4001d-75e8-46a7-99a7-cba58f5ba31e} <Error> TCPHandler: Code: 159. DB::Exception: Watching task /clickhouse/repl-1s1r/task_queue/ddl/query-0000000013 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 4 unfinished hosts (0 of them are currently active), they are going to execute the query in background. (TIMEOUT_EXCEEDED), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa21959a in /usr/bin/clickhouse
1. DB::Exception::Exception<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&) @ 0x13526564 in /usr/bin/clickhouse
2. ? @ 0x13522da5 in /usr/bin/clickhouse
3. DB::DDLQueryStatusSource::generate() @ 0x1352121e in /usr/bin/clickhouse
4. DB::ISource::tryGenerate() @ 0x14024515 in /usr/bin/clickhouse
5. DB::ISource::work() @ 0x140240da in /usr/bin/clickhouse
6. DB::SourceWithProgress::work() @ 0x1423d742 in /usr/bin/clickhouse
7. DB::ExecutionThreadContext::executeTask() @ 0x14043ae3 in /usr/bin/clickhouse
8. DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*) @ 0x1403835e in /usr/bin/clickhouse
9. DB::PipelineExecutor::executeImpl(unsigned long) @ 0x140371a9 in /usr/bin/clickhouse
10. DB::PipelineExecutor::execute(unsigned long) @ 0x14036eb8 in /usr/bin/clickhouse
11. ? @ 0x14047607 in /usr/bin/clickhouse
12. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xa25a3b7 in /usr/bin/clickhouse
13. ? @ 0xa25ddbd in /usr/bin/clickhouse
14. ? @ 0x7f273cf8e609 in ?
15. clone @ 0x7f273ceb5293 in ?

2022.01.06 19:59:26.238093 [ 99 ] {02d4001d-75e8-46a7-99a7-cba58f5ba31e} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
2022.01.06 19:59:26.238133 [ 99 ] {} <Debug> TCPHandler: Processed in 180.760750757 sec.
2022.01.06 19:59:26.299626 [ 246 ] {} <Debug> DDLWorker: Initializing DDLWorker thread
2022.01.06 19:59:26.307408 [ 246 ] {} <Debug> DDLWorker: Initialized DDLWorker thread
2022.01.06 19:59:26.307555 [ 246 ] {} <Debug> DDLWorker: Scheduling tasks
2022.01.06 19:59:26.308295 [ 246 ] {} <Debug> DDLWorker: Will schedule 14 tasks starting from query-0000000000
2022.01.06 19:59:26.312075 [ 246 ] {} <Debug> DDLWorker: Will not execute task query-0000000000: Task has been already processed
2022.01.06 19:59:26.321915 [ 246 ] {} <Error> DNSResolver: Cannot resolve host (chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local), error 0: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local.
2022.01.06 19:59:26.322200 [ 246 ] {} <Error> DDLWorker: Unexpected error, will try to restart main thread:: Code: 198. DB::Exception: Not found address of host: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local. (DNS_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa21959a in /usr/bin/clickhouse
1. ? @ 0xa2c77d1 in /usr/bin/clickhouse
2. ? @ 0xa2c7fa2 in /usr/bin/clickhouse
3. DB::DNSResolver::resolveAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned short) @ 0xa2c8ea3 in /usr/bin/clickhouse
4. DB::HostID::isLocalAddress(unsigned short) const @ 0x12cff40b in /usr/bin/clickhouse
5. DB::DDLTask::findCurrentHostID(std::__1::shared_ptr<DB::Context const>, Poco::Logger*) @ 0x12d01f81 in /usr/bin/clickhouse
6. DB::DDLWorker::initAndCheckTask(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, std::__1::shared_ptr<zkutil::ZooKeeper> const&) @ 0x12d0ba46 in /usr/bin/clickhouse
7. DB::DDLWorker::scheduleTasks(bool) @ 0x12d0f106 in /usr/bin/clickhouse
8. DB::DDLWorker::runMainThread() @ 0x12d091e5 in /usr/bin/clickhouse
9. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0x12d1d9d7 in /usr/bin/clickhouse
10. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xa25a3b7 in /usr/bin/clickhouse
11. ? @ 0xa25ddbd in /usr/bin/clickhouse
12. ? @ 0x7f273cf8e609 in ?
13. clone @ 0x7f273ceb5293 in ?
 (version 21.12.3.32 (official build))
2022.01.06 19:59:26.322243 [ 246 ] {} <Information> DDLWorker: Cleaned DDLWorker state
2022.01.06 19:59:28.727337 [ 102 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:29.193515 [ 111 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:29.260263 [ 110 ] {} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 46.48 GiB.
2022.01.06 19:59:31.322339 [ 246 ] {} <Debug> DDLWorker: Initializing DDLWorker thread
2022.01.06 19:59:31.330814 [ 246 ] {} <Debug> DDLWorker: Initialized DDLWorker thread
2022.01.06 19:59:31.330851 [ 246 ] {} <Debug> DDLWorker: Scheduling tasks
2022.01.06 19:59:31.331512 [ 246 ] {} <Debug> DDLWorker: Will schedule 14 tasks starting from query-0000000000
2022.01.06 19:59:31.334344 [ 246 ] {} <Debug> DDLWorker: Will not execute task query-0000000000: Task has been already processed
2022.01.06 19:59:31.343967 [ 246 ] {} <Error> DNSResolver: Cannot resolve host (chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local), error 0: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local.
2022.01.06 19:59:31.344184 [ 246 ] {} <Error> DDLWorker: Unexpected error, will try to restart main thread:: Code: 198. DB::Exception: Not found address of host: chi-repl-1s1r-replcluster-0-0.ch1.svc.cluster.local. (DNS_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa21959a in /usr/bin/clickhouse
1. ? @ 0xa2c77d1 in /usr/bin/clickhouse
2. ? @ 0xa2c7fa2 in /usr/bin/clickhouse
3. DB::DNSResolver::resolveAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned short) @ 0xa2c8ea3 in /usr/bin/clickhouse
4. DB::HostID::isLocalAddress(unsigned short) const @ 0x12cff40b in /usr/bin/clickhouse
5. DB::DDLTask::findCurrentHostID(std::__1::shared_ptr<DB::Context const>, Poco::Logger*) @ 0x12d01f81 in /usr/bin/clickhouse
6. DB::DDLWorker::initAndCheckTask(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, std::__1::shared_ptr<zkutil::ZooKeeper> const&) @ 0x12d0ba46 in /usr/bin/clickhouse
7. DB::DDLWorker::scheduleTasks(bool) @ 0x12d0f106 in /usr/bin/clickhouse
8. DB::DDLWorker::runMainThread() @ 0x12d091e5 in /usr/bin/clickhouse
9. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0x12d1d9d7 in /usr/bin/clickhouse
10. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xa25a3b7 in /usr/bin/clickhouse
11. ? @ 0xa25ddbd in /usr/bin/clickhouse
12. ? @ 0x7f273cf8e609 in ?
13. clone @ 0x7f273ceb5293 in ?
 (version 21.12.3.32 (official build))
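
If it helps to see what is stuck, a minimal check from clickhouse-client (using the DDL queue path from the error above; the path is an assumption taken from the log, adjust it to your installation) is:

```sql
-- List queued distributed DDL tasks and the host lists they were created with.
-- system.zookeeper requires an exact path condition in WHERE.
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/repl-1s1r/task_queue/ddl'
ORDER BY name;
```

If the stored host list of the unfinished query-… entries still references names that no longer resolve (as in the DNS errors in the log), those tasks will stay unfinished.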
Slach commented 2 years ago

Did you destroy / recreate ZooKeeper after reinstalling your CHI manifest?

ragsarang commented 2 years ago

I deleted the ZooKeeper pods after reinstalling the CHI manifest. The StatefulSet recreated the ZooKeeper pods, but the ClickHouse client still returns the same error.
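
Since the failure is a DNS_ERROR and ClickHouse caches resolution results, one hedged thing to try after the pods come back (in addition to checking that the per-replica service exists) is to flush the server's internal DNS cache on the affected replica:

```sql
-- Forces ClickHouse to re-resolve hostnames instead of using cached (failed) lookups.
SYSTEM DROP DNS CACHE;
```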

dabula-s commented 2 years ago

Any updates?

SolydBoy commented 1 year ago

Any updates here? I'm facing the same problem.

Slach commented 1 year ago

@SolydBoy which problem? Do you have a custom cluster domain, or can you simply not run ON CLUSTER queries in a clickhouse-operator-managed ClickHouse?

Mizoguchee commented 1 month ago

@SolydBoy @ragsarang I'm facing the same issue. Did you get this fixed by any chance?

Slach commented 1 month ago

@Mizoguchee could you explain which problem you have?

Mizoguchee commented 1 month ago

@Slach I have set up a 3-shard cluster where each shard has 2 replicas. It was working when I set it up initially. I had to redo these instances, and now I'm stuck.

CREATE DATABASE IF NOT EXISTS newDB ON CLUSTER clickhouse_cluster;

Received exception from server (version 24.8.2): Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/task_queue/ddl/query-0000000009 is not finished on 3 of 6 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 900.089174443 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)

I increased the timeout to 900 on all three servers, but no luck.
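
For reference, distributed_ddl_task_timeout is a query-level setting applied on the node where the ON CLUSTER statement is issued (the 900-second wait in the error above suggests the increase did take effect). A minimal sketch of raising it for a single session, reusing the statement from above:

```sql
-- Only delays the TIMEOUT_EXCEEDED error; the unfinished hosts still
-- have to pick up and execute the queued DDL task.
SET distributed_ddl_task_timeout = 900;
CREATE DATABASE IF NOT EXISTS newDB ON CLUSTER clickhouse_cluster;
```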

The weird part is that it works when each shard has a single replica and all replicas are on port 9000 on their respective hosts, whereas it times out when each shard has multiple replicas and the replicas are on different ports, like 9000 and 9999, on each server.

I just don't know what happened; it was working fine the day I configured the multiple shards and replicas.

Below is my current setup

[screenshot of the cluster setup]


<remote_servers>
    <clickhouse_cluster>
        <shard>
            <replica>
                <host>100.10.0.51</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>100.10.0.207</host>
                <port>9999</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>100.10.0.207</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>100.10.0.200</host>
                <port>9999</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>100.10.0.200</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>100.10.0.51</host>
                <port>9999</port>
            </replica>
        </shard>
    </clickhouse_cluster>
</remote_servers>
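
With this layout each physical host runs two ClickHouse instances (one on 9000, one on 9999), and distributed DDL relies on every instance recognizing exactly one (host, port) pair in the cluster definition as itself. A minimal sanity check to run on each of the six instances (a sketch, assuming the tcpPort() function is available in your 24.8 build):

```sql
-- this_port must equal the <port> of the one <replica> entry that is meant
-- to be this instance (9000 or 9999); the configured <host> IP must resolve
-- to a local interface of this machine, otherwise is_local detection fails.
SELECT hostName() AS this_host, tcpPort() AS this_port;
```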
Slach commented 1 month ago

@Mizoguchee are you sure your ClickHouse is managed by clickhouse-operator? Could you share the output of `kubectl get chi -n <your-namespace> <your-chi-name> -o yaml`?

Mizoguchee commented 1 month ago

@Slach Sorry, I'm using it on VMs, not on K8s. Below is my sample ClickHouse Keeper config:

[screenshot of the ClickHouse Keeper config]

Slach commented 1 month ago

@Mizoguchee this repository is about clickhouse-operator, not about general ClickHouse questions.

Check the system.clusters table on each of the 6 hosts; on every host, your cluster should contain a row with is_local=1 for that host.

and ask in https://github.com/ClickHouse/ClickHouse/issues/
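
A minimal sketch of that check, using the cluster name from the config above:

```sql
-- Run on each of the six instances: exactly one row should have is_local = 1,
-- and its port should be the port this particular instance listens on.
SELECT cluster, shard_num, replica_num, host_name, host_address, port, is_local
FROM system.clusters
WHERE cluster = 'clickhouse_cluster';
```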

Mizoguchee commented 1 month ago

Yes, it shows 1 for each node. It's not 6 hosts; it's three hosts, each with two ports open: one port for replica 1 and one port for replica 2.