Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes
https://altinity.com
Apache License 2.0

Re-Creating node from scratch does not copy tables for the Postgres and Kafka engines #1455

Open Hubbitus opened 1 month ago

Hubbitus commented 1 month ago

We use your operator to manage our ClickHouse cluster. Thank you.

After a hardware failure, we reset the PVC (and the ZooKeeper namespace) to re-create one ClickHouse node.

Most of the metadata, such as views, materialized views, and tables with most engines (MergeTree, ReplicatedMergeTree, etc.), was successfully re-created on the node, and replication started.

However, none of the tables using the Postgres and Kafka engines were re-created. Is this a bug, or do we need to use some commands or hacks to sync all metadata across the cluster?

alex-zaitsev commented 1 month ago

@Hubbitus, were you using the latest release, 0.23.6, or an earlier one?

Hubbitus commented 1 month ago

@alex-zaitsev, thank you for the response.

That was on an older version. We have now updated the operator. What is the correct way to re-init a node? Is it enough to just delete the PVC of the failed node and delete the pod?

alex-zaitsev commented 1 month ago

@Hubbitus, if you want to re-init the existing node, delete the STS, PVC, and PV, and start a reconcile. Do you have multiple replicas?

Hubbitus commented 1 month ago

@alex-zaitsev, thank you for the reply.

I understand how to delete the objects. But what do you mean by starting a reconcile?

I have two replicas, chi-gid-gid-0-0-0 and chi-gid-gid-0-1-0, and chi-gid-gid-0-0-0 is now malfunctioning. I want to re-init it from the data in chi-gid-gid-0-1-0, and that should include syncing all:

alex-zaitsev commented 3 weeks ago

@Hubbitus, we have released 0.23.7, which is more aggressive about re-creating the schema. You may try deleting the PVC/PV completely and letting the operator re-create the objects.
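
The steps discussed in this thread (delete the StatefulSet and PVC of the failed replica, then start a reconcile) could be sketched roughly as follows. This is a minimal sketch under assumptions, not the operator's documented procedure: the namespace and PVC name are hypothetical placeholders, the CHI name `gid` and StatefulSet name `chi-gid-gid-0-0` are inferred from the pod name `chi-gid-gid-0-0-0`, and changing `spec.taskID` on the ClickHouseInstallation is assumed to be an acceptable way to trigger a reconcile in your operator version. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: re-initialize one replica managed by clickhouse-operator.
# All resource names passed in are assumptions; check them with
# `kubectl get sts,pvc,chi -n <namespace>` before running for real.
set -eu

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: reinit_replica NAMESPACE CHI_NAME STATEFULSET PVC_NAME
reinit_replica() {
  local ns="$1" chi="$2" sts="$3" pvc="$4"
  # Any new value works; a timestamp keeps it unique (assumed reconcile trigger).
  local task_id="${TASK_ID:-reinit-$(date +%s)}"

  # 1. Delete the StatefulSet so its pod stops and releases the volume.
  run kubectl -n "$ns" delete statefulset "$sts"
  # 2. Delete the PVC. The PV follows automatically only when its reclaim
  #    policy is Delete; otherwise it has to be deleted separately.
  run kubectl -n "$ns" delete pvc "$pvc"
  # 3. Change spec.taskID on the CHI so the operator reconciles it.
  run kubectl -n "$ns" patch chi "$chi" --type merge \
    -p "{\"spec\":{\"taskID\":\"$task_id\"}}"
}
```

For example, `reinit_replica clickhouse gid chi-gid-gid-0-0 <pvc-name>` prints the three commands for review; running it again with `DRY_RUN=0` executes them. Dry run is the default because deleting the PVC is destructive.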

Hubbitus commented 1 day ago

@alex-zaitsev, thank you very much! I finally got the operator updated on our cluster:

kub_dev get pods --all-namespaces -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" -l app=clickhouse-operator
altinity/clickhouse-operator:0.23.7 altinity/metrics-exporter:0.23.7

Then I did the following in ArgoCD:

After that, the PVC was re-created.

I see that the pod is up and running.

  1. But there are a lot of errors like 2024.09.04 23:50:34.382651 [ 712 ] {} <Error> Access(user directories): from: 10.42.9.104, user: data_quality: Authentication failed: Code: 192. DB::Exception: There is no user data_quality in local_directory. (UNKNOWN_USER).... So users are not copied.
  2. Tables also do not appear to be synced:
    SELECT hostname() as node, COUNT(*)
    FROM clusterAllReplicas('{cluster}', system.tables)
    WHERE database NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system')
    GROUP BY node
    node               count()
    chi-gid-gid-0-1-0  620

There is also an error in the log like: 2024.09.04 23:52:49.039132 [ 714 ] {bb628508-db8e-4cf9-8307-a13133a185c9} <Error> PredefinedQueryHandler: Code: 60. DB::Exception: Table system.operator_compatible_metrics does not exist. (UNKNOWN_TABLE) - so even in the system database some tables are missing...

So I see only the tables in information_schema on the first node.
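
To narrow down which engines are affected (for example Kafka or PostgreSQL tables), the per-node count from the query in this comment can be split by engine. The following is a minimal sketch, assuming `kubectl` access to a ClickHouse pod and that `clickhouse-client` inside the pod can connect without extra credentials; `NAMESPACE`, `POD`, and `RUN` are hypothetical variables introduced here. By default it only prints the query.

```shell
#!/usr/bin/env bash
# Sketch: per-node, per-engine table counts, to spot engines missing on a replica.
set -eu

# Same filter as the query in the thread, with "engine" added to the grouping.
engine_breakdown_query() {
  cat <<'SQL'
SELECT hostname() AS node, engine, count() AS table_count
FROM clusterAllReplicas('{cluster}', system.tables)
WHERE database NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system')
GROUP BY node, engine
ORDER BY engine, node
SQL
}

# Print the query by default. With RUN=1, execute it through kubectl;
# NAMESPACE and POD are assumptions that the caller must supply.
show_engine_breakdown() {
  if [ "${RUN:-0}" = "1" ]; then
    kubectl -n "${NAMESPACE:?set NAMESPACE}" exec "${POD:?set POD}" -- \
      clickhouse-client --query "$(engine_breakdown_query)"
  else
    engine_breakdown_query
  fi
}
```

Calling `show_engine_breakdown` prints the SQL so it can be pasted into any client; `RUN=1 NAMESPACE=<ns> POD=chi-gid-gid-0-1-0 show_engine_breakdown` would run it from the healthy replica. An engine that appears for one node but has no row for the other points at tables that were not re-created there.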