Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes
https://altinity.com
Apache License 2.0

Cluster service deleted on upgrade due to reconcile failure #1452

Open mbrancato opened 1 month ago

mbrancato commented 1 month ago

While performing an upgrade via Helm from 0.23.2 to 0.23.6, I ran across a problem where the cluster service disappeared. I also included a minor upgrade of the altinitystable image, but I don't think that is related.

The important bits in my CHI resource:

spec:
  defaults:
    templates:
      podTemplate: default-clickhouse-pod
      dataVolumeClaimTemplate: default-data-volume
      logVolumeClaimTemplate: default-log-volume
      clusterServiceTemplate: default-service-template
  configuration:
    settings:
      logger/level: information
    clusters:
      - name: events
        layout:
          shardsCount: 1
          replicasCount: 3
        secret:
          auto: "true"
  templates:
    serviceTemplates:
      - name: default-service-template
        generateName: clickhouse-{chi}
        metadata:
          annotations:
            cloud.google.com/load-balancer-type: "Internal"
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
            service.beta.kubernetes.io/azure-load-balancer-internal: "true"
            service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
            service.beta.kubernetes.io/cce-load-balancer-internal-vpc: "true"
        spec:
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
          type: LoadBalancer

When the operator upgraded, it appeared to get stuck attempting to convert clickhouse-events from a LoadBalancer to a ClusterIP. I believe this is somehow related to this commit that changes the default from LoadBalancer to ClusterIP. However, this CHI has always explicitly set the template to use LoadBalancer.

On startup, I saw this in the logs:

I0710 05:05:48.675757       1 service.go:86] CreateServiceCluster():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:foo/clickhouse-events
I0710 05:05:48.676889       1 worker-chi-reconciler.go:907] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 05:05:48.840035       1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 05:05:49.062109       1 worker.go:1480] createService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:OK Create Service: foo/clickhouse-events
I0710 05:05:49.883043       1 worker-chi-reconciler.go:922] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service reconcile successful: foo/clickhouse-events

...

I0710 05:06:25.213119       1 worker-chi-reconciler.go:900] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service found: foo/clickhouse-events. Will try to update
E0710 05:06:25.213168       1 worker-chi-reconciler.go:914] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 05:06:26.384478       1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 05:06:26.584816       1 worker.go:1486] createService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 05:06:27.422151       1 worker-chi-reconciler.go:928] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:FAILED to reconcile Service: foo/clickhouse-events CHI: events
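
The last three log lines suggest a delete-then-create race: the create call is issued about 200 ms after the delete, while the old Service object is still terminating, so the API server rejects it with "object is being deleted ... already exists". The sketch below imitates that sequence against a hypothetical in-memory `fakeAPI` (not the Kubernetes client and not the operator's code) and contrasts it with a variant that polls until the old object is gone before creating the new one. The names `fakeAPI`, `recreateNaive`, and `recreateWithWait` are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists imitates the API server's answer when a create races a pending delete.
var errAlreadyExists = errors.New("object is being deleted: already exists")

// fakeAPI is a hypothetical stand-in for the API server, tracking one Service.
// A deleted object stays visible for a few polls, like an object that is
// still terminating while cleanup (for example of a load balancer) runs.
type fakeAPI struct {
	present     bool // object is visible, including while terminating
	terminating bool
	pollsLeft   int // polls remaining until a pending deletion completes
}

// Delete marks the object as terminating; it does not vanish immediately.
func (a *fakeAPI) Delete() {
	if a.present {
		a.terminating = true
		a.pollsLeft = 3
	}
}

// Exists reports whether the object is still visible. Each call advances the fake clock.
func (a *fakeAPI) Exists() bool {
	if a.terminating {
		a.pollsLeft--
		if a.pollsLeft <= 0 {
			a.present, a.terminating = false, false
		}
	}
	return a.present
}

// Create fails while the old object is still visible.
func (a *fakeAPI) Create() error {
	if a.present {
		return errAlreadyExists
	}
	a.present = true
	return nil
}

// recreateNaive mirrors the sequence in the log: delete, then create at once.
func recreateNaive(a *fakeAPI) error {
	a.Delete()
	return a.Create()
}

// recreateWithWait polls (bounded by maxPolls) until the old object is gone.
// Real code would sleep or back off between polls.
func recreateWithWait(a *fakeAPI, maxPolls int) error {
	a.Delete()
	for i := 0; i < maxPolls; i++ {
		if !a.Exists() {
			return a.Create()
		}
	}
	return errors.New("timed out waiting for deletion")
}

func main() {
	fmt.Println("naive:", recreateNaive(&fakeAPI{present: true}))
	fmt.Println("wait: ", recreateWithWait(&fakeAPI{present: true}, 10))
}
```

Waiting would only address the failed create. It would not explain why the reconcile pass wants a ClusterIP service in the first place.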

The service is now recreated whenever I force a restart of the operator, and then deleted again a minute or so later. It is not recreated until the operator restarts again.

I0710 05:16:25.276854       1 service.go:86] CreateServiceCluster():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:foo/clickhouse-events
I0710 05:16:25.278246       1 worker-chi-reconciler.go:907] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 05:16:25.435511       1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 05:16:25.805221       1 worker.go:1480] createService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:OK Create Service: foo/clickhouse-events
I0710 05:16:26.468825       1 worker-chi-reconciler.go:922] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service reconcile successful: foo/clickhouse-events

...

I0710 05:17:26.904518       1 worker-chi-reconciler.go:900] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service found: foo/clickhouse-events. Will try to update
E0710 05:17:26.904648       1 worker-chi-reconciler.go:914] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 05:17:28.073703       1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 05:17:28.274057       1 worker.go:1486] createService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 05:17:29.119358       1 worker-chi-reconciler.go:928] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:FAILED to reconcile Service: foo/clickhouse-events CHI: events 

Note: When the service is created, it is created correctly as a LoadBalancer, but the second reconciliation pass then attempts to change it to a ClusterIP again.
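
For illustration, here is a minimal sketch of how a type comparison could produce the 'LoadBalancer'=>'ClusterIP' message even though the template sets `type: LoadBalancer`. Kubernetes treats an empty Service `spec.type` as ClusterIP, so if the desired spec built on the second pass were missing the type, it would compare as a type change. Whether the operator's second pass actually loses the template's type is an assumption and is not confirmed by anything in this thread. `effectiveType` and `needsRecreate` are hypothetical helpers, not operator functions.

```go
package main

import "fmt"

// effectiveType applies the Kubernetes rule that an empty Service type means ClusterIP.
func effectiveType(t string) string {
	if t == "" {
		return "ClusterIP"
	}
	return t
}

// needsRecreate reports whether going from the existing type to the desired
// type counts as a type change (hypothetical helper, not operator code).
func needsRecreate(existing, desired string) bool {
	return effectiveType(existing) != effectiveType(desired)
}

func main() {
	// Desired spec carries the template's explicit type: no change detected.
	fmt.Println(needsRecreate("LoadBalancer", "LoadBalancer"))
	// Desired spec lost its type (assumed scenario): seen as LoadBalancer => ClusterIP.
	fmt.Println(needsRecreate("LoadBalancer", ""))
}
```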

Slach commented 1 month ago

Did you upgrade the CRDs separately, as described in https://github.com/Altinity/clickhouse-operator/blob/master/deploy/helm/clickhouse-operator/README.md?

mbrancato commented 1 month ago

@Slach I did not update the CRD. I have done so now, and it still is happening. Do I need to manually set a status.hostsUnchanged value in the CHI status?

% kubectl -n clickhouse get deploy chop-altinity-clickhouse-operator -o yaml | grep "image:"              
        image: altinity/clickhouse-operator:0.23.6
        image: altinity/metrics-exporter:0.23.6
% kubectl get crd clickhouseinstallations.clickhouse.altinity.com -o yaml | grep "clickhouse.altinity.com/chop"
    clickhouse.altinity.com/chop: 0.23.6

I0710 17:16:03.682328       1 service.go:86] CreateServiceCluster():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:foo/clickhouse-events
I0710 17:16:03.683552       1 worker-chi-reconciler.go:907] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 17:16:03.850464       1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 17:16:04.074642       1 worker.go:1480] createService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:OK Create Service: foo/clickhouse-events
I0710 17:16:04.882621       1 worker-chi-reconciler.go:922] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service reconcile successful: foo/clickhouse-events

I0710 17:16:17.088227       1 worker-chi-reconciler.go:900] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service found: foo/clickhouse-events. Will try to update
E0710 17:16:17.088280       1 worker-chi-reconciler.go:914] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 17:16:18.254178       1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 17:16:18.453646       1 worker.go:1486] createService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 17:16:19.295985       1 worker-chi-reconciler.go:928] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:FAILED to reconcile Service: foo/clickhouse-events CHI: events 
mbrancato commented 1 month ago

I tried adding a value to status.hostsUnchanged (that field was the only change compared to the old CRD installed), and it made no difference. The CHOP is still constantly deleting the cluster service.

--- deploy/operatorhub/0.23.2/clickhouseinstallations.clickhouse.altinity.com.crd.yaml  2024-07-10 14:26:54
+++ deploy/operatorhub/0.23.6/clickhouseinstallations.clickhouse.altinity.com.crd.yaml  2024-07-10 14:26:54
@@ -4,14 +4,14 @@
 # SINGULAR=clickhouseinstallation
 # PLURAL=clickhouseinstallations
 # SHORT=chi
-# OPERATOR_VERSION=0.23.2
+# OPERATOR_VERSION=0.23.6
 #
 apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   name: clickhouseinstallations.clickhouse.altinity.com
   labels:
-    clickhouse.altinity.com/chop: 0.23.2
+    clickhouse.altinity.com/chop: 0.23.6
 spec:
   group: clickhouse.altinity.com
   scope: Namespaced
@@ -53,6 +53,11 @@
           type: string
           description: CHI status
           jsonPath: .status.status
+        - name: hosts-unchanged
+          type: integer
+          description: Unchanged hosts count
+          priority: 1 # show in wide view
+          jsonPath: .status.hostsUnchanged
         - name: hosts-updated
           type: integer
           description: Updated hosts count
@@ -172,6 +177,10 @@
                   nullable: true
                   items:
                     type: string
+                hostsUnchanged:
+                  type: integer
+                  minimum: 0
+                  description: "Unchanged Hosts count"
                 hostsUpdated:
                   type: integer
                   minimum: 0