Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse® clusters running on Kubernetes

Adding an invalid label causes whole cluster to be removed #1420

Open alexvanolst opened 6 months ago

alexvanolst commented 6 months ago

Operator version: 0.23.5

Adding an invalid label to a pod template eventually causes the operator to delete all StatefulSets during reconciliation, regardless of settings.

I have the following settings:

    runtime:
        reconcileCHIsThreadsNumber: 10
        reconcileShardsThreadsNumber: 5
        reconcileShardsMaxConcurrencyPercent: 50
        threadsNumber: 0
    statefulSet:
        create:
            onFailure: abort
        update:
            timeout: 300
            pollInterval: 5
            onFailure: rollback
    host:
        wait:
            exclude: "true"
            queries: "true"
            include: "false"

After adding an invalid label to spec.templates.podTemplates[0].metadata.labels, e.g. some_bad_label: '/metrics', the operator tries to recreate the StatefulSets but encounters the following error:

E0508 13:36:35.377188       1 creator.go:46] createStatefulSet():StatefulSet create failed. err: StatefulSet.apps "chi-clickhouse-store-0-0" is invalid: spec.template.labels: Invalid value: "/metrics": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')

Expected behavior: after failing to create the StatefulSet, the operator either aborts or rolls back.

Actual behavior: after some period of time, the operator moves on to the next StatefulSet, until all of them are deleted (and none are recreated, due to the error).
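
For reference, a minimal sketch of a ClickHouseInstallation carrying such a label (names and layout are illustrative, inferred from the StatefulSet name in the error above):

    apiVersion: "clickhouse.altinity.com/v1"
    kind: "ClickHouseInstallation"
    metadata:
      name: "clickhouse"        # StatefulSets are named chi-<chi-name>-<cluster>-<shard>-<replica>
    spec:
      configuration:
        clusters:
          - name: "store"
            layout:
              shardsCount: 1
      templates:
        podTemplates:
          - name: default
            metadata:
              labels:
                some_bad_label: "/metrics"   # invalid: '/' is not allowed in a label value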

sunsingerus commented 6 months ago

Please check these behaviors: https://github.com/Altinity/clickhouse-operator/blob/bbbf66a8e0fbbcf36b787a63eceeaca37e0ec272/config/config.yaml#L256
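
For readers without the repo at hand, the failure-handling knobs in that section of config.yaml look roughly like the following (paraphrased; the option lists are from memory, so verify against the linked commit):

    reconcile:
      statefulSet:
        create:
          # When a newly created StatefulSet does not become ready in time:
          #   abort  - stop reconciliation and wait for an admin
          #   delete - delete the problematic StatefulSet
          #   ignore - pretend nothing happened and move on to the next one
          onFailure: ignore
        update:
          timeout: 300
          pollInterval: 5
          # When an updated StatefulSet does not become ready in time:
          #   abort    - stop reconciliation and wait for an admin
          #   rollback - roll the StatefulSet back to its previous generation
          #   ignore   - move on to the next StatefulSet
          onFailure: rollback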

sunsingerus commented 6 months ago

Try to modify

update:
    onFailure: rollback

to

update:
    onFailure: abort

sunsingerus commented 6 months ago

rollback needs to be checked

alexvanolst commented 6 months ago

@sunsingerus

I checked this with

update:
    onFailure: abort

and I still get the exact same behavior. After ~15 minutes it continues.
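
For others hitting this: a possibly relevant CHI-level knob (not verified against this particular bug) is the cleanup policy for objects that fail reconciliation, which is meant to control whether failed StatefulSets are retained or deleted:

    spec:
      reconciling:
        cleanup:
          reconcileFailedObjects:
            statefulSet: Retain   # keep StatefulSets that failed to reconcile
            pvc: Retain
            configMap: Retain
            service: Retain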