Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes
https://altinity.com
Apache License 2.0

Provide the ability to add custom health checks that precede and follow ClickHouse reconfiguration/upgrade #1477

Open hodgesrm opened 1 month ago

hodgesrm commented 1 month ago

We had a recent discussion with a user who has the following use case. They have an operator that watches replication lag on ClickHouse and schedules upgrades only when the lag is below a certain level. They were wondering if the operator could help them implement logic that amounts to "guard conditions" on operator upgrades. It could work as follows:

  1. Permit a ClickHouse replica upgrade operation to start when replication lag is below a "reasonable" level.
  2. Confirm that the replication lag is back to a reasonable level before moving to the next replica.

In this case the guard condition is replica lag, but there would likely be others. They and other users would presumably want to customize the conditions.

It seems as if this could be implemented in a general way by adding a custom health check section to CHI instances that contains one or more SQL queries to run before and after an upgrade. If the check(s) pass, the upgrade proceeds; if they fail, the operator waits. (A rough sketch of what such a section might look like follows the questions below.)

There are some questions with this kind of guard that we need to contemplate.

  1. How would other services learn the status of the cluster health checks?
  2. What happens if replica lag increases suddenly during an upgrade? It could block an incomplete operation indefinitely.
  3. Does it just apply to upgrade or to any reconfiguration that causes a pod restart?
  4. Should we report these health check results in our exported metrics? It would be important to see these visually in, say, Grafana when kicking off an upgrade or diagnosing problems.
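
To make the proposal more concrete, here is a minimal sketch of how such a section might be modeled as Go types in the CHI CRD. All names here (ReconcileChecks, PreCheckSQL, PostCheckSQL, Timeout) are invented for illustration and are not part of the operator today:

```go
// Hypothetical sketch only, not an existing clickhouse-operator API.
package v1

// ReconcileChecks could be attached to the CHI spec and consulted by the
// operator before and after restarting each host.
type ReconcileChecks struct {
	// PreCheckSQL queries must all pass (return zero rows) before a host
	// is taken down for reconfiguration/upgrade.
	PreCheckSQL []string `json:"preCheckSQL,omitempty" yaml:"preCheckSQL,omitempty"`

	// PostCheckSQL queries must all pass (e.g. replication lag is back below
	// a threshold) before the operator moves on to the next host.
	PostCheckSQL []string `json:"postCheckSQL,omitempty" yaml:"postCheckSQL,omitempty"`

	// Timeout bounds how long the operator waits on a failing check before
	// giving up, which relates to question 2 above.
	Timeout string `json:"timeout,omitempty" yaml:"timeout,omitempty"`
}
```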

sunsingerus commented 1 month ago

This is a very interesting topic that requires detailed discussion. However, these are my brief thoughts on the matter.

> How would other services learn the status of the cluster health checks?

  1. "Basic" or k8s health checks are run by k8s, and clickhouse-operator does not see their results on the fly. Those statuses can be fetched from each Pod, though, so "other services" can monitor the Pods directly. The list of ClickHouse pods is provided by clickhouse-operator in the CHI.Status section.
  2. Some kind of "custom health checks" explicitly described in CHI manifests could be introduced, but this requires a good amount of design. The main questions to answer are "should SQL-only checks be allowed, or custom shell commands as well?" and "how should the results of custom checks be exposed?". As to exposure, the results could be published, for example, via a REST endpoint alongside the metrics already exported by the operator. Speaking of metrics raises another question: do we need additional explicit health checks, or can metrics be enhanced to serve this purpose?
> What happens if replica lag increases suddenly during an upgrade? It could block an incomplete operation indefinitely.

clickhouse-operator has timeouts for host reconcile processes. Some additional or "general" timeout(s) should be introduced if we are talking about waiting for external permission to reconcile a host.

> Does it just apply to upgrade or to any reconfiguration that causes a pod restart?

There is no special "upgrade" case at the moment; the operator performs a general reconcile. I'd rather avoid narrow cases and prefer a more generic approach. Introducing some kind of "external hooks" may be the proper way to go; details are to be specified separately.

> Should we report these health check results in our exported metrics? It would be important to see these visually in, say, Grafana when kicking off an upgrade or diagnosing problems.

This question is a good starting point for thinking about how to:

  1. Enhance the metrics functionality and introduce "custom metrics" that can be specified in the CHI manifest and provide a flexible mechanism to run complex tasks on the ClickHouse host (keeping in mind questions like: SQL only, or allow shell scripts as well?).
  2. Provide hooks (general external hooks, or hooks tied only to custom metrics) that control the operator's behavior during reconcile, e.g. block a host reconcile until the specified metric allows it to proceed (see the sketch at the end of this comment).

Anyway, this is a big topic to discuss and plan, which we can start to do here.
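
To illustrate the hook idea from the list above, a reconcile gate could be modeled as a small interface that the reconciler polls before touching a host. This is only a sketch under those assumptions; ReconcileGate, WaitForGate, and the hostFQDN parameter are hypothetical names, not existing operator code:

```go
package hooks

import (
	"context"
	"time"
)

// ReconcileGate is a hypothetical hook consulted by the reconciler before a
// host is restarted. Implementations could evaluate custom metrics, run SQL
// checks, or call an external service.
type ReconcileGate interface {
	// AllowHostReconcile returns nil when the host may be reconciled,
	// or an error describing why not.
	AllowHostReconcile(ctx context.Context, hostFQDN string) error
}

// WaitForGate polls the gate until it allows the reconcile or the context
// (carrying the "general" timeout discussed above) expires.
func WaitForGate(ctx context.Context, gate ReconcileGate, hostFQDN string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := gate.AllowHostReconcile(ctx, hostFQDN); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```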

ondrej-smola commented 1 month ago

Just a quick idea: an option to launch a k8s Job from a user-provided template and wait for its completion (or for some other condition) before starting ClickHouse pod reconciliation. We could augment this Job with well-known env variables (such as the pod to be reconciled and the full list of cluster pods). An important advantage is that all of the configuration (secrets, job.podFailurePolicy, and so on) moves to k8s, which gives us a built-in way to watch and debug it. The Job could even push metrics/logs to a collector during its runtime.
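
A rough sketch of that idea using client-go; the function name, the CHI_RECONCILED_POD variable, and the polling details are invented for illustration, and it assumes the operator already holds the user-provided Job template:

```go
package guard

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// RunGuardJob instantiates a Job from a user-provided template, injects a
// well-known env variable, and blocks until the Job completes or fails.
func RunGuardJob(ctx context.Context, clientset kubernetes.Interface, namespace string, tpl *batchv1.Job, hostPod string) error {
	job := tpl.DeepCopy()
	job.Namespace = namespace
	// Inject the well-known environment variable into every container.
	for i := range job.Spec.Template.Spec.Containers {
		job.Spec.Template.Spec.Containers[i].Env = append(
			job.Spec.Template.Spec.Containers[i].Env,
			corev1.EnvVar{Name: "CHI_RECONCILED_POD", Value: hostPod},
		)
	}

	created, err := clientset.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{})
	if err != nil {
		return fmt.Errorf("creating guard job: %w", err)
	}

	// Poll until the Job reports Complete or Failed; the surrounding context
	// carries the overall timeout for the host reconcile.
	return wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		j, err := clientset.BatchV1().Jobs(namespace).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, c := range j.Status.Conditions {
			if c.Status != corev1.ConditionTrue {
				continue
			}
			switch c.Type {
			case batchv1.JobComplete:
				return true, nil
			case batchv1.JobFailed:
				return false, fmt.Errorf("guard job %s failed: %s", j.Name, c.Message)
			}
		}
		return false, nil
	})
}
```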

com6056 commented 3 weeks ago

👋 Hey folks! I'm an engineer over at Mux, the company asking y'all for this feature 🙌

Overall I think what y'all are saying makes sense. For clarity, I think we only need the ability to execute a single SQL statement before progressing to the next pod, to ensure our strict latency requirements are met.

For example, a query we'd want to run before a pod rolls is something like the following (I used an arbitrary threshold since I'm not 100% sure what replication lag value we'd want to block progress on):

```sql
SELECT value FROM system.asynchronous_metrics WHERE metric = 'ReplicasMaxAbsoluteDelay' AND value > 5
```

I was thinking it could be implemented similarly to how Prometheus alerts work, where any data returned means a failure: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

> Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active for these elements' label sets.

It looks like we already have some logic that runs SQL before it proceeds with the reconcile here: https://github.com/Altinity/clickhouse-operator/blob/d5f265fb4773ec4622b2af13fa52858c9f1e8c15/pkg/controller/chi/worker-chi-reconciler.go#L704

Under the hood, that calls https://github.com/Altinity/clickhouse-operator/blob/d5f265fb4773ec4622b2af13fa52858c9f1e8c15/pkg/model/chi/schemer/schemer.go#L151-L154 which runs https://github.com/Altinity/clickhouse-operator/blob/d5f265fb4773ec4622b2af13fa52858c9f1e8c15/pkg/model/chi/schemer/sql.go#L244-L246.

We could just add another function call right above/below w.completeQueries() to execute the query provided in the custom resource (if it exists) 🎉

I think all we really need is the ability to specify a single SQL statement in the custom resource, maybe in a field like healthcheckSQL or something (I hate naming things; y'all probably have better ideas for a name haha), since you can combine multiple conditions into a single query if you need to halt reconciles based on multiple sources of data. The operator would then simply check whether the query returned any data and, if so, halt progress until no data is returned (or until some external timeout is hit, as mentioned in a previous comment).

Curious what y'all think of this!
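
A minimal sketch of what that check loop could look like on the operator side, assuming a hypothetical healthcheckSQL field and the clickhouse-go v2 database/sql driver (this is not existing operator code):

```go
package guard

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" database/sql driver
)

// waitForHealthcheckSQL blocks until the given query returns no rows
// (Prometheus-alert semantics: any returned row means "not healthy yet"),
// retrying until the context's deadline expires.
func waitForHealthcheckSQL(ctx context.Context, dsn, healthcheckSQL string, interval time.Duration) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		rows, err := db.QueryContext(ctx, healthcheckSQL)
		if err != nil {
			return err
		}
		hasRow := rows.Next()
		readErr := rows.Err()
		rows.Close()
		if readErr != nil {
			return readErr
		}
		if !hasRow {
			return nil // query returned no data: the check passes, reconcile may proceed
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // external timeout hit while the check keeps failing
		case <-ticker.C:
		}
	}
}
```

Returning as soon as the query comes back empty matches the Prometheus-alert semantics above, and the context deadline plays the role of the "general" timeout mentioned earlier in the thread.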

alex-zaitsev commented 2 weeks ago

@com6056, thank you for your input. As an alternative, would it make sense to have this custom SQL as part of the readiness probe? Or even as a special condition for including the host in the load balancer?

I am asking because syncing a replica may take quite a lot of time. The operator has no visibility into what is going on while the check is failing. What would the timeout value be?

Alternatively, we could add replica-wait logic to the operator explicitly, without any custom SQL. In that case the operator would know what is going on: it could not only monitor the replica delay, but also check the replication queue and see whether there is any progress or the host is stuck.
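
A rough sketch of what that built-in wait could look like, under the assumption that delay and queue progress are read from system.replicas; the function name, thresholds, and stuck-detection heuristic are all illustrative:

```go
package guard

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// waitForReplicaCatchUp watches the replica delay directly and also checks
// whether the replication queue is shrinking, so "still catching up" can be
// told apart from "stuck".
func waitForReplicaCatchUp(ctx context.Context, db *sql.DB, maxDelaySeconds uint64, interval time.Duration) error {
	const stuckPolls = 10 // declare the host stuck after this many polls with no queue progress
	var prevQueue uint64
	noProgress := 0

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; ; i++ {
		var delay, queueSize uint64
		row := db.QueryRowContext(ctx,
			"SELECT max(absolute_delay), sum(queue_size) FROM system.replicas")
		if err := row.Scan(&delay, &queueSize); err != nil {
			return err
		}
		if delay <= maxDelaySeconds {
			return nil // replica has caught up; safe to move on to the next host
		}
		if i > 0 && queueSize >= prevQueue {
			noProgress++
		} else {
			noProgress = 0
		}
		if noProgress >= stuckPolls {
			return fmt.Errorf("replication queue is not shrinking (size %d); host looks stuck", queueSize)
		}
		prevQueue = queueSize

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```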

com6056 commented 2 weeks ago

I think the latter idea sounds best, with the operator in full control of the roll based on those metrics. Ideally we don't want pods rolling if the replication delay is above a certain amount, and I don't believe a readiness probe halts the rolling of pods with this setup (not 100% sure how it all works in your operator since you have a special StatefulSet setup); the probe only prevents the pod from being served traffic.