Open hanefi opened 2 years ago
https://github.com/citusdata/citus_docs/pull/1038 includes docs for citus_check_cluster_node_health()
itself, but we're still lacking content in useful diagnostic queries and the cluster management page like you suggsted.
Why are we implementing it? (sales eng)
Citus MX and (in general Citus) requires the nodes to be able to connect to each other. Citus now provides a simple UDF that provides the connectivity information across the whole cluster.
We can check for connectivity issues regularly in our managed database offerings. This can help detect network issues before they are reported by the customer via a ticket.
What are the typical use cases?
Communication goals (e.g. detailed howto vs orientation)
Good locations for content in docs structure
Useful Diagnostic Queries
How does this work? (devs)
citus_check_cluster_node_health()
returns setof records with names and ports for the source and target nodes, and the connectivity check result.There are 3 possible result column values:
true
: connection attempt from source to target succeeded.false
: connection attempt from source to target failed.NULL
: connection attempt from the current node to source node failed.Example sql
This UDF can return many rows on large clusters. We suggest operators to run the following query that returns a single boolean first. If this one is true, all the checks were successful. Otherwise, the operator should look into the connectivity report on all the worker pairs.
For a quick summary, one can also use the following query to get the distinct counts of connectivity results in the cluster.
To check all the results for the connectivity checks, run the following:
Corner cases, gotchas
It is risky to use the following query to check for an aggregate summary:
SELECT bool_and(result) FROM citus_check_cluster_node_health();
because bool_and aggregate returns true even if some of the rows were NULL. Usingcoalesce(result, false)
is crucial here.Are there relevant blog posts or outside documentation about the concept/feature?
Link to relevant commits and regression tests if applicable
https://github.com/citusdata/citus/pull/5509