citusdata / citus_docs

Documentation for Citus. Distributed PostgreSQL as an extension.
Creative Commons Attribution 4.0 International
58 stars 59 forks source link

Document connectivity check UDFs in 11.0 #1025

Open hanefi opened 2 years ago

hanefi commented 2 years ago

Why are we implementing it? (sales eng)

  1. Improved supportability.

Citus MX and (in general Citus) requires the nodes to be able to connect to each other. Citus now provides a simple UDF that provides the connectivity information across the whole cluster.

  1. Possible improvements in data planes

We can check for connectivity issues regularly in our managed database offerings. This can help detect network issues before they are reported by the customer via a ticket.

What are the typical use cases?

Communication goals (e.g. detailed howto vs orientation)

Good locations for content in docs structure

How does this work? (devs)

citus_check_cluster_node_health() returns setof records with names and ports for the source and target nodes, and the connectivity check result.

There are 3 possible result column values:

Example sql

This UDF can return many rows on large clusters. We suggest operators to run the following query that returns a single boolean first. If this one is true, all the checks were successful. Otherwise, the operator should look into the connectivity report on all the worker pairs.

SELECT bool_and(coalesce(result, false)) FROM citus_check_cluster_node_health();
 bool_and
---------------------------------------------------------------------
 f
(1 row)

For a quick summary, one can also use the following query to get the distinct counts of connectivity results in the cluster.

SELECT result, count(*) FROM citus_check_cluster_node_health() GROUP BY result ORDER BY 1;
 result | count
---------------------------------------------------------------------
 t      |     2
        |     2
(2 rows)

To check all the results for the connectivity checks, run the following:

SELECT * FROM citus_check_cluster_node_health() ORDER BY 1,2,3,4;
 from_nodename | from_nodeport | to_nodename | to_nodeport | result
---------------------------------------------------------------------
 localhost     |         57637 | localhost   |       57637 | t
 localhost     |         57637 | localhost   |       57638 | t
 localhost     |         57638 | localhost   |       57637 | 
 localhost     |         57638 | localhost   |       57638 | 

Corner cases, gotchas

It is risky to use the following query to check for an aggregate summary: SELECT bool_and(result) FROM citus_check_cluster_node_health(); because bool_and aggregate returns true even if some of the rows were NULL. Using coalesce(result, false) is crucial here.

Are there relevant blog posts or outside documentation about the concept/feature?

Link to relevant commits and regression tests if applicable

https://github.com/citusdata/citus/pull/5509

jonels-msft commented 2 years ago

https://github.com/citusdata/citus_docs/pull/1038 includes docs for citus_check_cluster_node_health() itself, but we're still lacking content in useful diagnostic queries and the cluster management page like you suggsted.