citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.67k stars 671 forks source link

Create a Citus healthcheck function/view for basic sanity checks #4276

Open thanodnl opened 4 years ago

thanodnl commented 4 years ago

As a distributed database Citus relies on the Citus catalog tables to be consistent with the shards existing on the workers. This is guaranteed by executing anything modifying this in distributed transactions.

Every once in a while we run into clusters where the metadata is not consistent with the physical data on the workers. Having a health check function that can quickly give an overview of everything the is not in a consistent state can be very beneficial during these debugging sessions to get a quick overview of every shard or any other distributed object being inconsistent on workers compared to the coordinator metadata.

With a function like this it could also be sampled by automation to get a view overtime to quickly diagnose when the cluster entered an inconsistent state. This could greatly improve the determination when the time of origin was.

The exact output of a function/view for this needs to be discussed and can take multiple forms. Ideally it would be as versatile as pg_stat_activity is for tracking the state of backends. It could output all tracked objects per worker. This will turn into a big view. The benefit is we can use SQL to quickly filter and analyse. Unfortunately it might be a function with a lot of network traffic which might not all be required for the final result. I don't think we can easily prune that down based on applied filters.

An example output could be:

object type fully qualified name worker status
shard public.mytable_100200 worker1 available
shard public.mytable_100201 worker 2 available
shard public.mytable_100202 worker 3 missing
type public.mytype worker 1 consistent
type public.mytype worker 2 inconsistent
type public.mytype worker 3 missing

Every tracked object needs to provide some functionality on verifying its state on every worker. We can start with shards and slowly increase the coverage for every supported object.

The view above could easily be grouped by worker and status to quickly guage the consistency per worker. It could also be filtered to only show inconsistent items etc.

We would want to standardize on a limited set (enum?) of status' for easy use in monitoring.

SaitTalhaNisanci commented 4 years ago

It could make sense to also check if any node in the cluster can connect to every other node for each user/certificate connection. If not, this might be a problem in repartition joins etc where a worker node needs to connect to another worker node.