cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

system.replication_constraint_stats reporting incorrect violations #70024

Open smcvey opened 3 years ago

smcvey commented 3 years ago

After setting constraints on the RANGE default zone, system.replication_constraint_stats incorrectly reports system ranges as being in violation of the zone configuration.

To Reproduce

Create 9 nodes, with the following localities:

n1: --locality=region=region1,DC=dc1
n2: --locality=region=region1,DC=dc1
n3: --locality=region=region1,DC=dc1
n4: --locality=region=region1,DC=dc2
n5: --locality=region=region1,DC=dc2
n6: --locality=region=region1,DC=dc2
n7: --locality=region=region2,DC=dc3
n8: --locality=region=region2,DC=dc3
n9: --locality=region=region2,DC=dc3

Then run:

ALTER RANGE default CONFIGURE ZONE USING
num_replicas = 5,
constraints = '{+DC=dc1: 2, +DC=dc2: 2, +region=region2: 1}';

Populating any user-created database replicates its ranges according to these constraints, so those ranges produce no entries in the system.replication_constraint_stats table.

Ranges that belong to the system database do not conform to these constraints, because they follow the RANGE system zone configuration instead. Nevertheless, when querying system.replication_constraint_stats, system ranges can appear as violations, for example:

root@:26257/defaultdb> select * from system.replication_constraint_stats;
  zone_id | subzone_id |    type    |  config   | report_id |        violation_start        | violating_ranges
----------+------------+------------+-----------+-----------+-------------------------------+-------------------
        0 |          0 | constraint | +DC=dc1:2 |         1 | 2021-09-10 14:20:13.682032+00 |               20
        0 |          0 | constraint | +DC=dc2:2 |         1 | 2021-09-10 14:20:13.682032+00 |               15
(2 rows)
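To make the suspected mis-attribution concrete, here is a minimal sketch that checks replica placements against the constraints from the ALTER RANGE default statement. It is not CockroachDB's report code: the function name, the "at least N matching replicas" rule, and both sample placements are assumptions made for illustration only.

```python
# Hypothetical model of the 9-node repro cluster (localities from the report).
NODES = {
    "n1": {"region": "region1", "DC": "dc1"},
    "n2": {"region": "region1", "DC": "dc1"},
    "n3": {"region": "region1", "DC": "dc1"},
    "n4": {"region": "region1", "DC": "dc2"},
    "n5": {"region": "region1", "DC": "dc2"},
    "n6": {"region": "region1", "DC": "dc2"},
    "n7": {"region": "region2", "DC": "dc3"},
    "n8": {"region": "region2", "DC": "dc3"},
    "n9": {"region": "region2", "DC": "dc3"},
}

# Constraints from the RANGE default zone: (locality key, value) -> replicas required.
DEFAULT_CONSTRAINTS = {
    ("DC", "dc1"): 2,
    ("DC", "dc2"): 2,
    ("region", "region2"): 1,
}


def violated_constraints(replica_nodes, constraints):
    """Return the constraints a placement fails, treating each count as a minimum."""
    violated = []
    for (key, value), required in constraints.items():
        matching = sum(1 for node in replica_nodes if NODES[node][key] == value)
        if matching < required:
            violated.append(f"+{key}={value}:{required}")
    return violated


# Assumed placement of a user range that follows the RANGE default constraints.
user_range = ["n1", "n2", "n4", "n5", "n7"]
# Assumed placement of a system range that is not bound by those constraints.
system_range = ["n1", "n4", "n7", "n8", "n9"]

print(violated_constraints(user_range, DEFAULT_CONSTRAINTS))    # []
print(violated_constraints(system_range, DEFAULT_CONSTRAINTS))  # ['+DC=dc1:2', '+DC=dc2:2']
```

Under these assumptions, a system range evaluated against the RANGE default constraints fails exactly the two constraints shown in the config column above, which is consistent with the report checking system ranges against the wrong zone.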

The system.replication_constraint_stats table should not validate system ranges against the RANGE default zone configuration. The zone configuration above should not produce any rows in this table.

Verified on CRDB 21.1.6 and 21.1.8

Jira issue: CRDB-9905

Epic CRDB-32131

Lukens4242 commented 2 years ago

I now have a client experiencing this scenario, and it is causing their internally written health/sanity checks to fail before various maintenance activities can move forward.