cockroachdb / cockroach

server: allow users to run conformance reports for their schemas #100004

Open arulajmani opened 1 year ago

arulajmani commented 1 year ago

Is your feature request related to a problem? Please describe.

Background

Users can configure various properties, such as quorum size, number of non-voters, data placement, etc., on schema objects. They can do so either directly, via zone configurations, or indirectly, using multi-region abstractions (which are internally translated to zone configurations). However, the effects of such changes are applied asynchronously.

The zone configurations table lives in a tenant's keyspace. At a high level, once a zone configuration is committed, it must be converted to a SpanConfig and reconciled to KV (where it lives in the system.span_configurations table). Doing so entails hydrating the zone configuration (by walking up its inheritance chain) to convert it to a SpanConfig. This SpanConfig is then linked with the keyspan associated with the schema object, and persisted in KV using an RPC.
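To make the hydration step concrete, here is a minimal sketch of walking an inheritance chain. The types zoneConfig and spanConfig, their fields, and the hydrate function are hypothetical simplifications for illustration, not the actual CockroachDB types or translation code.

```go
package main

import (
	"errors"
	"fmt"
)

// zoneConfig is a hypothetical, heavily simplified zone configuration.
// A nil field means "inherit this value from the parent".
type zoneConfig struct {
	parent      *zoneConfig
	numReplicas *int
	numVoters   *int
}

// spanConfig is a hypothetical stand-in for a fully populated SpanConfig.
type spanConfig struct {
	numReplicas int
	numVoters   int
}

// hydrate fills every unset field from the nearest ancestor that sets it.
func hydrate(z *zoneConfig) (spanConfig, error) {
	var replicas, voters *int
	for cur := z; cur != nil; cur = cur.parent {
		if replicas == nil {
			replicas = cur.numReplicas
		}
		if voters == nil {
			voters = cur.numVoters
		}
	}
	if replicas == nil || voters == nil {
		return spanConfig{}, errors.New("inheritance chain leaves a field unset")
	}
	return spanConfig{numReplicas: *replicas, numVoters: *voters}, nil
}

func intPtr(v int) *int { return &v }

func main() {
	// default -> database -> table, each overriding a subset of fields.
	def := &zoneConfig{numReplicas: intPtr(3), numVoters: intPtr(3)}
	db := &zoneConfig{parent: def, numReplicas: intPtr(5)}
	table := &zoneConfig{parent: db}

	sc, err := hydrate(table)
	fmt.Println(sc.numReplicas, sc.numVoters, err)
}
```

In this sketch the table sets nothing itself, so it picks up the replica count from its database and the voter count from the default.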

All KV nodes maintain an in-memory, incremental view over system.span_configurations. Once a KV node receives a SpanConfig update, ranges that overlap with the update's keyspans are pushed through various queues (e.g. Split, Merge, Replicate). It's these queues that are responsible for taking action to fulfill the user's intention.

SpanConfigBounds

Until very recently, the async application of user specified configurations was only a matter of time. This changed with the introduction of SpanConfigBounds. SpanConfigBounds were motivated by a desire to disallow secondary tenants unfettered access to multi-region features (or zone configurations) in deployments where operators desire such control (read: serverless).

SpanConfigBounds let operators declare bounds on (almost) all SpanConfig fields at a per-tenant level. These only work for secondary tenants. Operators can use SpanConfigBounds to override tenant-reconciled span configurations by "clamping" any or all fields. For example, operators can constrain a tenant to specific region(s) regardless of what the tenant requested. They can also do so retroactively, after a tenant has successfully committed and reconciled such configurations.
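As an illustration of what "clamping" means, here is a minimal sketch under simplifying assumptions. The spanConfig and bounds types, their fields, and the clamp function are hypothetical; the real SpanConfigBounds cover more fields and have their own semantics. In particular, falling back to the first allowed region when none of the requested regions is permitted is an assumption of this sketch.

```go
package main

import "fmt"

// spanConfig is a hypothetical, simplified span configuration.
type spanConfig struct {
	numReplicas int
	regions     []string
}

// bounds is a hypothetical, simplified set of per-tenant bounds.
// An empty allowedRegions slice means regions are unrestricted.
type bounds struct {
	minReplicas, maxReplicas int
	allowedRegions           []string
}

// clamp returns a copy of cfg forced within b, and reports whether
// anything had to change.
func clamp(cfg spanConfig, b bounds) (spanConfig, bool) {
	out := spanConfig{numReplicas: cfg.numReplicas}
	changed := false
	if out.numReplicas < b.minReplicas {
		out.numReplicas, changed = b.minReplicas, true
	}
	if out.numReplicas > b.maxReplicas {
		out.numReplicas, changed = b.maxReplicas, true
	}
	if len(b.allowedRegions) == 0 {
		out.regions = append(out.regions, cfg.regions...)
		return out, changed
	}
	allowed := make(map[string]bool, len(b.allowedRegions))
	for _, r := range b.allowedRegions {
		allowed[r] = true
	}
	for _, r := range cfg.regions {
		if allowed[r] {
			out.regions = append(out.regions, r)
		} else {
			changed = true // drop regions the operator does not permit
		}
	}
	if len(out.regions) == 0 {
		// Assumption of this sketch: pin to the first allowed region.
		out.regions = []string{b.allowedRegions[0]}
		changed = true
	}
	return out, changed
}

func main() {
	requested := spanConfig{numReplicas: 7, regions: []string{"us-east1", "eu-west1"}}
	b := bounds{minReplicas: 3, maxReplicas: 5, allowedRegions: []string{"us-east1"}}
	effective, changed := clamp(requested, b)
	fmt.Println(effective.numReplicas, effective.regions, changed)
}
```

The returned flag is what a bounds-aware report would need: it marks configurations whose effective form differs from what the tenant asked for.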

Describe the solution you'd like

Arguably, users care more about when their data is in conformance than about a promise that it eventually will be. With the introduction of SpanConfigBounds, tenants no longer have the latter either. This elevates the need to make point-in-time conformance easily observable.

Conveniently, we already have a lot of the pieces needed to provide such conformance reports. We just need to enhance them where necessary, stitch them together, and provide a mechanism to consume the information. This issue asks to do exactly that.

Specifically, users should be able to run conformance reports that give them information about which table(s)/index(es) are in violation of their zone configurations. There should also be 2 variations -- one that takes SpanConfigBounds into account and another that doesn't. This will allow users to discriminate between cases where all they need to do is "just wait" and cases that will never be satisfied.
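One way to read the "just wait" versus "never satisfied" distinction is sketched below. It assumes two inputs per schema object: whether the requested configuration fails the operator's bounds check, and whether the data currently conforms to the effective (possibly clamped) configuration. The status type and classify function are hypothetical, and this is only one possible interpretation of how the two report variations could be combined.

```go
package main

import "fmt"

// status is a hypothetical classification of a schema object's conformance.
type status int

const (
	conforming    status = iota // data matches what the user asked for
	pending                     // not there yet, but waiting should help
	unsatisfiable               // bounds override the request; waiting won't help
)

func (s status) String() string {
	return [...]string{"conforming", "pending", "unsatisfiable"}[s]
}

// classify combines the bounds check with point-in-time conformance.
func classify(failsBounds, conformsToEffective bool) status {
	switch {
	case failsBounds:
		// The requested config is clamped, so it is never applied as written.
		return unsatisfiable
	case conformsToEffective:
		return conforming
	default:
		return pending
	}
}

func main() {
	fmt.Println(classify(false, true))
	fmt.Println(classify(false, false))
	fmt.Println(classify(true, true))
}
```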

High level sketch

We already have a conformance reporter. However, it doesn't give the caller a point-in-time snapshot -- this may need to change if we want stronger guarantees when stitching the report back with SQL state.

Note: this Reporter is not to be confused with the other Reporter in kvserver/reports/reporter.go. The latter construct is older and deprecated, and we don't see much of a future for it any more. It should be removed. (#100180)

The Reporter also doesn't know about SpanConfigBounds yet. We should extend it to return, in its response, a list of SpanConfigs that fail bound checks. This might simply be a matter of giving the Reporter a handle to the BoundsReader, so that it can access a tenant's Bounds and call Check() on them.

The tenant (SQL) is the only thing that has access to both:

  1. What timestamp it's reconciled up to.
  2. How keyspans map back to schema objects (a reverse translation of sorts)

As such, the tenant would be responsible for taking the contents of a SpanConfigConformanceReport (which only associates raw keys to a conformance status) and mapping it back to which tables/indexes are in violation (if any).
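The reverse translation described above could look roughly like the following sketch. Keys are modeled as plain strings and spans as half-open intervals; the span, violation, and object types and the violatingObjects function are hypothetical and do not reflect the actual SpanConfigConformanceReport layout or the tenant's real key encoding.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical half-open key interval [start, end).
type span struct{ start, end string }

// overlaps reports whether two half-open intervals share any key.
func (s span) overlaps(o span) bool { return s.start < o.end && o.start < s.end }

// violation is a hypothetical report entry: a raw keyspan plus a reason.
type violation struct {
	span   span
	reason string
}

// object is a hypothetical schema object (table or index) and its keyspan.
type object struct {
	name string
	span span
}

// violatingObjects maps raw-key violations back to the schema objects
// whose keyspans they overlap.
func violatingObjects(report []violation, objects []object) map[string][]string {
	out := make(map[string][]string)
	for _, v := range report {
		for _, o := range objects {
			if v.span.overlaps(o.span) {
				out[o.name] = append(out[o.name], v.reason)
			}
		}
	}
	return out
}

func main() {
	objects := []object{
		{name: "db.users", span: span{"a", "c"}},
		{name: "db.orders", span: span{"c", "e"}},
	}
	report := []violation{{span: span{"b", "d"}, reason: "under-replicated"}}

	result := violatingObjects(report, objects)
	names := make([]string, 0, len(result))
	for name := range result {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	for _, name := range names {
		fmt.Println(name, result[name])
	}
}
```

The nested loop is quadratic; with many schema objects, sorting the object spans and binary-searching would be a natural refinement.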

I'm not sure what the best way to consume such information is -- maybe a new endpoint users can query? Or, better yet, we could build some sort of DB console page? Alternatively, we could run such a thing periodically and maybe increment some metrics.

cc @ajwerner @knz

Jira issue: CRDB-26176

Epic CRDB-26686

As part of addressing this, we should make sure to delete FIXMEIDONTKNOWWHICHCODECTOUSE usage (introduced as part of #48123).

arulajmani commented 1 year ago

Thanks @ajwerner and @knz for brainstorming some of this stuff with me earlier this week. Feel free to add more thoughts I may have missed or not represented here.

knz commented 1 year ago

@zachlite for your interest. I believe that Andrew, Irfan and I would be delighted to brainstorm with you on this.

arulajmani commented 1 year ago

@knz I'm pulling one of your edits out into its own comment here:

There's also a bug by which the Reporter is unable to reason about logical keyspaces for secondary tenants; it's simply unaware of their schema and produces incoherent results for those ranges (this is tracked in #48123).

This wasn't part of what I had in mind when writing up the issue -- I don't think we need to push tenant schema information into the Reporter, which runs in KV, right?

knz commented 1 year ago

let me correct my edit