stevendanna opened 2 years ago
cc @cockroachdb/disaster-recovery
Backfilling some more discussion that occurred offline about making distributed tenant fingerprinting a standalone job instead of tying it to the running streaming job:
Rough outline / whiteboarding:
1. The user creates a fingerprint job on the secondary, specifying a particular tenant, a timestamp/time bound, and the stream address of the primary.
2. The secondary reaches out to the primary and creates a fingerprint job there.
3. The primary sets a protected timestamp at the timestamp it needs to fingerprint from/at.
4. The secondary reaches out to the primary for a "batch topology" that outlines the "batches" the primary is going to fingerprint. These batches are created based on the output of PartitionSpans on the primary (see the Go sketch after this list).
5. In addition to returning the "batch topology", the primary uses the topology to start a distributed flow that fingerprints the batches and generates a checksum per batch.
6. In parallel, on receiving the "batch topology", the secondary starts a similar distributed flow to fingerprint the batches specified in the topology.
7. The primary and secondary periodically send progress updates to their respective jobs, checkpointing the batches they have fingerprinted.
8. At some cadence (or at the end), the secondary reaches out to the primary to compare the fingerprints of each batch.
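To make the exchange concrete, here is a minimal Go sketch of the shapes involved. Every type and field name below is hypothetical, for illustration only; a real implementation would use CockroachDB's own span and timestamp types (e.g. an HLC timestamp rather than `time.Time`).

```go
package fingerprint

import "time"

// Batch is one unit of fingerprinting work, derived from PartitionSpans
// on the primary. Both clusters fingerprint the same batches so the
// per-batch checksums are directly comparable.
type Batch struct {
	SpanStart, SpanEnd []byte // tenant keyspace bounds of the batch
}

// BatchTopology is what the primary returns to the secondary in step 4.
type BatchTopology struct {
	TenantID uint64
	AsOf     time.Time // timestamp the fingerprint is computed at
	Batches  []Batch
}

// Progress is the periodic checkpoint each side records on its own job
// (step 7): which batches are done and their checksums so far.
type Progress struct {
	CompletedBatches map[int]uint64 // batch index -> per-batch checksum
}
```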
We want to be able to compute the fingerprint for an entire tenant, based on the contents of that tenant.
The tenant fingerprint statement should:
- [ ] Provide a hash whose input is the entire tenant keyspace between two points in time.
- [ ] Be invariant to tenant ID.
- [ ] Return enough data to recalculate the fingerprint on another tenant.
Proposed Fingerprint Calculation
The statement (syntax up for debate), where $1 is a tenant ID and $2 and $3 are timestamps, would do the following:

1. Plan a distributed flow in which each processor is assigned a portion of the tenant's keyspace.
2. Each processor calculates `fnv64((strip_tenant_prefix(key), strip_checksum(value))...)` for all KVs in its span.
3. Each processor returns `(span, start_time, end_time, fingerprint)` to the flow's result writer.
4. The result writer calculates `fnv64((span, fingerprint)...)`.

The statement would then return `fingerprint, start_time, end_time, plan`, where the fingerprint is the value calculated in (4). The plan is a representation of the individual `(span, fingerprint)` results returned by the processors. (A Go sketch of steps 2 and 4 follows below.)

That plan can then be used in an alternate form of the statement in order to recalculate the checksum using the same span division.
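A minimal, self-contained Go sketch of steps 2 and 4, using `hash/fnv` from the standard library. The `stripTenantPrefix` and `stripChecksum` helpers are hypothetical stand-ins for the real key and value munging.

```go
package fingerprint

import (
	"encoding/binary"
	"hash/fnv"
)

type kv struct{ key, value []byte }

// Hypothetical stand-ins; in CockroachDB the real logic would live in
// the key-encoding and storage layers.
func stripTenantPrefix(key []byte) []byte { return key }   // placeholder
func stripChecksum(value []byte) []byte   { return value } // placeholder

// processorFingerprint is step 2: hash every KV in the processor's span
// after removing the tenant prefix from the key (so the result is
// invariant to tenant ID) and the checksum from the value.
func processorFingerprint(kvs []kv) uint64 {
	h := fnv.New64()
	for _, p := range kvs {
		h.Write(stripTenantPrefix(p.key))
		h.Write(stripChecksum(p.value))
	}
	return h.Sum64()
}

// combineFingerprints is step 4: the result writer folds each
// (span, fingerprint) pair into a single statement-level fingerprint.
// Note that the result depends on the span boundaries and their order,
// which is why the plan must be returned to recalculate it elsewhere.
func combineFingerprints(spans [][]byte, fps []uint64) uint64 {
	h := fnv.New64()
	var buf [8]byte
	for i, span := range spans {
		h.Write(span)
		binary.BigEndian.PutUint64(buf[:], fps[i])
		h.Write(buf[:])
	}
	return h.Sum64()
}
```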
Do we need to return the plan?
We could alternatively calculate a fingerprint in a way that is invariant to how the tenant span is partitioned. For instance, we could hash each KV individually, xor those hashes on each processor, return the xor to the result writer, and return an xor of the xors as the fingerprint. But we think there is some benefit to being able to narrow down the location of a mismatch by inspecting the fingerprints of the subspans, perhaps using some alternate syntax like `SHOW DETAILED FINGERPRINT FOR ...`.
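For concreteness, a small self-contained Go program demonstrating why the xor-of-hashes alternative is invariant to partitioning: xor is commutative and associative, so the final value does not depend on how the KVs were split across processors. (Plain fnv64 over each KV here, purely for illustration.)

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKV hashes a single KV pair independently of its neighbors.
func hashKV(key, value []byte) uint64 {
	h := fnv.New64()
	h.Write(key)
	h.Write(value)
	return h.Sum64()
}

func main() {
	kvs := [][2][]byte{
		{[]byte("a"), []byte("1")},
		{[]byte("b"), []byte("2")},
		{[]byte("c"), []byte("3")},
	}

	// One processor handling the whole span.
	var whole uint64
	for _, p := range kvs {
		whole ^= hashKV(p[0], p[1])
	}

	// Two processors splitting the span differently; the result writer
	// xors the per-processor xors.
	p1 := hashKV(kvs[0][0], kvs[0][1])
	p2 := hashKV(kvs[1][0], kvs[1][1]) ^ hashKV(kvs[2][0], kvs[2][1])

	fmt.Println(whole == p1^p2) // true: invariant to partitioning
}
```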
Jira issue: CRDB-20213

Epic: CRDB-18750