stevendanna opened 2 years ago
cc @cockroachdb/disaster-recovery
Backfilling some more discussion that occurred offline about making distributed tenant fingerprinting a standalone job instead of tying it to the running streaming job:
Rough outline / whiteboarding:
1. The user creates a fingerprint job on the secondary, specifying a particular tenant, a timestamp/time bound, and the stream address of the primary.
2. The secondary reaches out to the primary and creates a fingerprint job there.
3. The primary sets a protected timestamp at the timestamp it needs to fingerprint from/at.
4. The secondary reaches out to the primary for a "batch topology" that outlines the "batches" the primary is going to fingerprint. These batches are created based on the output of PartitionSpans on the primary (see the Go sketch after this list).
5. In addition to returning the "batch topology", the primary uses the topology to start a distributed flow that fingerprints the batches and generates a checksum per batch.
6. In parallel, on receiving the "batch topology", the secondary starts a similar distributed flow to fingerprint the batches specified in the topology.
7. The primary and secondary periodically send progress updates to their respective jobs, checkpointing the batches they have fingerprinted.
8. At some cadence (or at the end), the secondary reaches out to the primary to compare the fingerprints of each batch.
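To make the exchange concrete, here is a minimal Go sketch of the shapes involved. Every type and field name below is hypothetical, for illustration only; a real implementation would use CockroachDB's own span and timestamp types (e.g. an HLC timestamp rather than `time.Time`).

```go
package fingerprint

import "time"

// Batch is one unit of fingerprinting work, derived from PartitionSpans
// on the primary. Both clusters fingerprint the same batches so the
// per-batch checksums are directly comparable.
type Batch struct {
	SpanStart, SpanEnd []byte // tenant keyspace bounds of the batch
}

// BatchTopology is what the primary returns to the secondary in step 4.
type BatchTopology struct {
	TenantID uint64
	AsOf     time.Time // timestamp the fingerprint is computed at
	Batches  []Batch
}

// Progress is the periodic checkpoint each side records on its own job
// (step 7): which batches are done and their checksums so far.
type Progress struct {
	CompletedBatches map[int]uint64 // batch index -> per-batch checksum
}
```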
We want to be able to compute the fingerprint for an entire tenant, based on the contents of that tenant.
The tenant fingerprint statement should:
- [ ] Provide a hash whose input is the entire tenant keyspace between two points in time.
- [ ] Be invariant to tenant ID.
- [ ] Return enough data to recalculate the fingerprint on another tenant.
Proposed Fingerprint Calculation
The statement (syntax up for debate), where $1 is a tenant ID and $2 and $3 are timestamps, would do the following:

1. Plan a distributed flow in which each processor is assigned a portion of the tenant's keyspace.
2. Each processor calculates `fnv64((strip_tenant_prefix(key), strip_checksum(value))...)` for all KVs in its span.
3. Each processor returns `(span, start_time, end_time, fingerprint)` to the flow's result writer.
4. The result writer calculates `fnv64((span, fingerprint)...)`.

The statement would then return `fingerprint, start_time, end_time, plan`, where the fingerprint is the value calculated in (4). The plan is a representation of the individual `(span, fingerprint)` results returned by the processors. (A Go sketch of steps 2 and 4 follows below.)

That plan can then be used in an alternate form of the statement in order to recalculate the checksum using the same span division.
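A minimal, self-contained Go sketch of steps 2 and 4, using `hash/fnv` from the standard library. The `stripTenantPrefix` and `stripChecksum` helpers are hypothetical stand-ins for the real key and value munging.

```go
package fingerprint

import (
	"encoding/binary"
	"hash/fnv"
)

type kv struct{ key, value []byte }

// Hypothetical stand-ins; in CockroachDB the real logic would live in
// the key-encoding and storage layers.
func stripTenantPrefix(key []byte) []byte { return key }   // placeholder
func stripChecksum(value []byte) []byte   { return value } // placeholder

// processorFingerprint is step 2: hash every KV in the processor's span
// after removing the tenant prefix from the key (so the result is
// invariant to tenant ID) and the checksum from the value.
func processorFingerprint(kvs []kv) uint64 {
	h := fnv.New64()
	for _, p := range kvs {
		h.Write(stripTenantPrefix(p.key))
		h.Write(stripChecksum(p.value))
	}
	return h.Sum64()
}

// combineFingerprints is step 4: the result writer folds each
// (span, fingerprint) pair into a single statement-level fingerprint.
// Note that the result depends on the span boundaries and their order,
// which is why the plan must be returned to recalculate it elsewhere.
func combineFingerprints(spans [][]byte, fps []uint64) uint64 {
	h := fnv.New64()
	var buf [8]byte
	for i, span := range spans {
		h.Write(span)
		binary.BigEndian.PutUint64(buf[:], fps[i])
		h.Write(buf[:])
	}
	return h.Sum64()
}
```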
Do we need to return the plan?
We could alternatively calculate a fingerprint in a way that is invariant to how the tenant span is partitioned. For instance, we could hash each KV individually, xor those hashes on each processor, return the xor to the result writer, and return an xor of the xors as the fingerprint. But we think there is some benefit to being able to narrow down the location of a mismatch by inspecting the fingerprints of the subspans, perhaps using some alternate syntax like `SHOW DETAILED FINGERPRINT FOR ...`.
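For concreteness, a small self-contained Go program demonstrating why the xor-of-hashes alternative is invariant to partitioning: xor is commutative and associative, so the final value does not depend on how the KVs were split across processors. (Plain fnv64 over each KV here, purely for illustration.)

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKV hashes a single KV pair independently of its neighbors.
func hashKV(key, value []byte) uint64 {
	h := fnv.New64()
	h.Write(key)
	h.Write(value)
	return h.Sum64()
}

func main() {
	kvs := [][2][]byte{
		{[]byte("a"), []byte("1")},
		{[]byte("b"), []byte("2")},
		{[]byte("c"), []byte("3")},
	}

	// One processor handling the whole span.
	var whole uint64
	for _, p := range kvs {
		whole ^= hashKV(p[0], p[1])
	}

	// Two processors splitting the span differently; the result writer
	// xors the per-processor xors.
	p1 := hashKV(kvs[0][0], kvs[0][1])
	p2 := hashKV(kvs[1][0], kvs[1][1]) ^ hashKV(kvs[2][0], kvs[2][1])

	fmt.Println(whole == p1^p2) // true: invariant to partitioning
}
```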
Jira issue: CRDB-20213

Epic: CRDB-18750