The tenant checksumming tool being built in https://github.com/cockroachdb/cockroach/issues/89355 requires that the user manually run the checksum command on both primary and standby clusters. Further, it requires that they do so at a point in time when the entire time range they would like to checksum is available on both the primary and standby.
Cluster to cluster streaming could instead fingerprint both clusters as the streaming frontier advances. On the standby (receiving cluster), another process would do the following:
Assume we have already verified all data up to the current frontier (t_0) on the standby and the primary, that the verified data has checksum c_0, and that we hold a protected timestamp at t_0.
When the frontier advances to t_1 on the standby, an asynchronous process can then start calculating the checksum from t_0 to t_1 (c_1) on the primary.
Once c_1 is calculated on the primary, the plan used on the primary is then used to calculate c_1 on the standby.
If the checksums match, we store c_1 and t_1 as the new verified checksum and timestamp.
The primary and standby will then advance their protected timestamps from t_0 to t_1.
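The steps above can be sketched roughly as follows. This is a minimal illustration, not the streaming implementation: `KV`, `Cluster`, `Verifier`, and `Advance` are hypothetical names, timestamps are plain integers, and the fingerprint is an XOR of per-KV FNV hashes chosen only so the result does not depend on scan order.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// KV is a hypothetical versioned key-value pair with a simplified integer timestamp.
type KV struct {
	Key, Value string
	Timestamp  int64
}

// Cluster is a hypothetical stand-in for one tenant's data on one cluster.
type Cluster struct {
	Data        []KV
	ProtectedTS int64
}

// Fingerprint checksums every KV with start < Timestamp <= end.
// XOR-ing per-KV hashes makes the result independent of scan order.
func (c *Cluster) Fingerprint(start, end int64) uint64 {
	var sum uint64
	for _, kv := range c.Data {
		if kv.Timestamp > start && kv.Timestamp <= end {
			h := fnv.New64a()
			fmt.Fprintf(h, "%q/%q/%d", kv.Key, kv.Value, kv.Timestamp)
			sum ^= h.Sum64()
		}
	}
	return sum
}

// ErrMismatch reports that the two clusters disagree over an interval.
var ErrMismatch = errors.New("fingerprint mismatch between primary and standby")

// Verifier holds the last verified timestamp (t_0) and its checksum (c_0).
type Verifier struct {
	Verified int64
	Checksum uint64
}

// Advance verifies (Verified, frontier] on both clusters. On success it stores
// the new checksum and timestamp and moves both protected timestamps forward.
func (v *Verifier) Advance(primary, standby *Cluster, frontier int64) error {
	if frontier <= v.Verified {
		return nil // nothing new to verify
	}
	c1 := primary.Fingerprint(v.Verified, frontier)
	if standby.Fingerprint(v.Verified, frontier) != c1 {
		return ErrMismatch // keep t_0, c_0, and the protected timestamps as they were
	}
	v.Verified, v.Checksum = frontier, c1
	primary.ProtectedTS, standby.ProtectedTS = frontier, frontier
	return nil
}
```

On a mismatch the sketch leaves the protected timestamps at t_0, so the interval stays available for investigation; whether the real process should do the same is an open design choice.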
For large, existing tenants, calculating the initial checksum c_0 (covering all data up to t_0) may still take considerable time. Some mitigations we might consider for this:
As the stream proceeds, we can calculate the checksum from t_1 to t_2 even if we haven’t calculated the checksum from t_0 to t_1. This may allow us to keep up with ongoing checksums even as the initial checksums complete.
We could provide an option to skip fingerprinting the initial scan. In the cluster to cluster use case, the majority of data over the life of the cluster will be data transferred after the initial scan.
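The first mitigation implies tracking intervals that verify out of order. A minimal sketch of that bookkeeping, assuming intervals tile the timeline exactly (each starts where the previous one ends); `IntervalTracker` and `MarkVerified` are hypothetical names:

```go
package main

// IntervalTracker records verified intervals that may complete out of order.
// Frontier is the timestamp up to which verification is contiguous.
type IntervalTracker struct {
	Frontier int64
	pending  map[int64]int64 // verified but not yet contiguous: start -> end
}

// NewIntervalTracker starts with nothing verified beyond start.
func NewIntervalTracker(start int64) *IntervalTracker {
	return &IntervalTracker{Frontier: start, pending: map[int64]int64{}}
}

// MarkVerified records that (start, end] matched on both clusters and
// returns the new contiguous verified frontier.
func (t *IntervalTracker) MarkVerified(start, end int64) int64 {
	if end <= start {
		return t.Frontier // ignore empty or inverted intervals
	}
	t.pending[start] = end
	// Absorb every pending interval that begins at the current frontier.
	for {
		e, ok := t.pending[t.Frontier]
		if !ok {
			break
		}
		delete(t.pending, t.Frontier)
		t.Frontier = e
	}
	return t.Frontier
}
```

For example, verifying (t_1, t_2] before (t_0, t_1] leaves the frontier at t_0 until the earlier interval completes, at which point it jumps to t_2. Under this sketch the protected timestamp could only advance as far as the contiguous frontier.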
Jira issue: CRDB-20215
Epic CRDB-18750