cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

PCR: behavior when performing fast cutback from multiple standbys is possibly incorrect #131947

Open davidwding opened 1 month ago

davidwding commented 1 month ago

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

Consider the following scenario with primary cluster A, with two standbys B and C:

  1. Start PCR from cluster A to two clusters B and C
  2. Complete cutover on clusters B and C
  3. Start PCR from B back to A using fast cutback, and complete cutover

At this point, what happens if we stop the service on A again and then attempt to start PCR from C back to A using fast cutback? Cluster A gets rewound to the point where cluster C completed cutover and then consumes changes from C from that point on.
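
To make the scenario above concrete, here is a toy model of the intended behavior. It is only a sketch and not CockroachDB's implementation: the `Cluster` class, `start_pcr`, `fast_cutback`, and the integer timestamps are all hypothetical, and a cluster is reduced to an ordered log of `(timestamp, origin)` writes. In this idealized model the second fast cutback leaves A matching C exactly; the open question in this issue is whether the real rewind does the same.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster: just an ordered log of writes."""
    name: str
    log: list = field(default_factory=list)  # entries: (timestamp, origin)

    def write(self, ts):
        self.log.append((ts, self.name))


def start_pcr(source, standby):
    # Step 1 (simplified): the standby becomes a copy of the source's history.
    standby.log = list(source.log)


def fast_cutback(source, target, cutover_ts):
    # Rewind the target to the source's cutover timestamp, then
    # consume everything the source wrote after that point.
    kept = [w for w in target.log if w[0] <= cutover_ts]
    replayed = [w for w in source.log if w[0] > cutover_ts]
    target.log = kept + replayed


def run_scenario():
    a, b, c = Cluster("A"), Cluster("B"), Cluster("C")
    a.write(1)
    a.write(2)
    start_pcr(a, b)                  # step 1: PCR from A to B
    start_pcr(a, c)                  # step 1: PCR from A to C
    cutover_ts = 2                   # step 2: assume B and C both cut over at ts=2
    a.write(3)                       # a write on A that the standbys never saw
    b.write(4)                       # B diverges after its cutover
    c.write(5)                       # C diverges after its cutover
    fast_cutback(b, a, cutover_ts)   # step 3: PCR from B back to A
    a.write(6)                       # A serves traffic again after cutover
    fast_cutback(c, a, cutover_ts)   # the case in question: PCR from C back to A
    return a, b, c


a, b, c = run_scenario()
print(a.log)
```

In this model, A ends with the log `[(1, 'A'), (2, 'A'), (5, 'C')]`: the write replicated from B at ts=4 and A's own writes at ts=3 and ts=6 are all discarded by the rewind. Any deviation from that in the real system would be the incorrect behavior this issue is asking about.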

However, after chatting with @dt, he thinks this rewind may not be done correctly. David detailed the following sequence of events:

Some testing/digging is the next step here.

To Reproduce

Steps to set up the scenario are detailed above.

Environment:

Additional context n/a

Jira issue: CRDB-42759

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/disaster-recovery