When performing individual DML commands, the router executor currently modifies and commits each replica one by one, rather than opening a transaction block on every replica and committing only after the modification has succeeded on all of them. While this has latency benefits in a few situations, it is not a sustainable model, complicates the code, and lowers reliability and uptime.
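For concreteness, a minimal sketch of that one-replica-at-a-time flow in Python (the real router executor is C inside the Citus extension); the DSNs, the shard table name, and the inactive-marking step below are hypothetical placeholders, not Citus APIs:

```python
# Sketch of today's flow: modify and commit each placement in turn.
import psycopg2

replica_dsns = ["host=workerA dbname=app", "host=workerB dbname=app"]   # placements of one shard
dml = "UPDATE orders_102008 SET total = total + 10 WHERE id = 42"       # example shard-level DML

for dsn in replica_dsns:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(dml)
        conn.commit()   # committed on this replica before the next one is touched ...
    except psycopg2.Error:
        # ... so once an earlier replica has committed, a failure here cannot be
        # undone anywhere; the failing placement gets marked inactive instead.
        print(f"marking placement {dsn} inactive")   # stand-in for the metadata update
    finally:
        conn.close()
```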
Currently, if we detect an application-level failure after the first replica has committed, there is no way to roll back, and thus we must mark the replica that exhibited the failure as inactive, even if the same failure occurs on all remaining replicas. The subsequent cost of repairing the inactive placements is high, since writes are blocked while the placements are copied.
Consider a scenario in which the user manually adds a trigger with application-level error checks to shard 1, which has placements A, B, and C. Placement A becomes inactive for some reason; the user runs a shard repair operation for placement A, but neglects to also re-create the trigger. A DML command is then issued and succeeds on placement A, but the trigger fails on placements B and C. Now placements B and C, which are the "good" placements with the appropriate trigger, are marked as inactive, and only placement A, the "bad" placement without the trigger, remains. In the end, the trigger logic is not honoured and an expensive repair operation is necessary.
Another scenario is a network blip that happens after the change has been committed on a single replica. If the connections to the other placements fail, those placements have to be marked as inactive, since they would otherwise be out of sync. In the end, an expensive repair operation is again necessary.
Many scenarios like the ones described above exist, and they could be prevented by performing the writes in a transaction block on each placement and committing with 1PC or 2PC, so that the transaction on placement A can be rolled back after a failure on placement B. This would allow us to be much more conservative about changing shardstate, which avoids unnecessary repairs and failures. Inactive shards, and the problems they cause, are something many Citus users have encountered.
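As a rough sketch of the proposed flow (again in Python for illustration; the connection handling and the global transaction name are assumptions), the same write would be applied inside an open transaction block on every placement and finished with 2PC; a 1PC variant would simply send COMMIT instead of PREPARE TRANSACTION / COMMIT PREPARED:

```python
# Sketch of per-placement transaction blocks finished with 2PC.
import psycopg2

replica_dsns = ["host=workerA dbname=app", "host=workerB dbname=app", "host=workerC dbname=app"]
dml = "UPDATE orders_102008 SET total = total + 10 WHERE id = 42"
gid = "citus_tx_example"   # hypothetical global transaction identifier

conns = [psycopg2.connect(dsn) for dsn in replica_dsns]
for conn in conns:
    conn.autocommit = True   # we drive BEGIN / PREPARE / COMMIT explicitly

try:
    # Apply the DML inside an open transaction block on every placement.
    for conn in conns:
        with conn.cursor() as cur:
            cur.execute("BEGIN")
            cur.execute(dml)
    # Phase one: prepare everywhere; a failure (e.g. a trigger error) aborts all of them.
    for conn in conns:
        with conn.cursor() as cur:
            cur.execute(f"PREPARE TRANSACTION '{gid}'")
    # Phase two: commit the prepared transactions everywhere.
    for conn in conns:
        with conn.cursor() as cur:
            cur.execute(f"COMMIT PREPARED '{gid}'")
except psycopg2.Error:
    # Any failure before the first COMMIT PREPARED can still be undone on every
    # placement, so no placement has to be marked inactive.
    for conn in conns:
        for undo in ("ROLLBACK", f"ROLLBACK PREPARED '{gid}'"):
            try:
                with conn.cursor() as cur:
                    cur.execute(undo)
            except psycopg2.Error:
                pass   # nothing to undo on this placement for this step
finally:
    for conn in conns:
        conn.close()
```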
The drawback of using 1PC/2PC is an additional round-trip to send the commit to all nodes after the writes have succeeded, but this commit step can be performed in parallel. In addition, this technique allows us to perform UPDATE/DELETE/UPSERT and non-conflicting INSERT commands in parallel across the replicas, which actually lowers latency.
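The parallel commit (and, for non-conflicting writes, the parallel modification itself) could look roughly like the following, reusing `conns`, `gid`, and `dml` from the sketch above; the helper names are hypothetical:

```python
# Sketch of running the same statement on all placements concurrently,
# e.g. the COMMIT PREPARED round-trip, or a non-conflicting UPDATE itself.
from concurrent.futures import ThreadPoolExecutor

def run_on_placement(conn, sql):
    # Each placement has its own connection, so the threads never share one.
    with conn.cursor() as cur:
        cur.execute(sql)

def run_everywhere(conns, sql):
    with ThreadPoolExecutor(max_workers=len(conns)) as pool:
        futures = [pool.submit(run_on_placement, conn, sql) for conn in conns]
        for future in futures:
            future.result()   # surfaces the first failure, if any

# e.g. run_everywhere(conns, f"COMMIT PREPARED '{gid}'")
#  or  run_everywhere(conns, dml)
```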