Enable shard replication in MX

marcocitus commented 7 years ago

We currently take shard resource locks on the coordinator node to keep replicas consistent and to prevent the deadlocks that could result from running concurrent multi-shard commands. However, holding these locks on the coordinator prevents us from performing replicated writes or multi-shard commands from workers on MX tables, including writes to reference tables and INSERT..SELECT commands, which harms the MX experience. It also causes issue #925.

A way to resolve this would be to move those locks to the workers that store the shards, either by introducing a UDF for taking the advisory lock or by using explicit table locks on the shards. These would be sent prior to issuing the multi-shard command. The locks need to be obtained sequentially and in a consistent order to avoid distributed deadlocks, after which the actual commands can be sent in parallel.
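
To illustrate the ordering argument, here is a minimal Python sketch, not Citus code. It assumes a hypothetical in-memory `shard_locks` table standing in for the per-shard locks held on workers, and a hypothetical `run_multi_shard_command` helper. Locks are taken one at a time in ascending shard id order, and only then are the per-shard commands run in parallel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-shard locks that would live on the workers.
shard_locks = {shard_id: threading.Lock() for shard_id in (102008, 102009, 102010)}


def run_multi_shard_command(shard_ids, command):
    """Lock shards in ascending id order, then run `command` on each in parallel."""
    # Every caller sorts the same way, so no two callers can wait on each
    # other in a cycle.
    ordered = sorted(set(shard_ids))
    acquired = []
    try:
        # Phase 1: acquire the locks sequentially, in the agreed order.
        for shard_id in ordered:
            shard_locks[shard_id].acquire()
            acquired.append(shard_id)
        # Phase 2: all locks are held, so the commands can run concurrently.
        with ThreadPoolExecutor(max_workers=max(1, len(ordered))) as pool:
            return dict(zip(ordered, pool.map(command, ordered)))
    finally:
        # Release whatever was acquired, even if a command raised.
        for shard_id in reversed(acquired):
            shard_locks[shard_id].release()
```

Two callers that pass the same shards in opposite order, for example `[102008, 102009]` and `[102009, 102008]`, both end up locking 102008 first, so neither can hold one lock while waiting for the other.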

An alternative approach is to always route unsupported commands through the coordinator. This could also work for DDL commands. The workers would need to know a coordinator endpoint to send the commands to.

onderkalaci commented 3 years ago

Supporting replicated tables on MX is very similar to supporting reference tables on MX. We always use 2PC, serialize modifications, and make sure that the citus_disable/activate_node() UDFs gracefully handle replicated tables.

To support replicated tables on MX, I suggest the following:

- [x] Drop support for 1PC
  - [x] Drop support for 1PC on citus.multi_shard_commit_protocol #5379
  - [x] Drop support for 1PC on citus.single_shard_commit_protocol #5380
  - [x] Drop SHARD_STATE_INACTIVE. Citus never marks any placement as INACTIVE anymore; everything is done via 2PC if it involves multiple placements/nodes (#5381).
    - [x] Executor changes
    - [x] Transaction hook changes
    - [x] Any other places that we treat differently for reference tables?
- [x] Allow replication factor > 1 tables on MX #5392
- [x] Serialize all the modifications to replicated tables as we do for reference tables by expanding SerializeNonCommutativeWrites()
  - [x] Executor changes
  - [x] Truncate trigger changes
  - [x] Rebalancer changes
  - [x] Any other places that we treat differently for reference tables?
- [x] citus_disable_node deletes the placements on the node, similar to reference table placements
  - [x] When metadata is synced, we expect all nodes to be up. We might need to relax this.
    - [x] citus_activate_node does nothing regarding (under-)replicated tables. Instead, the user should call the rebalancer to make sure tables are properly replicated.
  - [x] We should make sure that replicate_table_shards works fine when some placements are under-replicated, and that it evenly distributes the under-replicated placements to the new nodes.
  - [x] As a semi-related question, what happens when a node is down and we use citus_activate_node on another node, in the MX world? This should work fine.
    - [x] Ignore the replication model for distributed tables, as at this point s vs c doesn't reflect any difference.
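
As a rough illustration of what "evenly distributes the under-replicated placements" could mean, here is a minimal Python sketch. It is not the algorithm used by replicate_table_shards; the `plan_replication` function, its arguments, and the node names are all hypothetical. It greedily sends each missing copy to the least-loaded node that does not already hold the shard.

```python
def plan_replication(placements, nodes, replication_factor):
    """Return (shard_id, target_node) copies needed to reach replication_factor.

    placements: dict mapping shard_id -> set of node names that hold a copy.
    nodes: list of active node names that may receive new copies.
    """
    # Count how many placements each active node currently holds.
    load = {node: 0 for node in nodes}
    for holders in placements.values():
        for node in holders:
            if node in load:
                load[node] += 1

    plan = []
    for shard_id in sorted(placements):
        holders = set(placements[shard_id])
        while len(holders) < replication_factor:
            candidates = [n for n in nodes if n not in holders]
            if not candidates:
                break  # not enough nodes to reach the requested factor
            # Least-loaded node first; the node name breaks ties deterministically.
            target = min(candidates, key=lambda n: (load[n], n))
            plan.append((shard_id, target))
            holders.add(target)
            load[target] += 1
    return plan
```

For example, with shards 1 and 2 only on `w1`, shard 3 on `w1` and `w2`, and a freshly added empty node `w3`, a replication factor of 2 spreads the two missing copies across `w3` and `w2` rather than piling both onto one node.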

onderkalaci commented 2 years ago

I feel confident enough to close the issue. The involved PRs are: #5379, #5380, #5381, #5386, #5392, #5405, #5476, #5469, #5470 and #5486.

Any remaining improvements can be tracked in individual issues.

citusdata / citus

Enable shard replication in MX #1033