citusdata / citus_docs

Documentation for Citus. Distributed PostgreSQL as an extension.
Creative Commons Attribution 4.0 International
58 stars 59 forks source link

Doc changes for MERGE #1082

Closed pinodeca closed 1 year ago

pinodeca commented 1 year ago

Why are we implementing it? (sales eng)

Postgres community released a new feature, in 15.0, that can conditionally insert, update, or delete rows of a table. MERGE performs actions that modify rows in the target table, using the data source. MERGE provides a single SQL statement that can conditionally INSERT, UPDATE or DELETE rows, a task that would otherwise require multiple procedural language statements, using UPSERT with ON CONFLICT clause etc. Sample merge command looks like below, where all modifications are done on the target table.

MERGE INTO target t
USING source s ON (t.customer_id = s.customer_id)
WHEN MATCHED AND t.order_center = 'value' THEN
 -- Remove the customer in target
DELETE
WHEN MATCHED THEN
-- Existing customer, update the order count and last_order_id in target
UPDATE SET    order_count = t.order_count + 1, last_order_id = s.order_id
WHEN NOT MATCHED THEN
-- Add the customer in target
INSERT (customer_id, last_order_id, order_center, order_count, last_order)
VALUES (customer_id, s.order_id, s.order_center, 123, s.order_time);

As you can see, this is a complex SQL statement, so we decided to support this in Citus in a phased manner.

What are the typical use cases?

Communication goals (e.g. detailed howto vs orientation)

Good locations for content in docs structure

How does this work? (devs)

MERGE on distributed tables is compiled, planned by the engine, and is routed to the worker nodes (where the data is present). The merge-modifications are done individually at the individual shard level, all this is done internally (and hidden) by the Citus engine.

Example sql

CREATE TABLE source
(
   order_id        INT,
   customer_id     INT,
   order_center    VARCHAR,
   order_time timestamp
);

CREATE TABLE target
(
   customer_id     INT,
   last_order_id   INT,
   order_center    VARCHAR,
   order_count     INT,
   last_order      timestamp
);
SELECT create_distributed_table('target', 'customer_id');
SELECT create_distributed_table('source', 'customer_id', colocate_with=>'target');
MERGE INTO target t
   USING source s
   ON (t.customer_id = s.customer_id)

   WHEN MATCHED AND t.order_center = 'XX' THEN
       DELETE

   WHEN MATCHED THEN
       UPDATE SET     -- Existing customer, update the order count and last_order_id
           order_count = t.order_count + 1,
           last_order_id = s.order_id

   WHEN NOT MATCHED THEN       -- New entry, record it.
       INSERT (customer_id, last_order_id, order_center, order_count, last_order)
           VALUES (customer_id, s.order_id, s.order_center, 123, s.order_time);

Corner cases, gotchas

Target | Source | Support | Comments -- | -- | -- | -- Local | Local | Yes |   Local | Distributed | No | TBD Local | Reference | No |   Distributed | Local | No | TBD. Distributed | Distributed | Yes | Only co-located tables. Distributed | Reference | In progress |   Reference | NA | No | Reference table as target is not allowed.

Trying a MERGE SQL on Citus reference table will result in ERROR: MERGE command is not supported on reference tables yet

Trying a MERGE SQL on distributed tables that are not colocated in will result in ERROR: For MERGE command, all the distributed tables must be colocated

Trying a MERGE SQL on distributed tables that are not joined on their respective distribution columns will result in ERROR: MERGE command is only supported when all distributed tables are co-located and joined on their distribution columns

Updating the distribution column in the MERGE actions is not allowed and will result in ERROR: updating the distribution column is not allowed in MERGE actions

Are there relevant blog posts or outside documentation about the concept/feature?

For more details, please look here: PostgreSQL: Documentation: 15: MERGE.

Link to relevant commits and regression tests if applicable

Phase-I: Support MERGE with local tables by tejeswarm · Pull Request #6365 · citusdata/citus (github.com) Phase-II: Support MERGE on distributed tables with restrictions by tejeswarm · Pull Request #6610 · citusdata/citus (github.com) Phase-III: This implements MERGE phase-III by tejeswarm · Pull Request #6696 · citusdata/citus (github.com)

pinodeca commented 1 year ago

@tejeswarm @onderkalaci Can you please fill in each of the template sections in the description above?