apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/

[SUPPORT] Best Practices for Setting Up DeltaStreamer Jobs for PostgreSQL Partitioned Tables with Debezium #11705

Open soumilshah1995 opened 3 months ago

soumilshah1995 commented 3 months ago

Hi everyone,

I’m working with a PostgreSQL table that uses a hash-based partitioning strategy. Here is the setup:

-- Create the main sales table
CREATE TABLE public.sales
(
    salesid SERIAL,
    invoiceid integer,
    itemid integer,
    category text,
    price numeric(10,2),
    quantity integer,
    orderdate date,
    destinationstate text,
    shippingtype text,
    referral text,
    updated_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (salesid, invoiceid)  -- Include the partitioning column in the primary key
) PARTITION BY HASH (invoiceid);

-- Create partitions
CREATE TABLE public.sales_part_0 PARTITION OF public.sales FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE public.sales_part_1 PARTITION OF public.sales FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE public.sales_part_2 PARTITION OF public.sales FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE public.sales_part_3 PARTITION OF public.sales FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Insert data
INSERT INTO public.sales (invoiceid, itemid, category, price, quantity, orderdate, destinationstate, shippingtype, referral)
VALUES
    (101, 1, 'Electronics', 599.99, 2, '2023-11-21', 'California', 'Express', 'Friend'),
    (102, 3, 'Clothing', 49.99, 5, '2023-11-22', 'New York', 'Standard', 'OnlineAd'),
    (103, 2, 'Home & Garden', 199.50, 1, '2023-11-23', 'Texas', 'Express', 'WordOfMouth'),
    (104, 4, 'Books', 15.75, 3, '2023-11-24', 'Florida', 'Standard', 'SocialMedia'),
    (105, 2, 'Home & Garden', 199.50, 1, '2023-11-23', 'Texas', 'Express', 'WordOfMouth');

I have set up a Debezium connector with the following configuration:

name=PostgresConnector
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
database.user=hive
database.dbname=metastore
database.hostname=metastore_db
database.password=hive
database.server.name=hive
table.include.list=public.sales
database.port=5432
plugin.name=pgoutput
tombstones.on.delete=false
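
One detail worth flagging here (a hedged aside, assuming PostgreSQL 13+): with pgoutput, the publication Debezium auto-creates publishes each leaf partition as its own table, which is why a topic per partition shows up below. If you would rather have every change arrive on the single root-table topic, you can pre-create the publication with publish_via_partition_root and point the connector at it; publication.name and publication.autocreate.mode are standard Debezium connector properties:

-- PostgreSQL 13+: changes to any partition are published as changes to
-- the root table public.sales, so Debezium emits them to the single
-- topic hive.public.sales.
CREATE PUBLICATION dbz_publication FOR TABLE public.sales
    WITH (publish_via_partition_root = true);

# Extra connector properties: use the pre-created publication instead of
# letting the connector create its own.
publication.name=dbz_publication
publication.autocreate.mode=disabled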

This configuration creates the following topics:

hive.public.sales
hive.public.sales_part_0
hive.public.sales_part_1
hive.public.sales_part_2
hive.public.sales_part_3

I am looking for recommendations on how to set up DeltaStreamer jobs for each partition topic. What is the best approach to handle this setup effectively?

Thank you!

soumilshah1995 commented 3 months ago

I am assuming the answer would be to set up a MultiTable DeltaStreamer for these topics. I want to make sure this is the correct approach, or whether there are better recommendations.

ad1happy2go commented 3 months ago

@soumilshah1995 Using the MultiTable DeltaStreamer would be the best available option for now.
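
For concreteness, a minimal sketch of what that could look like (class, flag, and property names follow the Hudi utilities bundle around 0.13/0.14; the file paths, Kafka address, and registry URL are illustrative assumptions):

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
  --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.14.1 \
  --props file:///opt/configs/kafka-source.properties \
  --config-folder file:///opt/configs/tables \
  --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
  --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-ordering-field _event_lsn \
  --base-path-prefix file:///tmp/hudi \
  --table-type COPY_ON_WRITE \
  --op UPSERT

# kafka-source.properties (shared): lists the tables to ingest and where
# each per-table config file lives.
hoodie.deltastreamer.ingestion.tablesToBeIngested=public.sales_part_0,public.sales_part_1,public.sales_part_2,public.sales_part_3
hoodie.deltastreamer.ingestion.public.sales_part_0.configFile=file:///opt/configs/tables/sales_part_0.properties
bootstrap.servers=kafka:9092

# sales_part_0.properties (one per topic): record key matches the Postgres
# primary key; ordering uses the LSN field added by PostgresDebeziumSource.
hoodie.deltastreamer.source.kafka.topic=hive.public.sales_part_0
hoodie.datasource.write.recordkey.field=salesid,invoiceid
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.precombine.field=_event_lsn
hoodie.deltastreamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/hive.public.sales_part_0-value/versions/latest

The trade-off is the one raised below: each topic lands in its own target path, i.e. four separate Hudi tables rather than one.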

soumilshah1995 commented 3 months ago

Wouldn't the MultiTable streamer push each topic to a different target path? What if you want to build one table where each of hive.public.sales_part_0, hive.public.sales_part_1, hive.public.sales_part_2, and hive.public.sales_part_3 becomes a partition, similar to Notion's architecture?

https://www.notion.so/blog/building-and-scaling-notions-data-lake

How do we achieve something like this?
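
One way to get there, sketched under the same assumptions as the publication aside above (illustrative paths and names, not something confirmed in this thread): route all partition changes to the single root topic via publish_via_partition_root, then run one plain HoodieDeltaStreamer job against that topic. The Hudi table is then partitioned by a data column you pick (orderdate here), since Postgres hash partitions have no meaningful directory equivalent:

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.14.1 \
  --props file:///opt/configs/sales.properties \
  --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
  --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-ordering-field _event_lsn \
  --target-base-path file:///tmp/hudi/sales \
  --target-table sales \
  --table-type COPY_ON_WRITE \
  --op UPSERT

# sales.properties: one topic in, one Hudi table out, partitioned by a
# data column rather than by the Postgres hash partitions.
hoodie.deltastreamer.source.kafka.topic=hive.public.sales
hoodie.datasource.write.recordkey.field=salesid,invoiceid
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=orderdate
hoodie.datasource.write.precombine.field=_event_lsn
hoodie.deltastreamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/hive.public.sales-value/versions/latest
bootstrap.servers=kafka:9092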

soumilshah1995 commented 2 months ago

Any pointers or suggestions?

soumilshah1995 commented 2 months ago

I got this question answered and will be trying it out.

soumilshah1995 commented 1 month ago

I don't think the approach mentioned on the community sync will solve the issue. I will attend the next sync and ask my follow-up questions.