ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0
3.81k stars 220 forks source link

Feature merge sink #705

Open zhuliquan opened 3 months ago

zhuliquan commented 3 months ago

fix issue #700 merge sinks with same table, it's helpful to save file handles or kafka connectors. Besides, it may fix bug two pipeline write data to same file. For example

CREATE TABLE cars (
        timestamp TIMESTAMP,
        driver_id BIGINT,
        event_type TEXT,
        location TEXT
) WITH (
        connector = 'single_file',
        path = 'cars.json',
        format = 'json',
        type = 'source'
);

CREATE TABLE cars_output (
        timestamp TIMESTAMP,
        driver_id BIGINT,
        event_type TEXT,
        location TEXT
) WITH (
        connector = 'single_file',
        path = 'cars_output.json',
        format = 'json',
        type = 'sink'
);
INSERT INTO cars_output SELECT * FROM cars WHERE driver_id = 100 AND event_type = 'pickup';
INSERT INTO cars_output SELECT * FROM cars WHERE driver_id = 101 AND event_type = 'dropoff';

if we don't merge sink node of two query, results of query1 will be overwrited by results of query2. Because there two sink node infer two file handle for same file (cars_output.json).