StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

Out of order INSERT INTO ORDER BY with primary key and unique key tables #47072

Open ss892714028 opened 3 months ago

ss892714028 commented 3 months ago

Steps to reproduce the behavior (Required)

You can also reproduce this with a unique key table.

-- Create the source table
CREATE TABLE db_stage_DEVELOP.source_table (
    hash BIGINT,
    id INT,
    timestamp DATETIME
)
PRIMARY KEY (hash)
DISTRIBUTED BY HASH(hash);

-- Create the destination table
CREATE TABLE db_stage_DEVELOP.destination_table (
    id INT,
    timestamp DATETIME,
    hash BIGINT
) 
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

-- Insert 5 rows into the source table
INSERT INTO db_stage_DEVELOP.source_table (hash, id, timestamp) VALUES
(1234567890, 1, '2024-06-14 10:00:00'),
(2234567890, 1, '2024-06-15 11:00:00'),
(3234567890, 1, '2024-06-16 12:00:00'),
(4234567890, 1, '2024-06-17 13:00:00'),
(5234567890, 1, '2024-06-18 14:00:00');

-- Verify the data in the source table
SELECT * FROM db_stage_DEVELOP.source_table;

-- Select from the source table ordered by timestamp ascending and insert into the destination table
INSERT INTO db_stage_DEVELOP.destination_table (id, timestamp, hash)
SELECT id, timestamp, hash
FROM db_stage_DEVELOP.source_table
ORDER BY timestamp ASC;

-- Verify the data in the destination table
SELECT * FROM db_stage_DEVELOP.destination_table;

Expected behavior (Required)

If you make the source table a single tablet and also add parentheses around the SELECT statement inside the INSERT, the issue goes away.

-- Create the source table
CREATE TABLE db_stage_DEVELOP.source_table (
    hash BIGINT,
    id INT,
    timestamp DATETIME
)
PRIMARY KEY (hash)
DISTRIBUTED BY HASH(hash) BUCKETS 1;

-- Create the destination table
CREATE TABLE db_stage_DEVELOP.destination_table (
    id INT,
    timestamp DATETIME,
    hash BIGINT
) 
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

-- Insert 5 rows into the source table
INSERT INTO db_stage_DEVELOP.source_table (hash, id, timestamp) VALUES
(1234567890, 1, '2024-06-14 10:00:00'),
(2234567890, 1, '2024-06-15 11:00:00'),
(3234567890, 1, '2024-06-16 12:00:00'),
(4234567890, 1, '2024-06-17 13:00:00'),
(5234567890, 1, '2024-06-18 14:00:00');

-- Verify the data in the source table
SELECT * FROM db_stage_DEVELOP.source_table;

-- Select from the source table ordered by timestamp ascending and insert into the destination table
INSERT INTO db_stage_DEVELOP.destination_table (id, timestamp, hash)
(SELECT id, timestamp, hash
 FROM db_stage_DEVELOP.source_table
 ORDER BY timestamp ASC);

-- Verify the data in the destination table
SELECT * FROM db_stage_DEVELOP.destination_table;

Real behavior (Required)

SELECT * FROM db_stage_DEVELOP.destination_table;

The destination table contains a different result every single time.

StarRocks version (Required)

3.2.8 (all-in-one Ubuntu Docker image)

luohaha commented 3 months ago

It's not a bug. The INSERT runs with multiple concurrent pipelines writing into db_stage_DEVELOP.destination_table, and the last arrival wins. We cannot guarantee the arrival order of rows with the same key in a multi-concurrency scenario, unless you use set pipeline_dop = 1 to make sure only one pipeline handles the insert.
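The suggestion above can be sketched as follows, reusing the table names from the reproduction; this assumes the pipeline_dop session variable behaves as described in the comment (forcing a single pipeline driver, at the cost of insert parallelism):

```sql
-- Sketch: force single-threaded insert execution so rows arrive
-- in the order produced by the SELECT (per the comment above).
SET pipeline_dop = 1;

INSERT INTO db_stage_DEVELOP.destination_table (id, timestamp, hash)
SELECT id, timestamp, hash
FROM db_stage_DEVELOP.source_table
ORDER BY timestamp ASC;
```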

ss892714028 commented 3 months ago

@luohaha Honestly, I think this is something StarRocks' planner can handle internally. Maybe something to consider.

jaogoy commented 3 months ago

@ss892714028 So, what you need is similar to merge_condition in https://docs.starrocks.io/docs/3.3/loading/Load_to_Primary_Key_tables/#conditional-updates, right?

jaogoy commented 3 weeks ago

@ss892714028 Have you solved your requirement?

JasonDung commented 5 days ago

Could the merge_condition attribute be added as a table property? That way, following the example above, specifying merge_condition='timestamp' would avoid the out-of-order problem when doing ETL with SQL. In the official documentation I only see merge_condition being added when loading data; for SQL-based ETL I don't see how to guarantee the order. Looking forward to a reply, thanks.
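Until a table-level merge_condition exists, one possible workaround for SQL ETL (a sketch, not an official feature) is to make the result independent of arrival order by deduplicating inside the SELECT itself, keeping only the newest row per primary key:

```sql
-- Keep only the latest row per id before upserting, so the final
-- state no longer depends on which concurrent writer arrives last.
INSERT INTO db_stage_DEVELOP.destination_table (id, timestamp, hash)
SELECT id, timestamp, hash
FROM (
    SELECT id, timestamp, hash,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
    FROM db_stage_DEVELOP.source_table
) t
WHERE rn = 1;
```

This only guards a single INSERT statement; across repeated ETL runs it still relies on later runs carrying newer timestamps, which is the gap merge_condition is meant to close.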