StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.65k stars 1.75k forks source link

关于SQL的查询条件进行常量替换下放到JOIN连接条件进行查询性能优化 #48360

Closed jinkun2014 closed 1 week ago

jinkun2014 commented 1 month ago

Feature request

Is your feature request related to a problem? Please describe.

img_v3_02cq_fe293a7a-65d0-4125-8542-85ca490cec1g

图中(EXPLAIN的结果,仅在这个EXPLAIN结果可靠且代表真实的执行过程)可以看出在join中,t1.UserId = t0.UserId, t2.UserId = t1.UserId,理论上可以推断出 t2.UserId = t0.UserId, 然后现有的SQL优化器并没有这么去推断,导致 t2 几乎扫描了所有的分区,而手动改写SQL之后, t2.UserId=t0.UserId, 只扫描了一个分区,如果能自动实现类似的推断进行查询条件的常量替换,那么对于类似的SQL查询将有很大的性能提升

Describe the solution you'd like

Describe alternatives you've considered

Additional context

chenminghua8 commented 1 month ago

It looks like a good optimization point, can someone assign this task to me?

chenminghua8 commented 1 month ago

Had fix by https://github.com/StarRocks/starrocks/pull/43346

jinkun2014 commented 1 month ago

@chenminghua8 大佬,这个已经实现了吗,为什么我这个最新版本3.3.0版本的执行计划没看到谓词下推呢?

LiShuMing commented 1 month ago

@jinkun2014 This bugfix should be contained in 3.3.0, maybe there is another bug. Can you give an example to reproduce this?

jinkun2014 commented 1 month ago

@LiShuMing example as :

-- test.t_amadb_product definition
DROP table t_amadb_product
CREATE TABLE `t_amadb_product` (
  `Id` bigint(20) NOT NULL COMMENT "主键Id",
  `UserId` int(11) NOT NULL COMMENT "用户主键Id",
  `AccountId` int(11) NOT NULL COMMENT "卖家账号Id",
  `Site` varchar(65533) NOT NULL COMMENT "站点",
  `Sku` varchar(65533) NULL COMMENT "卖家自己定义的商品标识"
) ENGINE=OLAP 
PRIMARY KEY(`Id`, `UserId`, `AccountId`, `Site`)
COMMENT "产品销量、库存、费用表"
PARTITION BY RANGE(`UserId`)
(PARTITION p1 VALUES [("-2147483648"), ("10")),
PARTITION p2 VALUES [("10"), ("20")),
PARTITION p3 VALUES [("20"), ("30")),
PARTITION p4 VALUES [("30"),("2147483647")))
DISTRIBUTED BY HASH(`Id`) BUCKETS 6 
ORDER BY(`UserId`, `AccountId`, `Site`, `Sku`)
PROPERTIES (
"compression" = "LZ4",
"datacache.enable" = "true",
"datacache.partition_duration" = "1 months",
"enable_async_write_back" = "false",
"enable_persistent_index" = "true",
"persistent_index_type" = "LOCAL",
"replication_num" = "1",
"storage_volume" = "builtin_storage_volume"
);

DROP table t_amzdb_orders_item_analysis_info
CREATE TABLE `t_amzdb_orders_item_analysis_info` (
  `Id` largeint(40) NOT NULL COMMENT "",
  `BelongDay` date NOT NULL COMMENT "",
  `UserId` int(11) NOT NULL COMMENT "",
  `AccountId` int(11) NOT NULL COMMENT "",
  `Site` varchar(65533) NOT NULL COMMENT "",
  `AmazonOrderId` varchar(65533) NULL COMMENT "",
  `Sku` varchar(65533) NULL COMMENT ""
) ENGINE=OLAP 
PRIMARY KEY(`Id`, `BelongDay`, `UserId`, `AccountId`, `Site`)
COMMENT "OLAP"
PARTITION BY date_trunc('month', BelongDay)
DISTRIBUTED BY HASH(`Id`)
ORDER BY(`UserId`, `AccountId`, `Site`, `BelongDay`)
PROPERTIES (
"compression" = "LZ4",
"datacache.enable" = "true",
"enable_async_write_back" = "false",
"enable_persistent_index" = "true",
"persistent_index_type" = "LOCAL",
"replication_num" = "1",
"storage_volume" = "builtin_storage_volume"
);

DROP table t_amzdb_orders_update_log
CREATE TABLE `t_amzdb_orders_update_log` (
  `Id` bigint(20) NOT NULL COMMENT "主键Id",
  `BelongDay` date NOT NULL COMMENT "归属日期",
  `UserId` int(11) NOT NULL COMMENT "用户主键Id",
  `AccountId` int(11) NOT NULL COMMENT "账号Id",
  `Site` varchar(65533) NOT NULL COMMENT "站点",
  `AmazonOrderId` varchar(65533) NOT NULL COMMENT "订单号"
) ENGINE=OLAP 
PRIMARY KEY(`Id`, `BelongDay`, `UserId`, `AccountId`)
COMMENT "单个订单每日增量更新记录表"
PARTITION BY date_trunc('month', BelongDay)
DISTRIBUTED BY HASH(`Id`)
ORDER BY(`UserId`, `AccountId`,`Site`, `BelongDay`)
PROPERTIES (
"compression" = "LZ4",
"datacache.enable" = "true",
"enable_async_write_back" = "false",
"enable_persistent_index" = "true",
"persistent_index_type" = "LOCAL",
"replication_num" = "1",
"storage_volume" = "builtin_storage_volume"
);

INSERT INTO test.t_amzdb_orders_update_log (Id, UserId, AccountId, Site, BelongDay, AmazonOrderId) VALUES(6751344676, 5, 11400139, 'us', '2024-05-29', '114-1100145-0637010');
INSERT INTO test.t_amzdb_orders_update_log (Id, UserId, AccountId, Site, BelongDay, AmazonOrderId) VALUES(6515727486, 15, 11400139, 'us', '2024-06-04', '111-5952273-4477057');

INSERT INTO test.t_amzdb_orders_item_analysis_info (Id, UserId, AccountId, Site, BelongDay, AmazonOrderId, Sku) VALUES(6751344676, 5, 11400139, 'us', '2024-05-29', '114-1100145-0637010','S1');
INSERT INTO test.t_amzdb_orders_item_analysis_info (Id, UserId, AccountId, Site, BelongDay, AmazonOrderId, Sku) VALUES(6515727486, 15, 11400139, 'us', '2024-06-04', '111-5952273-4477057','S2');

INSERT INTO test.t_amadb_product (Id, UserId, AccountId, Site, Sku) VALUES(6751344676, 5, 11400139, 'us','S1');
INSERT INTO test.t_amadb_product (Id, UserId, AccountId, Site, Sku) VALUES(6515727486, 15, 11400139, 'us','S2');

result1:

EXPLAIN
SELECT
    t0.Id AS LastId,
    t1.AmazonOrderId,
    t2.Sku
FROM
    t_amzdb_orders_update_log t0
INNER JOIN t_amzdb_orders_item_analysis_info t1 ON
    t1.UserId = t0.UserId
    AND t1.AccountId = t0.AccountId
    AND t1.Site = t0.Site
    AND t1.BelongDay = t0.BelongDay
    AND t1.AmazonOrderId = t0.AmazonOrderId
INNER JOIN t_amadb_product t2 ON
    t1.UserId = t2.UserId
    AND t1.Sku = t2.Sku
    AND t1.AccountId = t2.AccountId
    AND t1.Site = t2.Site
WHERE
    t0.UserId = 5
    AND t0.AccountId = 11400139
    AND t0.Site = 'us'
ORDER BY
    t0.Id ASC

image

result2:

EXPLAIN
SELECT
    t0.Id AS LastId,
    t1.AmazonOrderId,
    t2.Sku
FROM
    t_amzdb_orders_update_log t0
INNER JOIN t_amzdb_orders_item_analysis_info t1 ON
    t1.UserId = t0.UserId
    AND t1.AccountId = t0.AccountId
    AND t1.Site = t0.Site
    AND t1.BelongDay = t0.BelongDay
    AND t1.AmazonOrderId = t0.AmazonOrderId
INNER JOIN t_amadb_product t2 ON
    t0.UserId = t2.UserId
    AND t1.Sku = t2.Sku
    AND t0.AccountId = t2.AccountId
    AND t0.Site = t2.Site
WHERE
    t0.UserId = 5
    AND t0.AccountId = 11400139
    AND t0.Site = 'us'
ORDER BY
    t0.Id ASC

image

Because the simulated data has much fewer partitions than the real data, this example is for reference only. I believe you have more abundant test cases.

packy92 commented 1 week ago

image

In the main branch, the predicate had been pushed down. This pr is merged after branch-3.3 created, so the predicate cannot be pushed down in version-3.3.