duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License
23.2k stars 1.85k forks source link

Opitimizer bug for push down a comparison filter #10351

Closed bakey closed 2 months ago

bakey commented 8 months ago

What happens?

We create a table like this:

CREATE TABLE s111(location string, id integer, _default_time_index_ TIMESTAMP)

and use this sql to query

select * from s111 where location='cd' and (_default_time_index_ < CAST(now() AS TIMESTAMP) - INTERVAL '1h');

We expect we can generate a physical plan like this

┌─────────── ───┐
│         PROJECTION                         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          location                                 │
│             id                                        │
│    _default_time_index_               │
└─────────────┬ ┘                             
┌─────────────┴  ┐
│         SEQ_SCAN                             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            s111                                      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          location                                 │
│    _default_time_index_                │
│             id                                        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  Filters: location=cd AND             │
│    location IS NOT NULL               │
│ _default_time_index_<2024       │
│  -01-26 05:02:21.235 AND          │
│ _default_time_index_ ...              │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1                                     │
└──────────────┘ 

Thas was say, the filter can be pushed down to the node of seq_scan.That is the most optimize plan. But the latest code output physical plan as below. There was an extra filter node, data that read from seq_scan had to go through one more node, which is not so optimization:

┌──────────────┐
│         PROJECTION                       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          location                                │
│             id                                      │
│    _default_time_index_             │
└─────────────┬┘                             
┌──────────────┐
│           FILTER                                  │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  (_default_time_index_ <              │
│(CAST(now() AS TIMESTA...          │
│   '01:00:00'::INTERVAL))              │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1                                     │
└─────────────┬ ┘                             
┌──────────────┐
│         SEQ_SCAN                            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            s111                                     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          location                                 │
│    _default_time_index_               │
│             id                                        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  Filters: location=cd AND             │
│    location IS NOT NULL               │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1                                    │
└──────────────┘ 

I found the root cause was that there exist a miscall function in the src/optimizer/filter_combiner.cpp.Which lead to the optimizer can not push the filter to the lowest node. More details can be refered the code change

To Reproduce

int main() {
    DuckDB db(nullptr);
    Connection con(db);
    con.Query("CREATE TABLE s111(location string, id integer, _default_time_index_ TIMESTAMP)");
    con.Query("INSERT INTO s111(location, id, _default_time_index_) VALUES ('cd', 10, '2023-10-11 20:00:00')");
    con.Query("INSERT INTO s111(location, id, _default_time_index_) VALUES ('cd', 20, '2023-10-11 21:00:00')");
    con.Query("INSERT INTO s111(location, id, _default_time_index_) VALUES ('cd', 30, '2024-10-11 21:00:00')");
    auto prepared = con.Prepare("select * from s111 where location='cd' and (_default_time_index_ < CAST(now() AS TIMESTAMP) - INTERVAL '1h');");
    cout << prepared->data->plan->ToString() << endl;

(edit:) SQL only:

CREATE TABLE s111(location string, id integer, _default_time_index_ TIMESTAMP);
INSERT INTO s111(location, id, _default_time_index_) VALUES ('cd', 10, '2023-10-11 20:00:00');
INSERT INTO s111(location, id, _default_time_index_) VALUES ('cd', 20, '2023-10-11 21:00:00');
INSERT INTO s111(location, id, _default_time_index_) VALUES ('cd', 30, '2024-10-11 21:00:00');
select * from s111 where location='cd' and (_default_time_index_ < CAST(now() AS TIMESTAMP) - INTERVAL '1h');

OS:

Ubuntu 22.04

DuckDB Version:

v0.9.1

DuckDB Client:

C++

Full Name:

YunhongPan

Affiliation:

Datalayers

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

doki23 commented 8 months ago

Could expression _default_time_index_ < CAST(now() AS TIMESTAMP) - INTERVAL '1h' really be pushed down to the scan node?

bakey commented 8 months ago

hi, @doki23 . Thanks for your reply. I think it is better to push all the filter to the scan node. Because we can use more information to help me skip some unnesscessary data reading.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stale for 30 days with no activity.