ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
37.11k stars 6.85k forks source link

Optimize boolean conditions using CNF / DNF #11749

Open filimonov opened 4 years ago

filimonov commented 4 years ago

The problem

The table a more than hundred cols but only 3 of them are indexed, let call them PARTITION A ORDER BY B, C, D

The query use A B and D and another cols which is not indexed E. And it's look like

select * from T where
A = x and (
    (B = 1 and D = 2 and E = 3) or
    (B = 1 and D = 3 and E = 4) /* .... repeated a lot of times */
)

While the numbers of condition is small the query is quite fast (150ms), with 50 conditions the query start to take 10s and more.

Strangely if the wheres are re ordered like so :

select * from T where
A = x and (
    B = 1 and (
       D = 2 and (E = 3 or E=4 or E=5 /*.... repeated a lot of times */ ) OR
       D = 3 and (E = 3 or E=4 or E=5 /*.... repeated a lot of times */  ) OR
       /*.... repeated a lot of times */
   )
)

the query is sub seconds. From the boolean point of view the filter is the same but it seems clickhouse like that a way more than the first one.

Proposed solution

Introduce optimizer(s) to convert logical expression to conjunctive normal form (CNF) or disjunctive normal form (DNF)

CNF is smth like

(c1 = 2 OR c1 = 5) AND (c2 < 5 OR c2 > 20) AND (c3 = 'abc' OR c3 = 'efg')

DNF is smth like

(c1 = 2 AND c2 = 5) OR (c2 > 5 AND c2 < 10) OR c3 = 'abc'

Any logical expression has CNF / DNF form.

In some cases, CNF or DNF may be more preferable and both have more chances to use index than non-normalized conditions.

Extra options

During those optimizations, it's may be possible to detect always true / always false sub-conditions and remove them.

Some references in context of DBs

filimonov commented 4 years ago

Loosely relates #658

alexey-milovidov commented 3 years ago

BTW, @nikvas0 is interested in using constraint solvers for this task or similar.

CurtizJ commented 2 years ago

Convertation to CNF is implemented in #18787.

UnamedRus commented 2 years ago
EXPLAIN SYNTAX select * from (SELECT number AS A, number AS B, number AS C, number AS D, number AS E FROM numbers(10) ) where
                               A = 1 and (
                                   B = 1 and (
                                      D = 2 and (E = 3 or E=4 or E=5 /*.... repeated a lot of times */ ) OR
                                      D = 3 and (E = 3 or E=4 or E=5 /*.... repeated a lot of times */  ))
                               ) SETTINGS convert_query_to_cnf = 1;
┌─explain─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ SELECT                                                                                                                                      │
│     A,                                                                                                                                      │
│     B,                                                                                                                                      │
│     C,                                                                                                                                      │
│     D,                                                                                                                                      │
│     E                                                                                                                                       │
│ FROM                                                                                                                                        │
│ (                                                                                                                                           │
│     SELECT                                                                                                                                  │
│         number AS A,                                                                                                                        │
│         number AS B,                                                                                                                        │
│         number AS C,                                                                                                                        │
│         number AS D,                                                                                                                        │
│         number AS E                                                                                                                         │
│     FROM numbers(10)                                                                                                                        │
│     WHERE ((D = 2) OR (D = 3)) AND ((D = 2) OR (E IN (3, 4, 5))) AND ((D = 3) OR (E IN (3, 4, 5))) AND (B = 1) AND (E IN (3, 4, 5))         │
│ )                                                                                                                                           │
│ WHERE (E IN (3, 4, 5)) AND ((E IN (3, 4, 5)) OR (D = 2)) AND ((E IN (3, 4, 5)) OR (D = 3)) AND ((D = 2) OR (D = 3)) AND (A = 1) AND (B = 1) │
│ SETTINGS convert_query_to_cnf = 1                                                                                                           │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
WHERE
    (E IN (3, 4, 5)) 
AND ((E IN (3, 4, 5)) OR (D = 2)) 
AND ((E IN (3, 4, 5)) OR (D = 3)) 
AND ((D = 2) OR (D = 3)) 
AND (A = 1) AND (B = 1)