enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0
7.39k stars 323 forks source link

Proposal: use Common Table Expressions to simplify generated queries #7983

Open radeusgd opened 1 year ago

radeusgd commented 1 year ago

Currently we are combining queries by including the 'ingredients' of a query as subqueries. That works, but in some cases may cause duplication - making the queries larger than necessary - thus harder to read for humans and optimize for DBs.

Motivating example:

t0 = connection.query "T0" # read some table
t1 = t0.set "X" "[Y]+100" . filter_by_expression "[Z] == 20 AND [W] LIKE '%A%'" # perform non trivial transformations
t2 = connection.query "T2"
t3 = connection.query "T3"

# use t1 twice in some queries
t12 = t1.join t2 on="X"
t13 = t1.join t3 on="Y"

# and then merge the result
result = t12.union t13

With the current design this will lead to a query like (simplified):

SELECT ... FROM (SELECT *, Y+100 AS X FROM T0 WHERE Z = 20 AND W LIKE ?) AS T1 JOIN (SELECT * FROM T2) AS T2 ON T1.X = T2.X
UNION ALL
SELECT ... FROM (SELECT *, Y+100 AS X FROM T0 WHERE Z = 20 AND W LIKE ?) AS T1 JOIN (SELECT * FROM T3) AS T3 ON T1.Y = T3.Y

As we can see the expression (which in some cases can be a much larger and more complicated query, possibly itself consisting of many subqueries) SELECT *, Y+100 AS X FROM T0 WHERE Z == 20 AND W LIKE ? AS T1 is duplicated in the subqueries.

This is undesirable. Of course the database optimizer may detect that the subqueries are the same, but it is less likely then if they just referred to the same CTE. Moreover, the query itself is larger, thus making it just more complex and also harder to read for humans. In fact, worst case scenario if we do something like:

t0 = connection.query "T0"
t1 = t0.union t0
t2 = t1.union t1
...
tN = t(N-1).union t(N-1)

the query size will grow exponentially (as will the result size). Whereas, if we optimize the subqueries using CTEs, the query will only grow linearly.


A proposed solution would be to modify the SQL generator to prefer CTEs over subqueries and ensure that common subtrees are compiled to a single CTE. Then the first example from above could be compiled to something like:

WITH T1 AS (SELECT *, Y+100 AS X FROM T0 WHERE Z == 20 AND W LIKE ?),
     T12 AS (SELECT ... FROM T1 JOIN (SELECT * FROM T2) AS T2 ON T1.X = T2.X),
     T13 AS (SELECT ... FROM T1 JOIN (SELECT * FROM T3) AS T3 ON T1.Y = T3.Y)
SELECT * FROM T12
UNION ALL
SELECT * FROM T13

Which is much more structured and easier to read - and includes the (possibly more complex) expression for T1 only once in the query.

radeusgd commented 1 year ago

Comparing query plans for the example query:

GregoryTravis commented 2 months ago

33b2ff7105b6ce95be3ba87ca805304de354496f Adds manual "let-binding"-style CTEs for manually simplifying generated SQL, and uses it to greatly reduce the size of the SQL generated for non-builtin round for backends that support nested WITH...AS queries. This is a good temporary measure and can be used freely whenever it improves query size, but it is not a sufficient solution.