Closed fjetter closed 6 months ago
Regardless of whether this is correct or not, this query breaks the optimizer; see https://github.com/dask/dask-expr/issues/965
So, this rewrite does not actually work because the initial join just blows up without the `p_brand` and `p_container` filters. We're currently checking whether pushing those filters down is actually viable. The external filters may actually also apply to the subselect in the WHERE clause. If that's the case, our dataframe code is actually invalid and we can move forward with a more streamlined implementation.
@hendrikmakait is checking the SQL specs right now
TL;DR: I've double-checked and pushing the filters into the subquery is correct.
More context: Logically, you can think of the correlated subquery as a macro that's executed once per row of the outer query. Since we are only interested in the results of the filtered data, it suffices to execute them once per row that passes the filter predicates. This can be rewritten as a join with the filtered table. (See https://duckdb.org/2023/05/26/correlated-subqueries-in-sql.html for a nice introduction.)
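To make the decorrelation argument concrete, here is a toy pandas sketch (table and column names follow TPC-H; the data and the `Brand#23` filter value are made up for illustration). It compares the "macro executed once per outer row" reading of the correlated subquery against the join-with-the-filtered-table rewrite:

```python
import pandas as pd

# Toy stand-ins for TPC-H part / lineitem (illustrative data only).
part = pd.DataFrame({
    "p_partkey": [1, 2, 3],
    "p_brand": ["Brand#23", "Brand#23", "Brand#55"],
})
lineitem = pd.DataFrame({
    "l_partkey": [1, 1, 2, 2, 3, 3],
    "l_quantity": [10, 20, 5, 7, 100, 200],
})

# "Correlated subquery as a macro": evaluate
# avg(l_quantity) where l_partkey = p_partkey once per outer row
# that survives the filter predicates.
filtered = part[part["p_brand"] == "Brand#23"]
naive = {
    pk: lineitem.loc[lineitem["l_partkey"] == pk, "l_quantity"].mean()
    for pk in filtered["p_partkey"]
}

# Decorrelated: aggregate once per key, then join with the *filtered*
# outer table. Keys that don't pass the filter never matter.
avg_qty = lineitem.groupby("l_partkey")["l_quantity"].mean()
joined = filtered.merge(
    avg_qty.rename("avg_qty"), left_on="p_partkey", right_index=True
)
decorrelated = dict(zip(joined["p_partkey"], joined["avg_qty"]))

assert naive == decorrelated
```

Both paths only ever consult lineitem rows whose part survives the filter, which is exactly why pushing the filters into the subquery is safe.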
> If that's the case, our dataframe code is actually invalid and we can move forward with a more streamlined implementation.
Technically, it's not invalid but just less efficient.
I'm not sure if we can detect this automatically. I think this rewrite is only valid assuming `l_partkey` is a primary key, isn't it? This is knowledge we just don't have in dask 🤔
Yeah you are correct, sorry, I got confused
> I'm not sure if we can detect this automatically. I think this rewrite is only valid assuming `l_partkey` is a primary key, isn't it? This is knowledge we just don't have in dask 🤔
This is less about PK-FK relations and more about the subquery correlation (which we definitely don't track right now).
I was reviewing the implementation for Q17 and was a little surprised. This is actually a pretty heavy query that, in its current form, requires us to shuffle lineitem multiple times.
However, if I take the SQL in the docstring literally, I should only be required to compute this on a filtered lineitem (namely, the one after the join, i.e. where `l_partkey = p_partkey`). There is a caveat with this... it looks like the optimizer is either never converging or just taking ages. Even for a scale-1 dataset it has been simplifying forever.