Closed fjetter closed 6 months ago
Regardless of whether this is correct or not, this query breaks the optimizer; see https://github.com/dask/dask-expr/issues/965
So, this rewrite does not actually work because the initial join just blows up without the `p_brand` and `p_container` filters. We're currently checking whether pushing those filters down is actually viable. The external filters may actually also apply to the subselect in the WHERE clause. If that's the case, our dataframe code is actually invalid and we can move forward with a more streamlined implementation.
@hendrikmakait is checking the SQL specs right now
TL;DR: I've double-checked and pushing the filters into the subquery is correct.
More context: Logically, you can think of the correlated subquery as a macro that's executed once per row of the outer query. Since we are only interested in the results of the filtered data, it suffices to execute them once per row that passes the filter predicates. This can be rewritten as a join with the filtered table. (See https://duckdb.org/2023/05/26/correlated-subqueries-in-sql.html for a nice introduction.)
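To make the decorrelation argument concrete, here is a toy pandas sketch (table and column names follow TPC-H; the data and the `Brand#23` filter value are made up for illustration). It compares the "macro executed once per outer row" reading of the correlated subquery against the join-with-the-filtered-table rewrite:

```python
import pandas as pd

# Toy stand-ins for TPC-H part / lineitem (illustrative data only).
part = pd.DataFrame({
    "p_partkey": [1, 2, 3],
    "p_brand": ["Brand#23", "Brand#23", "Brand#55"],
})
lineitem = pd.DataFrame({
    "l_partkey": [1, 1, 2, 2, 3, 3],
    "l_quantity": [10, 20, 5, 7, 100, 200],
})

# "Correlated subquery as a macro": evaluate
# avg(l_quantity) where l_partkey = p_partkey once per outer row
# that survives the filter predicates.
filtered = part[part["p_brand"] == "Brand#23"]
naive = {
    pk: lineitem.loc[lineitem["l_partkey"] == pk, "l_quantity"].mean()
    for pk in filtered["p_partkey"]
}

# Decorrelated: aggregate once per key, then join with the *filtered*
# outer table. Keys that don't pass the filter never matter.
avg_qty = lineitem.groupby("l_partkey")["l_quantity"].mean()
joined = filtered.merge(
    avg_qty.rename("avg_qty"), left_on="p_partkey", right_index=True
)
decorrelated = dict(zip(joined["p_partkey"], joined["avg_qty"]))

assert naive == decorrelated
```

Both paths only ever consult lineitem rows whose part survives the filter, which is exactly why pushing the filters into the subquery is safe.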
> If that's the case, our dataframe code is actually invalid and we can move forward with a more streamlined implementation.
Technically, it's not invalid but just less efficient.
I'm not sure if we can detect this automatically. I think this rewrite is only valid assuming `l_partkey` is a primary key, isn't it? This is knowledge we just don't have in dask 🤔
Yeah you are correct, sorry, I got confused
> I'm not sure if we can detect this automatically. I think this rewrite is only valid assuming `l_partkey` is a primary key, isn't it? This is knowledge we just don't have in dask 🤔
This is less about PK-FK relations and more about the subquery correlation (which we definitely don't track right now).
I was reviewing the implementation for Q17 and was a little surprised. This is actually a pretty heavy query that, in its current form, requires us to shuffle lineitem multiple times.
However, if I take the SQL in the docstring literally, I should only be required to compute this on a filtered lineitem (namely, the one after the join, i.e. where `l_partkey = p_partkey`). There is a caveat with this... it looks like the optimizer is either never converging or just taking ages. Even for a scale-1 dataset it has been simplifying forever.