dask / dask-expr

BSD 3-Clause "New" or "Revised" License
79 stars 18 forks source link

Optimize join ordering #1065

Open hendrikmakait opened 1 month ago

hendrikmakait commented 1 month ago

Problem

Currently, we execute joins in the order they were given by the user. If the user does not pay attention, this can cause a significant performance penalty due to an unnecessary explosion of intermediate results.

Solution

We should automatically optimize the join ordering. Ideally, we have cardinality estimates for this from Parquet files or a metadata store, but we should also try to optimize join ordering without meaningful statistics. Possible approaches here include optimization based on partition counts or equivalence sets (https://blobs.duckdb.org/papers/tom-ebergen-msc-thesis-join-order-optimization-with-almost-no-statistics.pdf).

Previous work

We have already experimented with this, but never gotten to a working solution that we could merge (e.g., #809).