coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

[TPC-H] Adopt Spark join ordering in query 18 #1406

Closed hendrikmakait closed 7 months ago

hendrikmakait commented 7 months ago

The join ordering we chose for query 18 is suboptimal and causes workers to run out of memory (OOM). This PR reimplements the query using Spark's join ordering. Note that the Polars join ordering (https://github.com/coiled/benchmarks/commit/8eb24d068401269b665c485b27ee53d8b748da47) appears to be slightly faster (285s vs. 318s), but I do not want to optimize too much by hand. Picking Spark's join ordering gives the closest apples-to-apples comparison against Dask's main competitor.
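To illustrate why join ordering matters for memory here, below is a minimal sketch in plain Python on toy data. It is not the code in this PR and not a reproduction of Spark's or Polars' actual plans. It assumes the standard TPC-H query 18 shape (orders whose summed `l_quantity` exceeds 300), and the function names `q18_filter_first` and `q18_join_first` as well as the sample rows are hypothetical. Both orderings return the same rows, but the second materializes one row per line item before aggregating, which is the kind of large intermediate that can exhaust worker memory at scale.

```python
from collections import defaultdict

# Toy stand-ins for the TPC-H tables (hypothetical data, not benchmark input).
customer = [
    {"c_custkey": 1, "c_name": "alice"},
    {"c_custkey": 2, "c_name": "bob"},
]
orders = [
    {"o_orderkey": 10, "o_custkey": 1, "o_orderdate": "1995-01-01", "o_totalprice": 500.0},
    {"o_orderkey": 11, "o_custkey": 1, "o_orderdate": "1995-02-01", "o_totalprice": 100.0},
    {"o_orderkey": 12, "o_custkey": 2, "o_orderdate": "1995-03-01", "o_totalprice": 900.0},
]
lineitem = [
    {"l_orderkey": 10, "l_quantity": 200},
    {"l_orderkey": 10, "l_quantity": 150},
    {"l_orderkey": 11, "l_quantity": 5},
    {"l_orderkey": 12, "l_quantity": 300},
    {"l_orderkey": 12, "l_quantity": 50},
]


def _sort_and_limit(rows):
    # ORDER BY o_totalprice DESC, o_orderdate LIMIT 100
    return sorted(rows, key=lambda r: (-r[4], r[3]))[:100]


def q18_filter_first(customer, orders, lineitem, threshold=300):
    """Aggregate lineitem and filter first, then join the small result."""
    totals = defaultdict(int)
    for item in lineitem:
        totals[item["l_orderkey"]] += item["l_quantity"]
    large = {key: qty for key, qty in totals.items() if qty > threshold}

    # Assumes every order references an existing customer (inner join).
    names = {c["c_custkey"]: c["c_name"] for c in customer}
    rows = [
        (names[o["o_custkey"]], o["o_custkey"], o["o_orderkey"],
         o["o_orderdate"], o["o_totalprice"], large[o["o_orderkey"]])
        for o in orders
        if o["o_orderkey"] in large
    ]
    return _sort_and_limit(rows)


def q18_join_first(customer, orders, lineitem, threshold=300):
    """Join all three tables first, then aggregate and filter."""
    names = {c["c_custkey"]: c["c_name"] for c in customer}
    # Intermediate result has one row per line item.
    joined = [
        (names[o["o_custkey"]], o["o_custkey"], o["o_orderkey"],
         o["o_orderdate"], o["o_totalprice"], item["l_quantity"])
        for o in orders
        for item in lineitem
        if item["l_orderkey"] == o["o_orderkey"]
    ]
    totals = defaultdict(int)
    for *key, qty in joined:
        totals[tuple(key)] += qty
    rows = [key + (qty,) for key, qty in totals.items() if qty > threshold]
    return _sort_and_limit(rows)


if __name__ == "__main__":
    for row in q18_filter_first(customer, orders, lineitem):
        print(row)
```

On real TPC-H data, `lineitem` is by far the largest table, so reducing it to the few qualifying order keys before joining keeps the joined intermediate small. Which concrete ordering a given engine picks is up to its optimizer.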