coiled / benchmarks

[TPC-H] Generate expected results for SF > 1 #1443

Closed: hendrikmakait closed this issue 8 months ago

hendrikmakait commented 8 months ago

At SF==1, our TPC-H queries boil down to "overcomplicated pandas" since we don't really partition anything. At larger scale factors, several size-based optimizations kick in and significantly alter the query plans. As a result, a query may return correct results at SF==1 yet wrong results at SF>1 (see https://github.com/dask-contrib/dask-expr/pull/917#issuecomment-1983112150).

We should generate the expected results at larger scale factors to be able to run the correctness tests at larger scale.
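
A rough sketch of what such a check could look like, assuming expected results are stored as one CSV file per scale factor and query. The directory layout and the helper names `expected_path` and `verify_result` are hypothetical, not what the benchmarks repo actually uses:

```python
from pathlib import Path

import pandas as pd


def expected_path(root: Path, scale: int, query: int) -> Path:
    # Hypothetical layout: <root>/scale-<SF>/q<N>.csv
    return root / f"scale-{scale}" / f"q{query}.csv"


def verify_result(
    result: pd.DataFrame, root: Path, scale: int, query: int, rtol: float = 1e-2
) -> None:
    """Compare a computed query result against the stored expected answer.

    Raises AssertionError if the frames differ beyond the relative tolerance.
    """
    expected = pd.read_csv(expected_path(root, scale, query))
    # Drop whatever index the engine produced so only the values are compared.
    result = result.reset_index(drop=True)
    pd.testing.assert_frame_equal(result, expected, check_dtype=False, rtol=rtol)
```

The same test could then be parametrized over the scale factor, so the correctness suite covers the partitioned code paths and not only the SF==1 case.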

phofl commented 8 months ago

Yes, having results for scale > 1 makes sense.

I am cautiously optimistic that the result in the linked PR is correct, though: Polars also ends up with an empty result.

hendrikmakait commented 8 months ago

> Polars also ends up with an empty result

That really doesn't make me feel any better about our input data :D I'll investigate this a bit more.