Closed hendrikmakait closed 8 months ago
Yes, having results for scale > 1 makes sense.
I am cautiously optimistic that the result in the linked PR is correct, though: Polars also ends up with an empty result.
> Polars also ends up with an empty result
That really doesn't make me feel any better about our input data :D I'll investigate this a bit more.
At SF==1, our TPC-H queries boil down to "overcomplicated pandas" since we don't really partition anything. At larger scale factors, several different size-based optimizations kick in and significantly alter the query plans. As a result, results may be correct at SF==1, but may be wrong at SF>1 (see https://github.com/dask-contrib/dask-expr/pull/917#issuecomment-1983112150).
We should generate the expected results at larger scale factors to be able to run the correctness tests at larger scale.
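For illustration, here is a minimal sketch of what such a correctness check could look like once expected results exist for a larger scale factor. It is not the actual test harness: `normalize` and `check_against_expected` are hypothetical helper names, the expected answer is assumed to already be loaded as a pandas DataFrame, and sorting the rows before comparing is a simplification that ignores any ordering the query itself requires.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Sort rows by all columns and reset the index so row order is ignored."""
    if len(df.columns) == 0:
        return df.reset_index(drop=True)
    return df.sort_values(list(df.columns)).reset_index(drop=True)


def check_against_expected(result: pd.DataFrame, expected: pd.DataFrame) -> None:
    """Raise AssertionError if `result` differs from the expected answer.

    Hypothetical helper: dtypes are not compared and floats are compared
    with a relative tolerance, since different engines may round differently.
    """
    assert_frame_equal(
        normalize(result),
        normalize(expected),
        check_dtype=False,
        rtol=1e-5,
    )
```

In a real test, `result` would come from computing the query at the given scale factor and `expected` from the pre-generated answer for that same scale factor.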