coiled / benchmarks

[TPCH] Polars turn off streaming #1512

Closed fjetter closed 5 months ago

fjetter commented 6 months ago

We received feedback that the streaming API is not the recommended way of running Polars, and that we should instead be running it in the normal (default) mode.

cc @ritchie46

fjetter commented 6 months ago

To be clear about the impact here: we run our benchmarks on multiple hardware and data configurations, and with streaming disabled, polars cannot really run queries on data that is larger than memory. Of our configurations, this excludes everything except (local, 10) and (cloud, 100). (@ritchie46 please correct me if I'm saying anything dumb or if there is another mode/configuration we should consider.)

This is not a dramatic change from what we're doing right now, since we had already excluded polars from the other configurations because it didn't work well. We currently report numbers for some of the (local, 100) queries, but with this change we should stop reporting those. At least when I try to run them, my system can't handle it.
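For illustration, the exclusion described above could be sketched roughly like this. It is not the actual benchmark code: `CONFIGS`, `IN_MEMORY_OK`, and `configs_for` are hypothetical names, the matrix lists only the three (environment, scale) pairs mentioned in this thread, and the engine labels are assumed.

```python
# Hypothetical benchmark matrix: (environment, TPC-H scale factor) pairs.
# Only the pairs named in this thread are listed; the real matrix may be larger.
CONFIGS = [("local", 10), ("local", 100), ("cloud", 100)]

# Pairs that the in-memory engine is expected to handle, per this thread.
IN_MEMORY_OK = {("local", 10), ("cloud", 100)}


def configs_for(engine: str) -> list[tuple[str, int]]:
    """Return the configurations to run for a given (assumed) engine label."""
    if engine == "polars-default":
        # In-memory engine: skip configurations whose data is larger than memory.
        return [c for c in CONFIGS if c in IN_MEMORY_OK]
    # Any other label is left unrestricted in this sketch.
    return list(CONFIGS)


if __name__ == "__main__":
    print(configs_for("polars-default"))
```

Keeping the allow-list explicit, rather than catching out-of-memory failures at run time, makes it obvious in reports which configurations were skipped on purpose.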

ritchie46 commented 5 months ago

Yes, Polars is an in-memory query engine at this point in time. It isn't really designed to run on datasets that don't fit into memory (after filtering/optimization). The streaming engine is a beta feature that is being completely redesigned and is far from ready.

I think for benchmarking there should be a clear distinction between polars-default, polars-streaming, and later polars-gpu, as they are completely different engines.

In any case, the default engine should be run. Once we can handle larger datasets, we will also enable that in the default engine.