Closed fjetter closed 5 months ago
To be clear about the impact here: we run our benchmarks on multiple hardware and data configurations, and with streaming disabled, polars cannot really run queries on data that is larger than memory. Of our configurations, this excludes everything except (local, 10) and (cloud, 100). (@ritchie46 please correct me if I'm saying anything dumb or if there is another mode/configuration we should consider)
This is not a dramatic change from what we're doing right now, since we had already excluded polars from the other configurations because it didn't work well there. We currently report numbers for some of the (local, 100) queries, but with this change we should stop reporting those. At least when I try to run them, my system can't handle it.
Yes, Polars is an in-memory query engine at this point in time. It isn't really designed to run on datasets that don't fit into memory (after filtering/optimization). The streaming engine is a beta feature that is completely redesigned and far from ready.
I think for benchmarking there should be a clear distinction between `polars-default`, `polars-streaming`, and later `polars-gpu`, as they are completely different engines. In any case, `default` should be run. Once we can do larger datasets, we will also toggle into that in the default engine.
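One way to keep the engines apart in reported results is to key every timing by an explicit engine label. This is only a sketch: the labels are the ones suggested above, while `ENGINE_LABELS`, `group_by_engine`, and the result tuple layout are assumptions for illustration, not part of any existing harness.

```python
from collections import defaultdict

# Labels suggested in the comment above; treated as distinct engines.
ENGINE_LABELS = ("polars-default", "polars-streaming", "polars-gpu")


def group_by_engine(results):
    """Group (engine, query, seconds) tuples into {engine: {query: seconds}}.

    Raises ValueError for an unknown engine label, so that timings from
    different engines are never silently merged under one name.
    """
    grouped = defaultdict(dict)
    for engine, query, seconds in results:
        if engine not in ENGINE_LABELS:
            raise ValueError(f"unknown engine label: {engine!r}")
        grouped[engine][query] = seconds
    return dict(grouped)
```

For example, calling `group_by_engine([("polars-default", 1, 0.5), ("polars-streaming", 1, 0.7)])` (placeholder timings, not measured results) yields one entry per engine rather than a single mixed "polars" series.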
We received some feedback that the streaming API is not the recommended way of running things and that we should run in the normal mode instead.
cc @ritchie46