coiled / benchmarks

Is duckdb out-of-core processing properly enabled? #1509

Open binste opened 2 months ago

binste commented 2 months ago

I just saw the great talk on Dask DataFrames 2.0 at PyData Berlin! I was a bit surprised that DuckDB timed out on some of the queries. According to https://duckdb.org/docs/guides/performance/how_to_tune_workloads#larger-than-memory-workloads-out-of-core-processing, if you are not connected to a persistent DuckDB database file, you need to set a temporary directory so that DuckDB can spill to disk. Based on the code in https://github.com/coiled/benchmarks/blob/63ca3c20cfd6c8352eebf880211e41a85793be32/tests/tpch/test_duckdb.py, I think the benchmarks don't use a persistent database file, so this would apply here.

I'm not 100% sure this isn't already set somewhere else, as I didn't dig through all the testing-related code, but I thought you might want to know.

Related issues are #1488, #1214, and #1194.

hendrikmakait commented 2 months ago

@binste: Thanks for creating this issue. It looks like we have indeed missed this and there's no directory available to DuckDB for storing its data. I've created a PR that sets the appropriate config value and will investigate the impact this has on the performance/scalability of DuckDB.
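
For anyone following along, here is a minimal sketch of what setting such a config value could look like with the DuckDB Python client. It is not the code from the PR. The helper names `spill_config` and `connect_with_spill`, the directory path, and the `4GB` memory limit are illustrative assumptions; `temp_directory` and `memory_limit` are DuckDB settings that can be passed through the `config` argument of `duckdb.connect`.

```python
from pathlib import Path


def spill_config(spill_dir, memory_limit="4GB"):
    """Build a DuckDB config dict that points spilling at `spill_dir`.

    Hypothetical helper: the default memory limit is an arbitrary example,
    not a value taken from the benchmarks.
    """
    path = Path(spill_dir)
    # Make sure the directory exists before handing it to DuckDB.
    path.mkdir(parents=True, exist_ok=True)
    return {
        "temp_directory": str(path),   # where DuckDB may write temporary files
        "memory_limit": memory_limit,  # memory cap for the connection
    }


def connect_with_spill(spill_dir, memory_limit="4GB"):
    """Open an in-memory DuckDB connection with a temporary directory set."""
    import duckdb  # third-party package, imported here so the helper above works without it

    return duckdb.connect(
        database=":memory:",
        config=spill_config(spill_dir, memory_limit),
    )
```

With the `duckdb` package installed, `con = connect_with_spill("/tmp/duckdb_spill")` would open the connection, and `con.execute("SELECT current_setting('temp_directory')").fetchone()` should confirm that the setting took effect.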