Open shughes-uk opened 1 year ago
Thanks, the code looks very promising. Definitely worth adding. Any idea about q10? Even if there is no optimized version for that one, we should still provide valid syntax. We could also add an exception comment explaining the issue, even mentioning the GitHub issue number. I would also suggest filing a FR for dask to add an argument to the CSV reader function so it can automatically perform the file-splitting optimizations that already exist for Parquet/PyArrow.
I'm traveling today but may have some time to poke around and make some tweaks.
Apart from the code itself, I think a big-ish issue with the current setup is that I suspect our default deployment configuration is not ideal for the machine the benchmarks are running on. IIUC the benchmark server is a c6id.metal machine with 128 cores and 256GiB RAM. The defaults will launch 16 workers (processes) with 8 threads each. That's not necessarily a sweet spot for dask and something we can improve on (too many threads per process cause GIL contention). This doesn't explain why we're running out of memory, though.
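To make the tradeoff concrete, here is a minimal sketch of how one might plan a worker/thread layout before starting a cluster. The 4-threads-per-worker figure is an assumption for illustration, not a measured optimum for this benchmark:

```python
def plan_workers(total_cores: int, total_mem_gib: int, threads_per_worker: int = 4):
    """Plan a worker layout for a single machine.

    Fewer threads per worker process reduces GIL contention for
    CPU-bound pandas workloads; more workers means less memory each.
    """
    n_workers = total_cores // threads_per_worker
    mem_per_worker_gib = total_mem_gib / n_workers
    return n_workers, threads_per_worker, mem_per_worker_gib

# For the benchmark machine described above (illustrative numbers):
n_workers, nthreads, mem = plan_workers(total_cores=128, total_mem_gib=256)
print(n_workers, nthreads, mem)  # 32 workers x 4 threads, 8.0 GiB each
# These values could then be passed to, e.g.,
# distributed.LocalCluster(n_workers=32, threads_per_worker=4, memory_limit="8GiB")
```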
Looking over the code briefly, I guess the most critical problem right now is that we are not pushing down column projections automatically. This is where https://github.com/dask-contrib/dask-expr should make a big difference.
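For readers unfamiliar with projection pushdown: the idea is to fold a later column selection back into the scan itself, so columns that are never used are never read. The toy classes below are purely illustrative of what dask-expr automates; they are not its actual API:

```python
import csv
import io

class LazyRead:
    """Toy lazy CSV scan: nothing is read until compute() is called."""

    def __init__(self, text: str, columns=None):
        self.text = text
        self.columns = columns

    def __getitem__(self, cols):
        # Projection pushdown: instead of reading everything and then
        # dropping columns, rewrite the scan to request only `cols`.
        return LazyRead(self.text, list(cols))

    def compute(self):
        rows = csv.DictReader(io.StringIO(self.text))
        keep = self.columns
        return [{k: row[k] for k in (keep or row)} for row in rows]

data = "id,x,y\n1,10,100\n2,20,200\n"
# Only 'id' and 'x' are materialized; 'y' is dropped at scan time.
print(LazyRead(data)[["id", "x"]].compute())
# [{'id': '1', 'x': '10'}, {'id': '2', 'x': '20'}]
```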
Regarding the missing features, I don't think there is anything missing. I opened https://github.com/duckdblabs/db-benchmark/pull/58 to re-enable those missing queries. A follow-up PR can go over the existing code and clean it up.
Hi there,
At Coiled we have better-optimized versions of these tests:
https://github.com/coiled/benchmarks/blob/main/tests/benchmarks/test_h2o.py
Any chance of things being updated to give dask a fairer shot?