Dask tests could use optimization

shughes-uk commented 1 year ago

Hi there,

At Coiled we have better optimized versions of the tests

https://github.com/coiled/benchmarks/blob/main/tests/benchmarks/test_h2o.py

Any chance of things being updated to give dask a fairer shot?

jangorecki commented 1 year ago

Thanks, code looks very promising. Definitely worth to add. Any idea about q10? Even if there is no optimized version for that one, we should still provide valid syntax. We can as well define exception comment that is used to explain the issue, even mentioning gh issue number. I would also suggest to file a FR for dask to have an argument in csv reader function so it can do all those optimizations related to parquet parrow file splitting automatically.

shughes-uk commented 1 year ago

Im traveling today but may have some time to poke around and make some tweaks.

fjetter commented 1 year ago

Apart from the code itself, I think a big-ish issue with the current setup is that I suspect our default deployment configuration is not ideal for the machine the benchmarks are running on. IIUC the benchmark server is a c6id.metal machine with 128 cores and 246GiB RAM. That will launch 16 workers (processes) with 8 threads each. That's not necessarily a sweet spot for dask and something we can improve on. (too many threads per process cause GIL contention). This doesn't explain why we're running out of memory.

Looking over the code briefly, I guess the most critical problem right now is that we are not pushing down column projections automatically. This is where https://github.com/dask-contrib/dask-expr should make a big difference.

Regarding the missing features, I don't think there is anything missing. I opened https://github.com/duckdblabs/db-benchmark/pull/58 to re-enable those missing queries. A follow up PR can go over the existing code and clean that up

duckdblabs / db-benchmark

Dask tests could use optimization #56