coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
28 stars 17 forks source link

[TPC-H] Query 9 memory issue #1381

Open hendrikmakait opened 7 months ago

hendrikmakait commented 7 months ago

The memory issue in query 9 is caused by dask-expr choosing a P2P hash join when joining the data with the nation dataset as the very last join in https://github.com/coiled/benchmarks/blob/e2935b00029bf5e24ef8ca93391a06e3ced1cd1e/tests/tpch/dask_queries.py#L422

As mentioned in #1380, the nation table has 25 partitions for 25 values, so the entire joined dataset will be partitioned into (at most) 25 partitions that contain data, which leads to issues when loading it in the unpack phase. If explicitly choosing a broadcast join via broadcast=True, the memory spike disappears:

P2P Hash Join Cluster: https://cloud.coiled.io/clusters/382567/information?viewedAccount=%22dask-benchmarks%22

Screenshot 2024-02-13 at 08 00 59

Broadcast Join Cluster: https://cloud.coiled.io/clusters/382962/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics

Screenshot 2024-02-13 at 08 00 12