I'm testing a new FusedIO compression factor that works surprisingly well and accurately given some parquet file statistics, see https://github.com/dask-contrib/dask-expr/pull/917#discussion_r1509029209

Anyhow, when running TPCH Q1, this creates very large partitions of around 500MiB each. I ran this on various machine sizes, but the profile below was created on an m6i.2xlarge, i.e. 8 CPUs and 32GiB RAM -> 4GiB per core.
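(For context on the compression factor mentioned above, here is only a rough sketch of the kind of statistics involved, not the actual heuristic in the linked PR: the parquet footer already records compressed and uncompressed byte sizes per column chunk, so a per-file factor can be estimated without reading any data. The function name and path below are made up.)

```python
import pyarrow.parquet as pq

def estimate_compression_factor(path: str) -> float:
    # Ratio of uncompressed to compressed bytes, taken from the parquet footer only.
    md = pq.ParquetFile(path).metadata
    compressed = uncompressed = 0
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size
    return uncompressed / compressed

# e.g. estimate_compression_factor("lineitem/part.0.parquet")  # hypothetical path
```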
Given the relatively simple nature of Q1, I would expect the memory footprint to be about 8 (#threads) x 500MiB x 2 (we surely copy stuff at some point), which should give a peak memory usage of roughly 8GiB. Instead, I'm seeing that my workers are occasionally dying. The cluster is barely pulling through with a couple of casualties.
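Spelled out, the back-of-the-envelope expectation is just (nothing measured here):

```python
threads = 8          # m6i.2xlarge
partition_mib = 500
copy_factor = 2      # allow one extra copy per in-flight partition
print(f"{threads * partition_mib * copy_factor / 1024:.1f} GiB expected peak")  # ~7.8 GiB
```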
The full report is attached below. Inspecting the peak, the RSS is at about 25GiB, which breaks down roughly into:
- 6.8GiB of raw data after the concat (spot on to what I expect :white_check_mark:)
- 12.2GiB for a copy that is triggered by pandas during `take`!
- 3.7GiB for another pandas copy after assigning a column
- … the rest is some other minor stuff that's being copied
These copies feel quite excessive, and I wonder how/if they can be avoided. Apart from killing my poor cluster, I assume they are also slowing us down quite a bit.
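For reference, here is a rough standalone sketch of the two big copies as I understand them. This is my own reproduction attempt at roughly partition scale, not the code path dask actually takes; the sizes and the `extra` column name are made up, and whether the column assignment copies again depends on block layout and copy-on-write settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8_000_000, 8)))    # ~488MiB of float64, roughly one partition
print(f"{df.memory_usage().sum() / 2**20:.0f} MiB")

# A positional take over all rows materializes a fresh copy of the blocks
# rather than a view; shares_memory is False here.
taken = df.take(np.arange(len(df)))
print(np.shares_memory(df[0].to_numpy(), taken[0].to_numpy()))

# Assigning a derived column; in my profile another large copy shows up at
# this point, presumably from block consolidation / copy-on-write mechanics.
taken["extra"] = taken[0] * 2
```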
tpch-profiling-py310-worker-12af50bbd0.html.zip
https://cloud.coiled.io/clusters/400383/information?viewedAccount=%22dask-engineering%22&sinceMs=1709304341909&untilMs=1709304541909&tab=Code
pandas version is 2.2.1