I'm testing a new FusedIO compression factor that works surprisingly well and accurately given some parquet file statistics, see https://github.com/dask-contrib/dask-expr/pull/917#discussion_r1509029209

Anyhow, when running TPCH Q1, this creates very large partitions of around 500MiB each. I ran this on various machine sizes, but the profile below was created on an m6i.2xlarge, i.e. 8 CPUs and 32GiB RAM -> 4GiB per core.
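(For context on the compression factor mentioned above, here is only a rough sketch of the kind of statistics involved, not the actual heuristic in the linked PR: the parquet footer already records compressed and uncompressed byte sizes per column chunk, so a per-file factor can be estimated without reading any data. The function name and path below are made up.)

```python
import pyarrow.parquet as pq

def estimate_compression_factor(path: str) -> float:
    # Ratio of uncompressed to compressed bytes, taken from the parquet footer only.
    md = pq.ParquetFile(path).metadata
    compressed = uncompressed = 0
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size
    return uncompressed / compressed

# e.g. estimate_compression_factor("lineitem/part.0.parquet")  # hypothetical path
```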
Given the relatively simple nature of Q1, I would expect the memory footprint to be about 8 (#threads) x 500MiB x 2 (we surely copy stuff at some point), which should give a peak memory usage of roughly 8GiB. Instead, I'm seeing that my workers are occasionally dying. The cluster is barely pulling through with a couple of casualties.
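Spelled out, the back-of-the-envelope expectation is just (nothing measured here):

```python
threads = 8          # m6i.2xlarge
partition_mib = 500
copy_factor = 2      # allow one extra copy per in-flight partition
print(f"{threads * partition_mib * copy_factor / 1024:.1f} GiB expected peak")  # ~7.8 GiB
```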
The full report is attached below. Inspecting the peak, the RSS is at about 25GiB, which breaks down roughly into:
- 6.8GiB of raw data after the concat (spot on to what I expect :white_check_mark:)
- 12.2GiB for a copy that is triggered by pandas during `take`!
- 3.7GiB for another pandas copy after assigning a column
- … the rest is some other minor stuff that's being copied
These copies feel quite excessive, and I wonder how/if they can be avoided. Apart from killing my poor cluster, I assume they are also slowing us down quite a bit.
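For reference, here is a rough standalone sketch of the two big copies as I understand them. This is my own reproduction attempt at roughly partition scale, not the code path dask actually takes; the sizes and the `extra` column name are made up, and whether the column assignment copies again depends on block layout and copy-on-write settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8_000_000, 8)))    # ~488MiB of float64, roughly one partition
print(f"{df.memory_usage().sum() / 2**20:.0f} MiB")

# A positional take over all rows materializes a fresh copy of the blocks
# rather than a view; shares_memory is False here.
taken = df.take(np.arange(len(df)))
print(np.shares_memory(df[0].to_numpy(), taken[0].to_numpy()))

# Assigning a derived column; in my profile another large copy shows up at
# this point, presumably from block consolidation / copy-on-write mechanics.
taken["extra"] = taken[0] * 2
```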
tpch-profiling-py310-worker-12af50bbd0.html.zip
https://cloud.coiled.io/clusters/400383/information?viewedAccount=%22dask-engineering%22&sinceMs=1709304341909&untilMs=1709304541909&tab=Code
pandas version is 2.2.1