coiled / benchmarks


Excessive copies in pandas drive up memory usage #1430

Closed (fjetter closed this 8 months ago)

fjetter commented 8 months ago

I'm testing a new FusedIO compression factor that works surprisingly well and accurately given some parquet file statistics; see https://github.com/dask-contrib/dask-expr/pull/917#discussion_r1509029209

Anyhow, running TPCH Q1 with this creates very large partitions of around 500 MiB. I ran this on various machine sizes, but the profile below was created on an m6i.2xlarge, i.e. 8 CPUs and 32 GB of memory, or about 4 GiB per core.

Given the relatively simple nature of Q1, I would expect the memory footprint to be about 8 (# threads) x 500 MiB x 2 (we surely copy stuff at some point), i.e. a peak memory usage of roughly 8 GiB. Instead, I'm seeing my workers occasionally dying, and the cluster barely pulls through with a couple of casualties.
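For reference, the back-of-envelope arithmetic behind that expectation (all numbers taken from the paragraph above):

```python
n_threads = 8        # worker threads on an m6i.2xlarge
partition_mib = 500  # observed partition size in MiB
copy_factor = 2      # assume one full copy somewhere along the way

expected_peak_gib = n_threads * partition_mib * copy_factor / 1024
print(f"expected peak ~ {expected_peak_gib:.1f} GiB")  # ~ 7.8 GiB
```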

The full report is attached below, but inspecting the peak, the RSS tops out at about 25 GiB, which breaks down roughly as follows:

[memory profile screenshot: breakdown of allocations at the RSS peak; the largest contributors include roughly 3.7 GB from pandas column assignment plus further copies from take]

These copies feel quite excessive, and I wonder how (or whether) they can be avoided. Apart from killing my poor cluster, I assume they are also slowing us down quite a bit.

tpch-profiling-py310-worker-12af50bbd0.html.zip

https://cloud.coiled.io/clusters/400383/information?viewedAccount=%22dask-engineering%22&sinceMs=1709304341909&untilMs=1709304541909&tab=Code

pandas version is 2.2.1

mrocklin commented 8 months ago

Maybe copy-on-write helps a little with the 3.7 GB from column assignment? Probably doesn't help with take though?
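For context, copy-on-write is opt-in in pandas 2.x. A minimal sketch of what enabling it changes, using a purely illustrative frame rather than the TPCH data:

```python
import numpy as np
import pandas as pd

# COW is opt-in in pandas 2.x (it becomes the default in pandas 3):
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": np.arange(1_000_000)})

# Under COW, a column selection is a lazy view rather than an eager copy:
subset = df[["a"]]
print(np.shares_memory(df["a"].to_numpy(), subset["a"].to_numpy()))  # True

# Assigning a new column no longer triggers a defensive copy of the
# existing columns either; data is duplicated only if one side mutates.
df["b"] = df["a"] * 2
```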

fjetter commented 8 months ago

Well, both go away with COW 🎉
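For anyone reproducing this: one way to enable COW everywhere on a Dask cluster is to set the pandas option on the client and on each worker. This is just a sketch with a placeholder scheduler address, not necessarily how it was enabled in this run:

```python
import pandas as pd
from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

def enable_cow():
    import pandas as pd
    pd.set_option("mode.copy_on_write", True)

enable_cow()            # on the client process
client.run(enable_cow)  # on every worker process
# Alternatively, set PANDAS_COPY_ON_WRITE=1 in the workers' environment.
```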