dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Unmanaged memory because of block splitting in pandas #7800

Open phofl opened 1 year ago

phofl commented 1 year ago

Describe the issue:

pandas started splitting blocks in 2.0 to improve the performance of setitem when a full column is replaced. The blocks produced by the split are views into the original block, so the replaced column's old data is never released. This keeps unused data in memory.
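For illustration, the same mechanism can be seen in pandas alone, without Dask. This is a minimal sketch; the shares_memory check assumes pandas >= 2.0 block splitting and a non-copying DataFrame construction:

import numpy as np
import pandas as pd

arr = np.random.random((1_000_000, 10))
df = pd.DataFrame(arr, columns=list("abcdefghij"), copy=False)

df["b"] = 1  # pandas >= 2.0 splits the block instead of writing in place

# The untouched columns are still views into the original 10-column buffer,
# so the whole buffer (including the old "b" data) stays alive:
print(np.shares_memory(df["a"].to_numpy(), arr))  # expected: True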

Minimal Complete Verifiable Example:

import dask.array as da
import dask.dataframe as dd

# Create 10 float64 columns of ~400 MB each (50M rows x 8 bytes)
ddf = dd.from_array(da.random.random((50_000_000, 10)), columns=list("abcdefghij"))

ddf["b"] = 1
# ddf = ddf.rename(columns={"a": "x"})
ddf = ddf.persist()  # keep a reference so the persisted data stays on the cluster

cc @crusaderky we chatted offline about this last week. Anything we can do here? Should this be counted as managed memory? Rename triggers a deep copy before we persist, which brings the unmanaged memory down.

Anything else we need to know?:

Environment:

crusaderky commented 1 year ago

This keeps unused data in memory.

Reproduced. Indeed we need to fix this (through intelligent deep-copy?)

import gc
import time
import pandas
from distributed import wait, Client
import dask.array as da
import dask.dataframe as dd

client = Client(n_workers=1)

def dump_mem(label):
    client.run(gc.collect)
    time.sleep(3)  # Wait for memory to settle and for heartbeat
    print("=" * 80)
    print(label)
    print(client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.memory))

print(pandas.__version__)
dump_mem("Empty cluster")

ddf = dd.from_array(da.random.random((500_000_000, 3)), columns=list("abc"))

ddf = ddf.persist()
wait(ddf)
dump_mem("Original dataframe")

ddf["b"] = 1
ddf = ddf.persist()
wait(ddf)
dump_mem("After setitem")
Output:

2.0.1
================================================================================
Empty cluster
Process memory (RSS)  : 129.17 MiB
  - managed by Dask   : 0 B
  - unmanaged (old)   : 74.61 MiB
  - unmanaged (recent): 54.56 MiB
Spilled to disk       : 0 B

================================================================================
Original dataframe
Process memory (RSS)  : 11.33 GiB
  - managed by Dask   : 11.18 GiB
  - unmanaged (old)   : 68.00 MiB
  - unmanaged (recent): 86.85 MiB
Spilled to disk       : 0 B

================================================================================
After setitem
Process memory (RSS)  : 15.05 GiB
  - managed by Dask   : 11.18 GiB
  - unmanaged (old)   : 68.00 MiB
  - unmanaged (recent): 3.81 GiB
Spilled to disk       : 0 B
crusaderky commented 1 year ago

Indeed we need to fix this (through intelligent deep-copy?)

Let me expand. Historically, dask has had this exact problem with numpy views, and it was solved in a draconian way by ensuring that everything is always deep-copied. Now we face the same issue with pandas.

I can see three options:

Solution 1: Do nothing

dask DataFrames can consume substantially more memory than what sizeof() reports for them. The extra memory is accounted for as unmanaged memory, and it disappears as soon as the task is either transferred over the network or spilled and then unspilled.
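As a sketch of why the extra memory disappears on its own (pickle stands in here for the network transfer or spill/unspill roundtrip; this continues the pandas-only snippet above):

import pickle
import numpy as np
import pandas as pd

arr = np.random.random((1_000_000, 10))
df = pd.DataFrame(arr, columns=list("abcdefghij"), copy=False)
df["b"] = 1  # splits the block; the other columns remain views into arr

# A serialize/deserialize roundtrip materialises only the data the frame
# actually references, so the original 10-column buffer can be freed:
df = pickle.loads(pickle.dumps(df))
print(np.shares_memory(df["a"].to_numpy(), arr))  # expected: False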

User cost

User benefit

Dev cost

Dev benefit

Solution 2: Deep-copy as needed

Ensure dask.dataframe forces a deep copy every time this kind of situation arises, i.e. there is never an invisible piece of memory being kept alive. This behaviour would be consistent with dask.array.

A pandas API for this (with zero cost when there's no invisible memory), e.g. pandas.DataFrame.trim_unused_buffers, would greatly simplify the implementation.
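A user-level sketch of what such a helper could do (hypothetical: trim_unused_buffers does not exist in pandas, and the check relies on BlockManager internals and only covers numpy-backed blocks):

import numpy as np
import pandas as pd

def trim_unused_buffers(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch: deep-copy only when a numpy-backed block is a
    # view that keeps a larger buffer alive; otherwise return df unchanged,
    # so the call is essentially free in the common case.
    for blk in df._mgr.blocks:
        values = blk.values
        if isinstance(values, np.ndarray) and values.base is not None:
            if values.base.nbytes > values.nbytes:
                return df.copy(deep=True)
    return df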

User cost

User benefit

Dev cost

Dev benefit

Solution 3: :sparkles: The Fancy One :sparkles:

Write a variant of sizeof() that returns two measures: the memory the object actually references (what sizeof() reports today) and the extra, invisible memory that its buffers keep alive.

Track this invisible memory in the SpillBuffer, in the Bokeh GUI, and in Prometheus. Encapsulate heuristics in the SpillBuffer so that, before actually spilling to disk, it first does a serialization/deserialization roundtrip to wipe away this extra memory (caveat: this is extremely expensive with object strings; the same pandas API as in point 2 would be very useful here).
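A rough sketch of what the two-measure sizeof could look like (hypothetical; it relies on BlockManager internals and only handles numpy-backed blocks):

import numpy as np
import pandas as pd
from dask.sizeof import sizeof

def sizeof_two_measures(df: pd.DataFrame) -> tuple[int, int]:
    # Returns (managed, invisible): `managed` is what sizeof() reports today,
    # `invisible` is the extra memory kept alive by buffers the frame only
    # partially references. Hypothetical sketch, numpy-backed blocks only.
    managed = sizeof(df)
    referenced = 0
    held = 0
    seen_bases = set()
    for blk in df._mgr.blocks:
        values = blk.values
        if isinstance(values, np.ndarray):
            base = values.base if values.base is not None else values
            referenced += values.nbytes
            if id(base) not in seen_bases:
                seen_bases.add(id(base))
                held += base.nbytes
    return managed, max(held - referenced, 0)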

User cost

None

User benefit

Dev cost

Dev benefit

phofl commented 1 year ago

Another example that produces a lot of unmanaged memory (not sure if this is already known, but this behaviour has been around forever):

length = 50_000_000  # for example; any large row count shows the effect
ddf = dd.from_array(da.random.randint(1, 100, (length, 16)), columns=list("abcdefghijklmnop"))
ddf1 = ddf["a"].persist()

This might be a bigger problem than the initial example.

The initial example creates unmanaged memory only temporarily, because in most cases a copy is triggered as soon as another operation is executed on the modified DataFrame. Here, if the user keeps ddf1 alive for some reason, the memory of the whole 16-column DataFrame is never freed, even though only one column is used.
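The pandas-level mechanism here is different from block splitting: selecting a single column returns a Series that is a view into the multi-column block, so each persisted partition of ddf1 keeps the whole 16-column buffer alive. A pandas-only sketch (assuming non-copying construction and column access):

import numpy as np
import pandas as pd

arr = np.random.randint(1, 100, (1_000_000, 16))
df = pd.DataFrame(arr, columns=list("abcdefghijklmnop"), copy=False)

s = df["a"]  # a view into the 16-column block, not a copy
# Keeping only `s` alive keeps the whole 16-column buffer alive:
print(np.shares_memory(s.to_numpy(), arr))  # expected: True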

crusaderky commented 1 year ago

Another example that produces a lot of unmanaged memory (not sure if this is already known, but this behaviour has been around forever):

ddf = dd.from_array(da.random.randint(1, 100, (length, 16)), columns=list("abcdefghijklmnop"))
ddf1 = ddf["a"].persist()

O_o I'm very surprised that dask.dataframe doesn't deep-copy like dask.array does

fjetter commented 1 year ago

IIUC this all boils down to dask.dataframe.methods.assign, which already has some logic around deep-copying. This likely needs refinement:

https://github.com/dask/dask/blob/053c5425e415a84de9fef31d2bf94b8fb9ef477e/dask/dataframe/methods.py#L354