dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Profiling data of Scheduler.update_graph for very large graph #7998

Open fjetter opened 1 year ago

fjetter commented 1 year ago

I recently had the pleasure to see how the scheduler reacts to a very large graph. Not too well.

I submitted a graph with a couple million tasks. Locally it looks like 2.5MM tasks but the scheduler later reports fewer. Anyhow, it's seven digits. update_graph ran for about 5min, i.e. it was also blocking the event loop for that time (https://github.com/dask/distributed/issues/7980)
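For a sense of scale, here is a sketch of an array workload in that size range (assumed shapes and chunking, not the actual graph from this report); submitting it is what hands the graph to Scheduler.update_graph:

# Assumed workload, for scale only; the graph from this report will differ.
import dask.array as da
from distributed import Client

client = Client()  # connects to (or starts) a cluster

x = da.random.random((200_000, 200_000), chunks=(200, 200))  # 1,000,000 chunks
y = x.sum()
print(len(y.__dask_graph__()))  # roughly 2 million tasks

# Submitting the collection triggers Scheduler.update_graph on the scheduler,
# which is where the ~5 min reported above is spent.
fut = client.compute(y)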

What is eating up the most time:

| Function | Share of time |
| --- | --- |
| particularly this check for __main__ in dumps result | 5% |
| stringify | 12% |
| key_split | 12% |
| unpack_remotedata | 12% |
| generate_taskstate | 20% |
| dask.order | 12% |
| transitioning all tasks | 10% |
| Other foo (e.g. walking the graph for deps and such) | 17% |

It also looks like the TaskState objects and all the foo attached to them take up about 65% of the memory, which in this case is about 82GiB. Assuming we're at 2MM tasks, that's roundabout 40KB per TaskState. That's quite a lot.
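A quick back-of-envelope check of that per-task figure, using the numbers quoted above:

from dask.utils import format_bytes

taskstate_memory = 82 * 2**30    # ~82 GiB attributed to TaskState and attached state
n_tasks = 2_000_000              # "roundabout" 2MM tasks
print(format_bytes(taskstate_memory // n_tasks))  # -> '42.99 kiB', i.e. ~40KB per task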

scheduler-profile.zip

Nothing to do here, this is purely informational.

fjetter commented 1 year ago

I did some memory profiling of the scheduler [1] based on 145c13aea1b4214a8bb1378f581104492b1be0c7

I submitted a large array workload with about 1.5MM tasks. The scheduler requires about 6GB of RAM to hold the computation state in memory. The peak is a bit larger since there is some intermediate state required (mostly for dask.order).

(figure: scheduler memory usage over time, rising to a peak and then plateauing around 6GB)

Once the plateau of this graph is reached, computation starts and the memory usage breaks down roughly as

5836 MiB Total
├── 290 MiB Raw graph -> (This is only part of the raw deserialized graph. The original one is about 680MiB)
├── 1945 MiB Materialized Graph
│   ├── 584 MiB Stringification of values of dsk [2]
│   ├── 132 MiB Stringification of keys [3]
│   ├── 624 MiB Stringification of dependencies [4]
│   ├── 475 MiB dumps_task (removed on main)
│   └── 130 MiB Other
├── 3379 MiB generate_taskstates
│   ├── 377 MiB key_split (group keys)
│   ├── 80 MiB actual tasks dict
│   └── 2970 MiB TaskState object
│       ├── 347 MiB slots (actual object space)
│       ├── 112 MiB weakref instances on self
│       ├── 110 MiB key_split_group
│       └── 2355 MiB various sets in TaskState (profiler points to waiting_on)
└── 222 MiB Other

The two big contributions worth discussing are the TaskState objects, which allocate more than 2GiB, and the materialized graph.

The tracing for the TaskState objects is a little fuzzy (possibly because they use slots?) but it largely points to the sets in TaskState. Indeed, even empty sets allocate a relatively large amount of memory. With 9 sets and 3 dictionaries we're already at a lower bound per TaskState of 2.39KiB:

# Python 3.10.11
import sys
from dask.utils import format_bytes

# shallow size of the 9 empty sets + 3 empty dicts each TaskState carries
format_bytes(9 * sys.getsizeof(set()) + 3 * sys.getsizeof(dict()))
# -> '2.09 kiB'

which adds up to almost 3GiB alone for 1.5MM tasks. The actual memory use measured above (2355 MiB for the various sets) is even lower than this lower bound suggests (not sure what went wrong here...)
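For reference, the "almost 3GiB" figure comes from scaling that lower bound up (same assumptions as the snippet above, CPython 3.10 on a 64-bit build):

import sys
from dask.utils import format_bytes

per_task = 9 * sys.getsizeof(set()) + 3 * sys.getsizeof(dict())  # 2136 B on CPython 3.10, 64-bit
print(format_bytes(1_500_000 * per_task))                        # -> '2.98 GiB'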

The other large contribution is the stringification of keys. Stringify does not cache/deduplicate str values, nor is the Python interpreter able to intern our keys (afaik that's only possible with ASCII chars), so every call to stringify effectively allocates new memory. While the actual stringified keys only take 132MiB in this example, the lack of deduplication blows this up to much more.
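As an illustration of the deduplication point, a hypothetical memoized stringification (not distributed's API) would make repeated keys share a single str object:

# Hypothetical sketch, not part of distributed: memoize key stringification so
# a key referenced from many dependency sets maps to one shared str object.
from functools import lru_cache

@lru_cache(maxsize=None)
def stringify_cached(key):
    return str(key)  # stand-in for the real stringify logic

a = stringify_cached(("sum-aggregate", 0, 1))
b = stringify_cached(("sum-aggregate", 0, 1))
assert a is b  # one allocation instead of one per call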

This suggests that we should either remove or rework stringification and possibly consider a slimmer representation of our TaskState object.
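One possible shape for a slimmer representation (a sketch under assumptions, not the actual TaskState class): keep __slots__, but allocate the per-task sets lazily so that tasks which never touch e.g. waiting_on don't pay for an empty set.

# Sketch only; slot names are illustrative and the real TaskState has many more.
class SlimTaskState:
    __slots__ = ("key", "_waiting_on", "_waiters")

    def __init__(self, key):
        self.key = key
        self._waiting_on = None  # created on first use instead of eagerly
        self._waiters = None     # the other sets/dicts would be handled the same way

    @property
    def waiting_on(self):
        if self._waiting_on is None:
            self._waiting_on = set()
        return self._waiting_on

The trade-off is an extra attribute indirection on every access and slightly more branching in the transition logic.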


[1] scheduler_memory_profile.html.zip
[2] https://github.com/dask/distributed/blob/145c13aea1b4214a8bb1378f581104492b1be0c7/distributed/scheduler.py#L4769
[3] https://github.com/dask/distributed/blob/145c13aea1b4214a8bb1378f581104492b1be0c7/distributed/scheduler.py#L4767
[4] https://github.com/dask/distributed/blob/145c13aea1b4214a8bb1378f581104492b1be0c7/distributed/scheduler.py#L4773

fjetter commented 1 year ago

Note that the above graph wasn't using any annotations. Annotations will add one more stringification for each annotated key.
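For context, annotations are attached per layer like this (toy example, not from this report), and each key in an annotated layer then goes through that extra stringification:

import dask
import dask.array as da

# Layers created inside the context carry the annotations.
with dask.annotate(retries=2, priority=1):
    x = da.random.random((10_000, 10_000), chunks=(100, 100))

y = x.sum()  # the random layer is annotated, the reduction layers are not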