fjetter opened 1 year ago
I did some memory profiling of the scheduler [1] based on 145c13aea1b4214a8bb1378f581104492b1be0c7.
I submitted a large array workload with about 1.5MM tasks. The scheduler requires about 6GB of RAM to hold the computation state in memory. The peak is a bit higher since some intermediate state is required (mostly for dask.order).
Once the plateau of this graph is reached, computation starts and the memory usage breaks down roughly as follows:
5836 MiB Total
├── 290 MiB Raw graph -> (This is only part of the raw deserialized graph. The original one is about 680MiB)
├── 1945 MiB Materialized Graph
│ ├── 584 MiB Stringification of values of dsk [2]
│ ├── 132 MiB Stringification of keys [3]
│ ├── 624 MiB Stringification of dependencies [4]
│ ├── 475 MiB dumps_task (removed on main)
│ └── 130 MiB Other
├── 3379 MiB generate_taskstates
│ ├── 377 MiB key_split (group keys)
│ ├── 80 MiB actual tasks dict
│ └── 2970 MiB TaskState object
│ ├── 347 MiB slots (actual object space)
│ ├── 112 MiB weakref instances on self
│ ├── 110 MiB key_split_group
│ └── 2355 MiB various sets in TaskState (profiler points to waiting_on)
└── 222 MiB Other
The two big contributions worth discussing are the TaskState objects, which allocate more than 2GiB, and the materialized graph.
The tracing for the TaskState object is a little fuzzy (possibly because it uses slots?) but it largely points to the sets in TaskState. Indeed, even empty sets allocate a fair amount of memory. With 9 sets and 3 dictionaries we're already at a lower bound of about 2.09 KiB per TaskState:
# Python 3.10.11
import sys
from dask.utils import format_bytes

# nine empty sets plus three empty dicts per TaskState
format_bytes(
    9 * sys.getsizeof(set())
    + 3 * sys.getsizeof(dict())
)
# '2.09 kiB'
which adds up to almost 3GiB for 1.5MM tasks from the empty containers alone. Oddly, the actual measured memory use comes in even lower than this supposed lower bound (not sure what went wrong here...).
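As a rough sanity check (assuming the same empty-container sizes as in the snippet above, which vary by Python build), multiplying that per-task lower bound out over the full graph gives:

import sys
from dask.utils import format_bytes

# per-TaskState lower bound from the empty containers alone (~2136 bytes here)
per_task = 9 * sys.getsizeof(set()) + 3 * sys.getsizeof(dict())
format_bytes(per_task * 1_500_000)   # -> '2.98 GiB'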
The other large contribution is the stringification of keys. Stringify does not cache or deduplicate str values, nor can the Python interpreter intern our keys (afaik, that is only possible with ASCII chars), so every call to stringify effectively allocates new memory. While the unique stringified keys only take 132MiB in this example, the lack of deduplication blows this up to much more.
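A minimal sketch of what deduplication could look like; intern_key and _key_cache are hypothetical names, not part of distributed:

# Hypothetical helper: a dict-based intern cache so that repeated stringification
# of the same key hands back one shared str object instead of a fresh allocation.
_key_cache: dict[str, str] = {}

def intern_key(key: str) -> str:
    # setdefault stores the key on first sight and returns the stored instance
    # afterwards, so dependents/dependencies can all reference a single str.
    return _key_cache.setdefault(key, key)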
This suggests that we should either remove or rework stringification and possibly consider a slimmer representation of our TaskState object.
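For the slimmer-TaskState direction, one option is to keep __slots__ but create the per-task sets lazily, so tasks whose sets stay empty pay only for a None pointer instead of an empty set. A hypothetical sketch, not the actual TaskState implementation:

class SlimTaskState:
    # Only one of the many per-task sets is shown here for illustration.
    __slots__ = ("key", "_waiting_on")

    def __init__(self, key: str):
        self.key = key
        self._waiting_on = None   # no set() allocated until something is added

    @property
    def waiting_on(self) -> set:
        if self._waiting_on is None:
            self._waiting_on = set()
        return self._waiting_on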
[1] scheduler_memory_profile.html.zip
[2] https://github.com/dask/distributed/blob/145c13aea1b4214a8bb1378f581104492b1be0c7/distributed/scheduler.py#L4769
[3] https://github.com/dask/distributed/blob/145c13aea1b4214a8bb1378f581104492b1be0c7/distributed/scheduler.py#L4767
[4] https://github.com/dask/distributed/blob/145c13aea1b4214a8bb1378f581104492b1be0c7/distributed/scheduler.py#L4773
Note that the above graph wasn't using any annotations. Annotations will add one more stringification for annotated keys.
I recently had the pleasure of seeing how the scheduler reacts to a very large graph. Not too well.
I submitted a graph with a couple of million tasks. Locally it looks like 2.5MM tasks but the scheduler later reports fewer. Anyhow, it's seven digits. update_graph ran for about 5min, blocking the event loop for that entire time (https://github.com/dask/distributed/issues/7980).
What is eating up the most time, per the attached profile, is dumps (most of the runtime sits under __main__ in the dumps call).
It also looks like the TaskState objects and all the stuff attached to them take up about 65% of the memory, which in this case is about 82GiB. Assuming we're at 2MM tasks, that's roundabout 40KB per TaskState. That's quite a lot.
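For reference, the per-task figure is just rough arithmetic over the numbers above:

# 82 GiB of TaskState-related memory spread over roughly 2MM tasks
82 * 2**30 / 2_000_000   # ≈ 44,000 bytes, i.e. roundabout 40 kB per TaskState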
scheduler-profile.zip
Nothing to do here, this is purely informational.