Consider using object pools for internal state keeping

fjetter commented 1 month ago

Dask is creating many small objects and not just for very large graphs but this is pretty much a built in thing. This is one of not the most prominent reason why dask was originally built as a "tuple of tuple of tuple ..." machinery. Even the scheduler internally only adopted the usage of custom classes only a couple of years back because instantiation can be very costly. However, modern python versions have gotten much better at managing overhead so this is negligible for most classes.

The one type of objects we're still affected by, both in terms of memory but also in terms of runtime are sets. Yes, sets! (I might share profiles but this is mostly an issue to preseve an idea). One way to work around instantiation cost of sets is to use an object pool design. Effectively, we'd "disable" garbage collection and would resurrect objects on finalization in a way that would allow us to reuse them.

A minimal version of this would look like that

import sys

class SetPool:
    def __init__(self):
        self._sets = []

    def add(self, obj):
        obj.clear()
        self._sets.append(obj)

    def get(self):
        try:
            new = self._sets.pop()
            return new
        except:
            return None

    def stored_size(self):
        return sum(map(sys.getsizeof, self._sets))

globalpool = SetPool()

class PooledSet(set):

    def __new__(cls, *iterables):
        obj = globalpool.get()
        if obj is not None:
            return obj
        return super().__new__(cls, *iterables)

    def __del__(self):
        print(f"Resurrecting {id(self)}")
        globalpool.add(self)

This could be expanded on need, e.g. by hinting towards whether this would be an empty set, a very large set or a somewhat normal one.

Haven't tried out what the actual impact would be but it is a fun concept that could help if we actually want/need to optimize for this

mrocklin commented 1 month ago

I'm surprised that set construction is a big deal. (although more like 1-sigma surprised rather than 3-sigma surprised). One thought: it might be worth asking a CPython person about this? Someone like Antoine? (intentionally not pinging his github in case this is a silly idea)

fjetter commented 1 month ago

Yeah, it's funny and was surprised when I initially saw this (months ago). dicts, for instance, are way better optimized since they are used internally in CPython all over the place but sets are a bit meh...

For example, size of empty sets vs empty dicts

>>> import sys
>>> sys.getsizeof(set())
216
>>> sys.getsizeof(dict())
64

fjetter commented 1 month ago

I wouldn't be surprised if we could make this go faster if one tuned the python allocator but I'm not sure if I want to go down that hole.

dask / distributed

Consider using object pools for internal state keeping #8901