Open fjetter opened 1 month ago
I'm surprised that set construction is a big deal. (although more like 1-sigma surprised rather than 3-sigma surprised). One thought: it might be worth asking a CPython person about this? Someone like Antoine? (intentionally not pinging his github in case this is a silly idea)
Yeah, it's funny and was surprised when I initially saw this (months ago). dicts, for instance, are way better optimized since they are used internally in CPython all over the place but sets are a bit meh...
For example, size of empty sets vs empty dicts
>>> import sys
>>> sys.getsizeof(set())
216
>>> sys.getsizeof(dict())
64
I wouldn't be surprised if we could make this go faster if one tuned the python allocator but I'm not sure if I want to go down that hole.
Dask is creating many small objects and not just for very large graphs but this is pretty much a built in thing. This is one of not the most prominent reason why dask was originally built as a "tuple of tuple of tuple ..." machinery. Even the scheduler internally only adopted the usage of custom classes only a couple of years back because instantiation can be very costly. However, modern python versions have gotten much better at managing overhead so this is negligible for most classes.
The one type of objects we're still affected by, both in terms of memory but also in terms of runtime are sets. Yes, sets! (I might share profiles but this is mostly an issue to preseve an idea). One way to work around instantiation cost of sets is to use an object pool design. Effectively, we'd "disable" garbage collection and would resurrect objects on finalization in a way that would allow us to reuse them.
A minimal version of this would look like that
This could be expanded on need, e.g. by hinting towards whether this would be an empty set, a very large set or a somewhat normal one.
Haven't tried out what the actual impact would be but it is a fun concept that could help if we actually want/need to optimize for this