Closed ericvsmith closed 7 years ago
I think we should allow __slots__. Although they are not mainstream, they are still used. I am however not sure about the API that we should use. I think @add_slots still sounds like you patch an existing class. Maybe call it @with_slots? Finally, maybe we can still use a single decorator, but call the keyword with_slots to distinguish it from other keywords? My point is that people who will use with_slots are probably familiar with how slots work, so they will not be surprised that this option returns a new class.
I propose to punt this down the road. If people want slots they can manually add __slots__ = ('x', 'y', 'z') to their class.
Regarding whether people would be surprised by the need to generate a new class, I was surprised, and I built slots. :-)
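One concrete way a regenerated class can surprise people, sketched here with a hypothetical rebuild decorator (an illustration only, not code from dataclasses or attrs): zero-argument super() keeps pointing at the class object that existed before the copy was made, so it fails on instances of the copy.

```python
def rebuild(cls):
    """Hypothetical decorator: return a copy of cls built from its namespace."""
    namespace = dict(cls.__dict__)
    # These descriptors belong to the original class and must not be copied.
    namespace.pop('__dict__', None)
    namespace.pop('__weakref__', None)
    return type(cls)(cls.__name__, cls.__bases__, namespace)


class Base:
    def hello(self):
        return 'base'


@rebuild
class Child(Base):
    def hello(self):
        # Zero-argument super() relies on the __class__ cell, which still
        # refers to the class object that existed before rebuild() ran.
        return super().hello()


def call_hello():
    try:
        return Child().hello()
    except TypeError as exc:
        return f'TypeError: {exc}'


result = call_hello()
print(result)
```

Writing super(Child, self) explicitly would avoid the failure in this sketch, because the name Child is looked up at call time and by then refers to the copy.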
In the future we can choose any of the other options. I would be fine with eventually getting back slots=True and only generating a new class if that's given. (FWIW it should probably complain if any base class has a __dict__ -- that's a common error case.)
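A minimal sketch of what such a complaint could be based on. The helper name first_base_with_dict is hypothetical, and the check is a heuristic that assumes ordinary Python-level base classes: a class defined without __slots__ carries a '__dict__' descriptor in its own namespace, which is what gives instances of every subclass a __dict__.

```python
def first_base_with_dict(cls):
    """Return the first base of cls that gives instances a __dict__, or None."""
    for base in cls.__mro__[1:]:
        if '__dict__' in vars(base):
            return base
    return None


class Plain:
    pass


class Slotted:
    __slots__ = ('x',)


class ChildOfPlain(Plain):
    # __slots__ here does not remove the __dict__ inherited from Plain.
    __slots__ = ('y',)


class ChildOfSlotted(Slotted):
    __slots__ = ('y',)


print(first_base_with_dict(ChildOfPlain))
print(first_base_with_dict(ChildOfSlotted))
```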
In the meantime people can also use NamedTuple if they just want slots.
Agreed. I removed slots in PR #30. The git tag last-version-with-slots points to the code where slots was working.
@ericvsmith Adding __slots__ manually works as long as there are no defaults:
>>> @dataclass
... class C:
...     __slots__ = {'x', 'y'}
...     x: int
...     y: int
...
>>> o = C(1,2)
>>> o
C(x=1, y=2)
>>> @dataclass
... class C:
...     __slots__ = {'x', 'y'}
...     x: int
...     y: int = 1
...
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: 'y' in __slots__ conflicts with class variable
You're likely already aware of this, but I'm letting you know on the small chance it got missed.
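For context, the same error appears without @dataclass at all. A default written in the class body is a class variable, and Python rejects a class variable that shares a name with a slot while the class is being created, before any decorator runs. A minimal sketch (the helper name is only for illustration):

```python
def define_conflicting_class():
    """Try to define a class whose slot name is also a class variable."""
    try:
        class C:
            __slots__ = ('x', 'y')
            y = 1  # class variable with the same name as a slot
    except ValueError as exc:
        return str(exc)
    return None


message = define_conflicting_class()
print(message)
```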
(My interest in this is making dataclasses work with my "autoslot" toy class which injects slots into the class definition via a metaclass-enabled superclass: https://github.com/cjrh/autoslot. To make it compatible with @dataclass, inside my metaclass I can look for __annotations__ in the cls namespace, and that works fine, but I can't get around the class-variable conflict shown in the traceback above.)
Thinking it over, I think my use-case is different to what dataclasses are for, and so compatibility probably doesn't make sense anyway.
I totally think slots should be default behavior.
(Disclaimer - I gave the PyCon 2017 slots talk: https://www.youtube.com/watch?v=N7MfisN44nY and I had the latest contribution to the datamodel docs on __slots__)
To break it down: slots add a data descriptor to the class that points to a slot in a struct-like datastructure. They get accessed pretty fast, and they take much less space than even the new smaller dict (like a tuple amount of space). It should be easy to programmatically determine if they should be added in the child or not. This should be a strictly dominant addition. But adding it later could break backwards compatibility if users start making the unfortunate decision to assume access to __dict__ directly or via vars.
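A small sketch of the two points above, using illustrative class names. It inspects the descriptor that __slots__ places on the class and compares shallow sizes with sys.getsizeof. The exact byte counts depend on the Python version and platform, so none are quoted here.

```python
import sys


class WithDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class WithSlots:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y


d = WithDict(1, 2)
s = WithSlots(1, 2)

# Each slot name becomes a data descriptor stored on the class.
slot_descriptor_type = type(vars(WithSlots)['x']).__name__

# Shallow sizes only: the instance itself plus, for WithDict, its __dict__.
dict_total = sys.getsizeof(d) + sys.getsizeof(vars(d))
slots_total = sys.getsizeof(s)

print(slot_descriptor_type)
print(hasattr(s, '__dict__'))
print(dict_total, slots_total)
```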
Here's some finer points relevant to the dataclasses, as I see it:
__dict__ - that just allows __dict__ to be created if accessed for a child. (Same for __weakref__.) Basically, even if the parent allows __dict__, as long as the child implementation only uses the correct slotted attributes, __dict__ isn't created. But I can see the value in warning/erroring for the case where users typo the attribute. We could have an argument, like no__dict__=True that would ensure there's no slot for a __dict__, or maybe allow__dict__ for the opposite.
Without slots, the usability of data classes is really limited. When I would want to use something like this, it is almost always in a situation where I will have many instances of the same simple data points. Without __slots__, that becomes untenable memory-wise. It's interesting that you can combine the two approaches when you don't set defaults, but the defaults are part of what makes this useful in the first place.
You can use code like @add_slots from https://github.com/ericvsmith/dataclasses/blob/master/dataclass_tools.py:
>>> from dataclasses import *
>>> from dataclass_tools import *
>>> @add_slots
... @dataclass
... class C:
...     i: int = 10
...
>>> c=C()
>>> c
C(i=10)
>>> c.__slots__
('i',)
>>> c.j=0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'C' object has no attribute 'j'
>>>
The reason this isn't in dataclasses itself is because all other features just involve adding methods to your class. __slots__ requires creating a new class, because @dataclass doesn't get control until after the class has been created, at which point it is too late to set __slots__.
There are a few possible approaches here:
- Add __slots__ after class creation. However, this would be difficult or maybe impossible.
- Add slots=True to @dataclass, which would create and return a new class. attrs takes this approach. We'd want to make sure the implications of this are understood by the users.
- Add @add_slots to dataclasses.py.
I suggest taking this to python-ideas if you'd like to champion one of these ideas.
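To make the "create and return a new class" approach concrete, here is a minimal sketch of such a decorator. The name add_slots_sketch is hypothetical, and this is an illustration of the idea under simple assumptions (no zero-argument super() in the class, no existing __slots__), not necessarily the code in dataclass_tools.py.

```python
import dataclasses
from dataclasses import dataclass


def add_slots_sketch(cls):
    """Hypothetical decorator: rebuild a dataclass with __slots__ set."""
    if '__slots__' in cls.__dict__:
        raise TypeError(f'{cls.__name__} already specifies __slots__')
    namespace = dict(cls.__dict__)
    field_names = tuple(f.name for f in dataclasses.fields(cls))
    namespace['__slots__'] = field_names
    for name in field_names:
        # Defaults are already captured by the generated __init__, and a
        # class variable with a slot's name would raise ValueError.
        namespace.pop(name, None)
    # Descriptors tied to the original class must not be carried over.
    namespace.pop('__dict__', None)
    namespace.pop('__weakref__', None)
    new_cls = type(cls)(cls.__name__, cls.__bases__, namespace)
    new_cls.__qualname__ = cls.__qualname__
    return new_cls


@add_slots_sketch
@dataclass
class C:
    i: int = 10


c = C()
print(c, C.__slots__)
```

The default still works because @dataclass runs first and bakes it into __init__. Only afterwards is the class attribute dropped and the class rebuilt.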
I like the idea of adding @dataclass(slots=True).
It's very wasteful to have a struct-like data holder class, which relies on a bloated dynamic dictionary for storage. The slots behavior should be the only behavior and dict should be banished from dataclasses. Seriously.
But okay, if we manually add __slots__ to our classes (and do not use default values), will the resulting dataclass still work properly? Or will there be internal dataclass bugs caused by lacking a dict?
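One way to probe that question is a quick check like the sketch below, which exercises the generated methods and the module-level helpers on a class with hand-written slots. The class name Point is only an example, and passing these checks is evidence for the cases shown rather than a guarantee for every feature.

```python
from dataclasses import asdict, astuple, dataclass, replace


@dataclass
class Point:
    __slots__ = ('x', 'y')
    x: int
    y: int


p = Point(1, 2)
moved = replace(p, y=5)  # builds a new instance through __init__

print(p)
print(p == Point(1, 2))
print(asdict(p), astuple(p))
print(moved)
print(hasattr(p, '__dict__'))
```

These helpers read values through the declared fields with getattr rather than through the instance __dict__, which is why they do not need one.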
I just saw https://www.youtube.com/watch?v=T-TwcmT6Rcw on YouTube and it ends with saying that yes you can manually add slots to dataclasses.
But I have decided to use attrs instead. This comment from YouTube sums it up well:
For a company that does not allow external packages (due to code safety reasons), use dataclasses. For everyone else, always use the attrs package. It is much better. Dataclasses is a subset of attrs. So with attrs you can do everything and more. Attrs allows auto-generating "slots" to optimize memory usage, and allows adding validators if you want, etc.
To illustrate the need, here's an example for a class with 3 attributes, on 64-bit Python 3.7.4:
- A @dataclass or @attr.s class creates a regular dictionary to hold all instance values. The class instances generated are exactly the same size regardless of whether the attrs or dataclass library is used... 424 bytes. And every new field you add bloats each class instance by +88 bytes.
- @attr.s(slots=True) creates a "slots" class to hold all instance values. The class instances only use 160 bytes in memory. And every new field you add increases the instance size by +40 bytes.
So forget about dataclasses. Use the attrs library with slots. It offers more features, less memory, and more speed (since slots are faster than dictionaries). What's not to love?! ;)
I agree. Sure, you can add slots manually to dataclasses, but then you lose default values, and you have to manually write each variable name in the slots list. Ew. And the dataclass instance with manually written slots was only 8 bytes smaller than the equivalent attrs instance, which can be explained by attrs metadata variables or something like that, and isn't much extra RAM to pay for all the huge benefits of attrs.
import attr
from dataclasses import dataclass
from pympler import asizeof
import time

# every additional field adds 88 bytes
@attr.s
class A:
    a = attr.ib(type=int, default=0)
    b = attr.ib(type=int, default=4)
    c = attr.ib(type=int, default=2)
    d = attr.ib(type=int, default=8)

# every additional field adds 40 bytes
@attr.s(slots=True)
class B:
    a = attr.ib(type=int, default=0)
    b = attr.ib(type=int, default=4)
    c = attr.ib(type=int, default=2)
    d = attr.ib(type=int, default=8)

# every additional field adds 88 bytes
@dataclass
class C:
    a: int = 0
    b: int = 4
    c: int = 2
    d: int = 8

# every additional field adds 40 bytes
@dataclass
class D:
    __slots__ = {"a", "b", "c", "d"}
    a: int
    b: int
    c: int
    d: int

Ainst = A()
Binst = B()
Cinst = C()
Dinst = D(0,4,2,8)

print("attrs size", asizeof.asizeof(Ainst)) # 512 bytes
print("attrs-with-slots size", asizeof.asizeof(Binst)) # 200 bytes
print("dataclass size", asizeof.asizeof(Cinst)) # 512 bytes
print("dataclass-with-slots size", asizeof.asizeof(Dinst)) # 192 bytes

s = time.perf_counter()
for i in range(0,250000000):
    x = Ainst.a
elapsed = time.perf_counter() - s
print("elapsed attrs:", (elapsed*1000), "milliseconds")

s = time.perf_counter()
for i in range(0,250000000):
    x = Binst.a
elapsed = time.perf_counter() - s
print("elapsed attrs-with-slots:", (elapsed*1000), "milliseconds")

s = time.perf_counter()
for i in range(0,250000000):
    x = Cinst.a
elapsed = time.perf_counter() - s
print("elapsed dataclass:", (elapsed*1000), "milliseconds")

s = time.perf_counter()
for i in range(0,250000000):
    x = Dinst.a
elapsed = time.perf_counter() - s
print("elapsed dataclass-with-slots:", (elapsed*1000), "milliseconds")
Results: Slots win heavily in the memory usage department, regardless of whether you use dataclass or attrs. And a dataclass with manually written slots reduces total usage by 8 bytes (a static number that does not change based on how many fields the class has) compared to attrs-with-slots. But dataclass loses with its lack of features, lack of default values if slots are used, and tedious way to write slots manually (see class "D").
attrs size 512
attrs-with-slots size 200
dataclass size 512
dataclass-with-slots size 192
As for data access benchmarks: The result varied too much between runs to draw any conclusions except to say that slots was slightly faster than dictionary-based storage. And that there's no real difference between the dataclass and attrs libraries in access-speed.
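For access timings that swing between runs, a common way to reduce noise is to repeat the measurement and keep the fastest run. The following is a sketch of that idea with the standard timeit module. The class names, the helper best_time, and the iteration counts are illustrative choices, not the setup used for the numbers above.

```python
import timeit
from dataclasses import dataclass


@dataclass
class DictBacked:
    a: int = 0


@dataclass
class SlotBacked:
    __slots__ = ('a',)
    a: int


def best_time(instance, number=200_000, repeat=5):
    """Return the fastest of several timed runs of reading instance.a."""
    timer = timeit.Timer('obj.a', globals={'obj': instance})
    return min(timer.repeat(repeat=repeat, number=number))


results = {
    'dataclass': best_time(DictBacked()),
    'dataclass-with-slots': best_time(SlotBacked(0)),
}
for label, seconds in results.items():
    print(f'{label}: {seconds * 1000:.2f} ms')
```

Taking the minimum rather than the mean discards runs that were slowed by other activity on the machine. It does not say which variant is faster on a given interpreter.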
I suggest you raise this issue on the python-ideas mailing list. This tracker is just for the backport of dataclasses features to Python 3.6 (which admittedly I'm behind on, but I'll get to it).
When this issue is on python-ideas, I'll post my thoughts there.
@ericvsmith Ah I didn't realize that. I'll post on the mailing list.
For what it's worth, there is also typing.NamedTuple, which uses slots, and you can also give a field a default value, though fields with a default value must come after any fields without a default. Example:
class Employee(NamedTuple):
    name: str
    id: int = 3
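A short sketch exercising the Employee example: it shows the default being applied, the absence of a per-instance __dict__, and the error raised when a field without a default follows one with a default. The class Bad and the helper define_bad_order exist only for this illustration.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    name: str
    id: int = 3


e = Employee('Guido')


def define_bad_order():
    """Try to put a field without a default after one that has a default."""
    try:
        class Bad(NamedTuple):
            id: int = 3
            name: str
    except TypeError as exc:
        return str(exc)
    return None


bad_order_error = define_bad_order()

print(e)
print(Employee.__slots__, hasattr(e, '__dict__'))
print(bad_order_error)
```

Unlike a regular dataclass, the resulting instances are tuples, so they are immutable and compare equal to plain tuples with the same contents.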
Currently the draft PEP specifies and the code supports the optional ability to add __slots__. This is the one place where @dataclass cannot just modify the given class and return it: because __slots__ must be specified at class creation time, it's too late by the time the dataclass decorator gets control. The current approach is to dynamically generate a new class while setting __slots__ in the new class and copying over other class attributes. The decorator then returns the new class.
The question is: do we even want to support setting __slots__? Is having __slots__ important enough to have this deviation from the "we just add a few dunder methods to your class" behavior?
I see three options:
1. Leave it as-is, with @dataclass(slots=True) returning a new class.
2. Don't support setting __slots__.
3. Add a separate decorator, @add_slots, which takes a data class and creates a new class with __slots__ set.
I think we should either go with 2 or 3. I don't mind not supporting __slots__, but if we do want to support it, I think it's easier to explain with a separate decorator.
It would be an error to use @add_slots on a non-dataclass class.
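A minimal sketch of how such an error check might look, using dataclasses.is_dataclass. The helper names require_dataclass_type and check are hypothetical, and the wording of the error message is an assumption.

```python
from dataclasses import dataclass, is_dataclass


def require_dataclass_type(cls):
    """Hypothetical guard: accept only a dataclass class, not an instance."""
    # is_dataclass() is also true for instances, hence the isinstance check.
    if not (isinstance(cls, type) and is_dataclass(cls)):
        raise TypeError(f'{cls!r} is not a dataclass; apply @dataclass first')
    return cls


@dataclass
class Good:
    x: int


class NotADataclass:
    x: int


def check(candidate):
    try:
        require_dataclass_type(candidate)
    except TypeError:
        return False
    return True


print(check(Good), check(NotADataclass), check(Good(1)))
```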