ericvsmith / dataclasses

Apache License 2.0

Support __slots__? #28

Closed ericvsmith closed 7 years ago

ericvsmith commented 7 years ago

Currently the draft PEP specifies and the code supports the optional ability to add __slots__. This is the one place where @dataclass cannot just modify the given class and return it: because __slots__ must be specified at class creation time, it's too late by the time the dataclass decorator gets control. The current approach is to dynamically generate a new class while setting __slots__ in the new class and copying over other class attributes. The decorator then returns the new class.

The question is: do we even want to support setting __slots__? Is having __slots__ important enough to have this deviation from the "we just add a few dunder methods to your class" behavior?

I see three options:

  1. Leave it as-is, with @dataclass(slots=True) returning a new class.
  2. Completely remove support for setting __slots__.
  3. Add a different decorator, say @add_slots, which takes a data class and creates a new class with __slots__ set.

I think we should either go with 2 or 3. I don't mind not supporting __slots__, but if we do want to support it, I think it's easier to explain with a separate decorator.

@add_slots
@dataclass
class C:
    x: int
    y: int

It would be an error to use @add_slots on a non-dataclass class.
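A rough sketch of what option 3 could look like. The body of `add_slots` below is an assumption written for illustration, not code taken from this repository: it rebuilds the class with `__slots__` derived from `dataclasses.fields` and rejects anything that is not a dataclass.

```python
import dataclasses
from dataclasses import dataclass


def add_slots(cls):
    """Hypothetical sketch: return a copy of dataclass `cls` that uses __slots__."""
    if not (isinstance(cls, type) and dataclasses.is_dataclass(cls)):
        raise TypeError(f"{cls!r} is not a dataclass")
    if "__slots__" in cls.__dict__:
        raise TypeError(f"{cls.__name__} already defines __slots__")

    namespace = dict(cls.__dict__)
    field_names = tuple(f.name for f in dataclasses.fields(cls))
    namespace["__slots__"] = field_names
    for name in field_names:
        # Defaults are stored as class attributes and would clash with the
        # slots; the generated __init__ already carries them, so drop them.
        namespace.pop(name, None)
    # These descriptors belong to the old class and must not be copied over.
    namespace.pop("__dict__", None)
    namespace.pop("__weakref__", None)

    # __slots__ only takes effect at class creation, so build a new class.
    new_cls = type(cls)(cls.__name__, cls.__bases__, namespace)
    new_cls.__qualname__ = cls.__qualname__
    return new_cls


@add_slots
@dataclass
class C:
    x: int
    y: int = 1


c = C(1)
print(c.x, c.y)      # the default for y survives the rebuild
print(C.__slots__)   # slot names follow the field order
```

Note that the decorator returns a different class object than the one `@dataclass` saw, which is exactly the deviation from "we just add a few dunder methods" discussed above.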

ilevkivskyi commented 7 years ago

I think we should allow __slots__. Although they are not mainstream, they are still used. I am, however, not sure about the API we should use. I think @add_slots still sounds like you patch an existing class. Maybe call it @with_slots? Finally, maybe we can still use a single decorator, but call the keyword with_slots to distinguish it from other keywords? My point is that people who will use with_slots are probably familiar with how slots work, so they will not be surprised that this option returns a new class.

gvanrossum commented 7 years ago

I propose to punt this down the road. If people want slots they can manually add __slots__ = ('x', 'y', 'z') to their class.

Regarding whether people would be surprised by the need to generate a new class, I was surprised, and I built slots. :-)

In the future we can choose any of the other options. I would be fine with eventually getting back slots=True and only generating a new class if that's given. (FWIW it should probably complain if any base class has a __dict__ -- that's a common error case.)
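The base-class error case can be illustrated with plain classes. This is a minimal sketch; `bases_with_dict` is a hypothetical helper name, not part of any library.

```python
class Base:
    """No __slots__ here, so every instance gets a __dict__."""


class Child(Base):
    __slots__ = ("x",)


child = Child()
child.x = 1
child.extra = 2  # accepted: the inherited __dict__ absorbs it

print(hasattr(child, "__dict__"))  # True
print(child.__dict__)              # {'extra': 2}


def bases_with_dict(cls):
    """Hypothetical check: list the base classes that contribute a __dict__."""
    return [b for b in cls.__mro__[1:] if "__dict__" in vars(b)]


print(bases_with_dict(Child))
```

A slots option could call a check like this and complain when the list is non-empty, since the memory saving is lost in that case.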

In the meantime people can also use NamedTuple if they just want slots.

ericvsmith commented 7 years ago

Agreed. I removed slots in PR #30. The git tag last-version-with-slots points to the code where slots was working.

cjrh commented 6 years ago

@ericvsmith Adding __slots__ manually works as long as there are no defaults:

>>> @dataclass
... class C:
...     __slots__ = {'x', 'y'}
...     x: int
...     y: int
...     
>>> o = C(1,2)
>>> o
C(x=1, y=2)
>>> @dataclass
... class C:
...     __slots__ = {'x', 'y'}
...     x: int
...     y: int = 1
...     
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: 'y' in __slots__ conflicts with class variable

You're likely already aware of this, but I'm letting you know on the small chance it got missed.

(My interest in this is making dataclasses work with my "autoslot" toy class, which injects slots into the class definition via a metaclass-enabled superclass: https://github.com/cjrh/autoslot. To make it compatible with @dataclass, inside my metaclass I can look for __annotations__ in the cls namespace, and that works fine, but I can't get around the class variable problem in the traceback above.)
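The conflict is not specific to dataclasses. A default written as `y: int = 1` is an ordinary class attribute, and the interpreter rejects a class attribute that shares a name with a slot. A minimal reproduction without any decorator:

```python
try:
    class C:
        __slots__ = ("x", "y")
        x: int
        y: int = 1  # class attribute with the same name as a slot
except ValueError as exc:
    message = str(exc)

print(message)  # 'y' in __slots__ conflicts with class variable
```

The error is raised while the class body is being turned into a class, so @dataclass never gets a chance to run.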

cjrh commented 6 years ago

Thinking it over, I think my use-case is different to what dataclasses are for, and so compatibility probably doesn't make sense anyway.

aaronchall commented 6 years ago

I totally think slots should be default behavior.

(Disclaimer - I gave the Pycon 2017 slots talk: https://www.youtube.com/watch?v=N7MfisN44nY and I had the latest contribution to the datamodel docs on __slots__)

To break it down: slots add a data descriptor to the class that points to a slot in a struct-like data structure. They are accessed pretty fast, and they take much less space than even the new smaller dict (roughly the space of a tuple). It should be easy to programmatically determine whether they should be added in the child or not. This should be a strictly dominant addition. But adding it later could break backwards compatibility if users start making the unfortunate decision to assume access to __dict__ directly or via vars.
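A small sketch of those points, using a hand-written class (`Point` is just an illustrative name): each slot shows up as a descriptor on the class, and instances have no `__dict__` for `vars` to return.

```python
import types


class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

print(type(Point.x).__name__)                           # member_descriptor
print(isinstance(Point.x, types.MemberDescriptorType))  # True
print(hasattr(p, "__dict__"))                           # False

try:
    vars(p)
except TypeError as exc:
    # Code that assumes vars() or __dict__ works breaks on slotted classes.
    print("vars() failed:", exc)
```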

Here's some finer points relevant to the dataclasses, as I see it:

dan-blanchard commented 6 years ago

Without slots, the usability of data classes is really limited. When I would want to use something like this, it is almost always in a situation where I will have many instances of the same simple data points. Without __slots__, that becomes untenable memory-wise. It's interesting that you can combine the two approaches when you don't set defaults, but the defaults are part of what make this useful in the first place.

ericvsmith commented 6 years ago

You can use code like @add_slots from https://github.com/ericvsmith/dataclasses/blob/master/dataclass_tools.py

>>> from dataclasses import *
>>> from dataclass_tools import *
>>> @add_slots
... @dataclass
... class C:
...    i: int = 10
...
>>> c=C()
>>> c
C(i=10)
>>> c.__slots__
('i',)
>>> c.j=0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'C' object has no attribute 'j'
>>>

The reason this isn't in dataclasses itself is that all other features just involve adding methods to your class. __slots__ requires creating a new class, because @dataclass doesn't get control until after the class has been created, at which point it is too late to set __slots__.
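To see why it is too late, compare assigning `__slots__` after the fact with declaring it in the class body. This is an illustrative sketch; `Late` and `Early` are made-up names.

```python
class Late:
    pass


# Assigning afterwards only creates an ordinary class attribute;
# the instance layout was already fixed when the class was created.
Late.__slots__ = ("x",)

late = Late()
late.anything = 1
print(hasattr(late, "__dict__"))  # True


class Early:
    __slots__ = ("x",)


early = Early()
try:
    early.anything = 1
except AttributeError as exc:
    print("rejected:", exc)
```

A class decorator runs at the same point as the assignment to `Late.__slots__`, which is why a decorator has to build a replacement class to get the effect of `Early`.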

There are a few possible approaches here:

I suggest taking this to python-ideas if you'd like to champion one of these ideas.

YoSTEALTH commented 5 years ago

I like the idea of adding @dataclass(slots=True)

Arcitec commented 5 years ago

It's very wasteful to have a struct-like data holder class, which relies on a bloated dynamic dictionary for storage. The slots behavior should be the only behavior and dict should be banished from dataclasses. Seriously.

But okay, if we manually add __slots__ to our classes (and do not use default values), will the resulting dataclass still work properly? Or will there be internal dataclass bugs caused by lacking a dict?

Arcitec commented 5 years ago

I just saw https://www.youtube.com/watch?v=T-TwcmT6Rcw on YouTube and it ends with saying that yes you can manually add slots to dataclasses.

But I have decided to use attrs instead. This comment from YouTube sums it up well:

For a company that does not allow external packages (due to code safety reasons), use dataclasses. For everyone else, always use the attrs package. It is much better. Dataclasses is a subset of attrs. So with attrs you can do everything and more. Attrs allows auto-generating "slots" to optimize memory usage, and allows adding validators if you want, etc.

To illustrate the need, here's an example for a class with 3 attributes, on 64-bit Python 3.7.4:

  • @dataclass or @attr.s class: creates a regular dictionary to hold all instance values. The instances are exactly the same size regardless of whether the attrs or dataclasses library is used: 424 bytes. And every new field you add bloats each instance by 88 bytes.

  • @attr.s(slots=True): creates a "slots" class to hold all instance values. The instances only use 160 bytes in memory. And every new field you add increases the instance size by 40 bytes.

So forget about dataclasses. Use the attrs library with slots. It offers more features, less memory, and more speed (since slots are faster than dictionaries). What's not to love?! ;)

I agree. Sure, you can add slots manually to dataclasses, but then you lose default values, and you have to manually write each variable name in the slots list. Ew. And the dataclass instance with manually written slots was only 8 bytes smaller than the equivalent attrs instance, which can be explained by attrs metadata variables or something like that, and isn't much extra RAM to pay for all the huge benefits of attrs.

Arcitec commented 5 years ago

import attr
from dataclasses import dataclass
from pympler import asizeof
import time

# every additional field adds 88 bytes
@attr.s
class A:
    a = attr.ib(type=int, default=0)
    b = attr.ib(type=int, default=4)
    c = attr.ib(type=int, default=2)
    d = attr.ib(type=int, default=8)

# every additional field adds 40 bytes
@attr.s(slots=True)
class B:
    a = attr.ib(type=int, default=0)
    b = attr.ib(type=int, default=4)
    c = attr.ib(type=int, default=2)
    d = attr.ib(type=int, default=8)

# every additional field adds 88 bytes
@dataclass
class C:
    a: int = 0
    b: int = 4
    c: int = 2
    d: int = 8

# every additional field adds 40 bytes
@dataclass
class D:
    __slots__ = {"a", "b", "c", "d"}
    a: int
    b: int
    c: int
    d: int

Ainst = A()
Binst = B()
Cinst = C()
Dinst = D(0,4,2,8)

print("attrs size", asizeof.asizeof(Ainst)) # 512 bytes

print("attrs-with-slots size", asizeof.asizeof(Binst)) # 200 bytes

print("dataclass size", asizeof.asizeof(Cinst)) # 512 bytes

print("dataclass-with-slots size", asizeof.asizeof(Dinst)) # 192 bytes

s = time.perf_counter()
for i in range(0,250000000):
    x = Ainst.a
elapsed = time.perf_counter() - s
print("elapsed attrs:", (elapsed*1000), "milliseconds")

s = time.perf_counter()
for i in range(0,250000000):
    x = Binst.a
elapsed = time.perf_counter() - s
print("elapsed attrs-with-slots:", (elapsed*1000), "milliseconds")

s = time.perf_counter()
for i in range(0,250000000):
    x = Cinst.a
elapsed = time.perf_counter() - s
print("elapsed dataclass:", (elapsed*1000), "milliseconds")

s = time.perf_counter()
for i in range(0,250000000):
    x = Dinst.a
elapsed = time.perf_counter() - s
print("elapsed dataclass-with-slots:", (elapsed*1000), "milliseconds")

Results: Slots win heavily in the memory usage department, regardless of whether you use dataclass or attrs. And a dataclass with manually written slots reduces total usage by 8 bytes (a static number that does not change with the number of fields) compared to attrs-with-slots. But dataclass loses with its lack of features, lack of default values if slots are used, and the tedious way of writing slots manually (see class "D").

attrs size 512
attrs-with-slots size 200
dataclass size 512
dataclass-with-slots size 192

As for data access benchmarks: the results varied too much between runs to draw any conclusions, except that slots were slightly faster than dictionary-based storage and that there's no real difference in access speed between the dataclass and attrs libraries.
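One way to reduce that run-to-run noise is to repeat the measurement and keep the fastest run, which the standard `timeit` module supports. The sketch below is not the benchmark used above; `best_time`, `WithDict` and `WithSlots` are names invented for illustration, and only the two dataclass variants are compared.

```python
import timeit
from dataclasses import dataclass


@dataclass
class WithDict:
    a: int = 0


@dataclass
class WithSlots:
    __slots__ = ("a",)
    a: int


def best_time(obj, number=200_000, repeat=5):
    """Return the fastest of several timed runs of `obj.a`, in seconds."""
    timer = timeit.Timer("obj.a", globals={"obj": obj})
    return min(timer.repeat(repeat=repeat, number=number))


print("dict :", best_time(WithDict()))
print("slots:", best_time(WithSlots(0)))
```

Taking the minimum rather than the mean discards runs that were slowed down by other activity on the machine.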

ericvsmith commented 5 years ago

I suggest you raise this issue on the python-ideas mailing list. This tracker is just for the backport of dataclasses features to Python 3.6 (which admittedly I'm behind on, but I'll get to it).

When this issue is on python-ideas, I'll post my thoughts there.

Arcitec commented 5 years ago

@ericvsmith Ah I didn't realize that. I'll post on the mailing list.

ciupicri commented 5 years ago

For what it's worth, there is also typing.NamedTuple, which uses slots and lets you give a field a default value, though fields with a default value must come after any fields without a default. Example:

from typing import NamedTuple

class Employee(NamedTuple):
    name: str
    id: int = 3