Closed ilevkivskyi closed 7 years ago
After some thinking, I also added `tuple_factory`, so that `astuple()` can be used as an adaptor to a legacy named tuple API.
Thanks for the work! I'm still reviewing it.
But one quick note: Wouldn't a better name for `copy_fields` be `shallow_copy`?
> But one quick note: Wouldn't a better name for `copy_fields` be `shallow_copy`?
I now have an alternative idea: call it `copy_function` and make it accept a function (or `None`), so a user can opt out of copying, or give `copy.deepcopy` or another custom function.
OK, pushed a new commit. Now a user can choose `copy_function` and `dict_factory`; the defaults are `copy.copy` and `dict` (and similarly for tuples).
Sorry to take so long to review this. I'm not crazy about the `copy_function` parameter. Other than `copy.copy` and `copy.deepcopy`, what other parameters do you envision being used? If there are only a few options, let's just support those instead of over-generalizing.

Note that `attrs` makes do without a `copy_function` parameter: http://www.attrs.org/en/stable/api.html#attr.asdict. So taking that as an example, maybe we should omit it and add it later if the need arises.
And if we do keep it, I think I'd rather the default be `None` (or `lambda x: x`), to just support straight assignment. That way we don't have to import `copy` at all.
Thoughts?
@ericvsmith Thanks for review!
> I'm not crazy about the `copy_function` parameter. Other than `copy.copy` and `copy.deepcopy`, what other parameters do you envision being used?
I have noticed that sometimes classes use custom copy methods. For example, in mypy several classes have `copy_modified`.
> So taking that as an example, maybe we should omit it and add it later if the need arises.
I think it is important to cover mutable value types. I could imagine that someone will be surprised by something like this:
```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class C:
    x: List[int]

c = C([1, 2, 3])
c_dict = asdict(c)
c_dict['x'][1] = 42
assert c.x[1] == 2  # Fails: the list in c_dict is the same object as c.x
```
especially if there is no easy way to override this. To simplify the docs I can move `copy_function` to the end of the signatures.
> And if we do keep it, I think I'd rather the default be `None` (or `lambda x: x`), to just support straight assignment. That way we don't have to import `copy` at all.
I don't have a strong opinion here. If you think it is important to minimize the number of imports, then I will make `None` the default (which is equivalent to `lambda x: x`).
Umm, does this actually need `asdict()` and `astuple()` given the existence of `vars()`? Since all values will just be in `__dict__` as normal, `vars()` should just work, as with normal classes. This might be fragile going forward, so maybe that's the reason?
```python
>>> class CCC:
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...
>>> z = CCC(1, 2, 3)
>>> vars(z)
{'a': 1, 'b': 2, 'c': 3}
```
`namedtuple` needs `._asdict()` because it doesn't put things in `__dict__`:
```python
>>> from collections import namedtuple
>>> x = namedtuple("AAA", "a b c")(1, 2, 3)
>>> x
AAA(a=1, b=2, c=3)
>>> vars(x)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    vars(x)
TypeError: vars() argument must have __dict__ attribute
```
Classes which set default values as class attributes and don't write them onto an instance also suffer:
```python
>>> class BBB:
...     a = 1
...     b = 2
...     c = 3
...
>>> y = BBB()
>>> vars(y)
{}
>>> y.__dict__
{}
```
If we do keep `asdict()`, in my opinion it would be best to have it just do one thing and accept no parameters, like `namedtuple._asdict()`.

I can't think of reasonable use-cases for `astuple()` that aren't covered by `tuple(asdict(x).values())`, assuming `asdict()` returns an `OrderedDict`. Since this also seems like something that would be used much less often, I'm leaning towards excluding it.
For me, the two big features of `asdict()` are:

1. Recursion: a dataclass containing a list of other dataclasses, for example, and being able to create nested dicts from that.
2. The ability to specify a different dict factory, specifically `collections.OrderedDict`. It would be a shame if there were no way to create an `OrderedDict` from an ordered collection of fields (which is what a dataclass is).
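For reference, both features can be sketched with the `asdict()` that eventually shipped in the stdlib `dataclasses` module (Python 3.7+), where `dict_factory` is applied at every level of the recursion:

```python
from dataclasses import dataclass, asdict
from collections import OrderedDict

@dataclass
class Point:
    x: int
    y: int

@dataclass
class Line:
    start: Point
    end: Point

line = Line(Point(0, 0), Point(1, 2))

# 1. Recursion: nested dataclasses become nested dicts.
print(asdict(line))
# {'start': {'x': 0, 'y': 0}, 'end': {'x': 1, 'y': 2}}

# 2. A different dict factory, applied at every level.
print(asdict(line, dict_factory=OrderedDict))
```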
@ericvsmith, regarding those two use cases:

1. That sounds like serialization logic, and I understand serialization is currently considered out of scope. For similar reasons to serialization, I don't think it needs to be baked into `asdict()` itself.
2. I suggest that it always return an `OrderedDict`, which is itself a `dict`, just like `namedtuple._asdict()` does. In the rare cases that actually require `type == dict`, `dict(asdict(x))` is simple and clear.
Regarding the deep/shallow copy, I would also leave that out, i.e. have `asdict()` always just do a simple shallow copy. Getting a deep copy (or any other special logic) is easy to achieve externally, e.g. `copy.deepcopy(asdict(x))`, so I don't see a good reason to push this into the design.
I don't see how `copy.deepcopy(asdict(x))` could do what a recursive `asdict(x)` would do -- it would preserve the types of the values in the toplevel dict, not turn them into dicts recursively. A shallow `asdict()` sounds like just a different spelling for `x.__dict__`, which we don't really need.

I also don't really believe in the need for a dict factory: going forward the dict type itself will be guaranteed ordered, although it won't have the full `OrderedDict` logic, and I expect the usefulness of `OrderedDict` will decline. So I wonder if we shouldn't just have a single always-recursive `asdict()`.

And so what if that smells like serialization logic, and serialization is out of scope? If serialization needs a recursive `asdict()`, it's better to provide one than to force people to invent their own.
This whole episode reminds me though -- while we take the stance that dataclasses are regular classes, we also have a way to recognize a dataclass (else the recursive form can't be implemented at all). This should IMO be a formal part of the protocol, and it would be nice if dataclasses and attrs could cooperate through this protocol. (I may be off base here, I haven't read the PEP draft in a long time. Which reminds me, we should get that PEP on python-dev and get it accepted.)
@gvanrossum: If I understand it correctly, I think the point with deepcopy is that if you have code like:
```python
from dataclasses import *
import copy

# Create a parent/child relationship; the child has a dict field.
@dataclass
class Child:
    d: dict

@dataclass
class Parent:
    child: Child

orig = Parent(Child({1: 2}))

# Create a recursive copy, but just assign fields (no copying).
recursed = asdict(orig, nested=True, copy_function=None)

# These two are the same: the dict in the recursed 'child'
# is the same object as in the original.
print(orig, id(orig.child.d))
print(recursed, id(recursed['child']['d']))

# Now create a deepcopy of the recursed object...
deep = copy.deepcopy(recursed)

# ...and its child's dict is now different.
print(deep, id(deep['child']['d']))
```
This produces the output:

```
Parent(child=Child(d={1: 2})) 4293494464
{'child': {'d': {1: 2}}} 4293494464
{'child': {'d': {1: 2}}} 4293194784
```
The first line is the `orig` object, the second one is the `recursed` copy, and the last one is the deepcopy of `recursed`. The first two share the child's `d` member; the last one has made a copy.
So, it seems to me that we should always just assign fields (no copying), and always just recurse, relying on `copy.deepcopy()` to do any needed copying.
I'm not sure how we'd share the concept of "this is a dataclass" with "this is an attrs object", but I'll have to think about it.
I have some polishing yet to do on the PEP. I know you'd like to get the reviewing done quickly. As soon as some time frees up from my real job I'll get the updated PEP posted. Hopefully shortly after Thanksgiving.
Oops, @ilevkivskyi, I screwed up the merge by dropping this line from tst.py:

```python
from typing import ClassVar, Any, List, Union
```

Unfortunately I don't know enough git/github to fix it. Can you put this line back? Sorry about that.
@ericvsmith No problem, I will fix today or tomorrow morning.
Another issue is just how "recursive" we want to be. Currently, the code in the PR walks into fields that are other dataclasses, but not, for example, fields that are lists/dicts/tuples/etc. that contain dataclasses.
```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Child:
    i: int

@dataclass
class Parent:
    children: List[Child]

o = Parent([Child(0), Child(1)])
print(asdict(o))
print(asdict(o, nested=True))
```
Both print statements produce `{'children': [Child(i=0), Child(i=1)]}`. I think we'd want the second one to produce `{'children': [{'i': 0}, {'i': 1}]}`.
But I'm not sure this is a tractable problem.
OK, I think I now understand the discussion of deepcopy, and I think there should not be a copy_function argument.
I also think there should not be a dict_factory argument -- the default dict is ordered and that's good enough.
I also want the recursive behavior to be the default, and I also want it to recurse into lists, tuples and dicts. And yes, that's because that's the right thing for serialization. But why else would you use `asdict()` instead of `__dict__` or `vars()`?
> I also think there should not be a dict_factory argument -- the default dict is ordered and that's good enough.
That's a stronger guarantee than I've seen discussed recently. But if that's something we can rely on, I can also use it in the dataclass decorator, too. It does mean we can't officially support 3.6, although it would actually work in CPython 3.6, at least (by relying on the implementation detail of ordered dict).
If we restrict recursing into just list, tuple, and dict, then it's simpler than supporting any container in general. I think that's fine, at least for the initial version. @ilevkivskyi: do you want to make these changes?
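For the record, this restricted recursion is what ended up in the stdlib `dataclasses` module (Python 3.7+): `asdict()` walks through list, tuple, and dict containers and converts the dataclass instances inside them. A quick sketch:

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Child:
    i: int

@dataclass
class Parent:
    children: List[Child]

o = Parent([Child(0), Child(1)])
# Dataclass instances inside the list are converted too.
print(asdict(o))
# {'children': [{'i': 0}, {'i': 1}]}
```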
@gvanrossum, @ericvsmith, at the risk of trying your patience I'll make a final attempt to persuade you that recursing should be avoided.
My main argument hasn't changed -- this is going down the path of serialization logic. @gvanrossum wrote:
> If serialization needs a recursive asdict() it's better to provide one than to force people to invent their own.
I argue that those who'll want to serialize dataclass instances will not be wanting `asdict(recursive=True)`.
In my experience serialization logic usually needs to be customized. The issue is that there are many small decisions which different use-cases will require different answers to, and many of these need to be an integral part of the serialization logic. Examples:
There are many more which we will run into, now and in the future, trying to implement `asdict(recursive=True)`. Unlike with `namedtuple._asdict()`, since a dataclass can be any class, the number of edge-cases we'll run into is large.
As a reference I'll mention JSON serialization in Django, where the framework purposefully supplies a bare-bones `DjangoJSONEncoder` and encourages sub-classing it to meet the needs of different use-cases:

> Be aware that not all Django output can be passed unmodified to json. For example, if you have some custom type in an object to be serialized, you'll have to write a custom json encoder for it.
Indeed I have seen such customized JSON encoders in most Django projects and they were all uniquely different. My interpretation of this is that the Django developers realized that serialization needs vary too greatly to be supported by a single tool, so they defined an interface for customizable serializers.
Another relevant reference is pickle, which is similar in that it supports the wide variety of corner cases inherent in serializing and de-serializing general Python objects. My feeling is that since dataclasses are just Python classes with few limitations, the complexity of writing a general serializer for them will be similarly large.
@ericvsmith @gvanrossum
I still think that we need the `dict_factory` argument. This is not only about `dict` vs `OrderedDict`; there are also popular third-party containers, for example `SortedDict`, that provide even stronger guarantees than `OrderedDict`. The same applies to `tuple_factory`, which can be used for conversion/compatibility to legacy namedtuple APIs.
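The legacy-namedtuple use case can be sketched with the `tuple_factory` parameter as it later shipped in the stdlib, which calls the factory with the list of (recursively converted) field values; `LegacyPoint` here is a hypothetical legacy API:

```python
from dataclasses import dataclass, astuple
from collections import namedtuple

@dataclass
class Point:
    x: int
    y: int

# A hypothetical legacy API built around a namedtuple.
LegacyPoint = namedtuple('LegacyPoint', ['x', 'y'])

p = Point(1, 2)
# tuple_factory receives the field values and builds the result.
legacy = astuple(p, tuple_factory=lambda values: LegacyPoint(*values))
print(legacy)  # LegacyPoint(x=1, y=2)
```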
Also I think we need `copy_function`; this is not only about `copy` vs `deepcopy`, and not only about recursion. Many classes have their own custom copy functions, as I already mentioned above. Another problem is that if we remove this argument, the default will not be satisfactory for many (see the long thread about mutable default values and whether to copy them).
The logic behind these two arguments is simple: without them we can satisfy 50% of users, with them we can satisfy 90% of users, and the point is not about 2x more satisfied users, it is about 5x fewer unsatisfied users. For example, two of the bullets by @taleinat would be fixed by these arguments.
I agree that we need to look inside built-in containers for recursive copy, and with the other changes proposed so far. To summarize:

- make `nested=True` (the recursive behavior) the default
- move `copy_function` to the end of the signature
- make `copy_function=None` the default
by defaultActually after some more thinking and playing with the implementation, I agree that copy_function
and nested
arguments are not needed. I left only one argument dict_factory
(and tuple_factory
for astuple
). As for custom copy function, this can be fixed by using __copy__
in many cases and for shallow dict
one can use vars
. For factory however, there are three arguments to keep it:
dict
, OrderedDict
, SortedDict
, named tuples, etc.attrs
have the same parameter, so it will be familiar for attrs
users.Currently the logic is easy to explain in one sentence: make a deep copy replacing dataclasses with dict_factory
. The additional bonus is that implementation is very simple. It is pushed in a new commit. Please review.
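Under these "deep copy, replacing dataclasses with `dict_factory`" semantics, mutations of the result can no longer leak back into the original. A quick check, matching the behavior that shipped in the stdlib:

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class C:
    x: List[int]

c = C([1, 2, 3])
d = asdict(c)
d['x'][1] = 42   # mutate the copy
print(c.x)       # [1, 2, 3] -- the original list was deep-copied
```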
OK, I am going to stop arguing against dict_factory.
Do we care about cycles in the structure? E.g. parent -> child -> parent? Pickling supports this (and other sharing), but JSON does not. Since asdict() has nothing to do with pickling but is the starting point for JSON, I propose we needn't worry about it (it would also wreak havoc with other supplied functions such as `__repr__` and `__eq__`). I do think we should be clear about this somewhere in the PEP.
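To make the cycle concern concrete: a recursive `asdict()` of this design does not detect cycles, so a parent/child cycle simply exhausts the stack (the `Node` class below is a hypothetical illustration):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Node:
    parent: Optional['Node'] = None
    child: Optional['Node'] = None

p = Node()
c = Node(parent=p)
p.child = c  # create a cycle: p -> c -> p

try:
    asdict(p)
except RecursionError:
    print('asdict() does not handle cycles')
```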
(One option I can think of would be to exclude the parent pointer by not making it a field. Another would be to have a way to exclude it from asdict() through a similar mechanism as used to exclude a field from `__repr__`, `__init__` or `__cmp__`. Maybe we could use the `repr=False` flag to mean that the field should be excluded from the dict? Or add an extra `dict=True` flag to `field`?)
I'm still curious about the design of a protocol we could share with attrs.
@ericvsmith Thanks for review! I indeed forgot to add the check at the top, will fix this soon.
@gvanrossum: `attrs` has an optional `filter` argument for `asdict()` with signature `Callable[[Field], bool]` that says which fields should be excluded. I was thinking about adding it, but I am still not sure if it is often needed.
I think we can leave the filter out. This code is not rocket science; someone who needs a slight variation can easily roll their own.
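A rolled-your-own field filter is indeed only a few lines using the `fields()` introspection helper; `asdict_filtered` below is a hypothetical sketch, not part of the proposal:

```python
from dataclasses import dataclass, fields

@dataclass
class User:
    name: str
    password: str

def asdict_filtered(obj, exclude=()):
    """Shallow dict of a dataclass instance, skipping some field names."""
    return {f.name: getattr(obj, f.name)
            for f in fields(obj)
            if f.name not in exclude}

u = User('alice', 'hunter2')
print(asdict_filtered(u, exclude={'password'}))  # {'name': 'alice'}
```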
I agree on making this as simple as possible, and leave it to others to create more elaborate serialization schemes.
@ilevkivskyi: I'll review the patch as soon as it's ready. Thanks for all of your work and patience.
@gvanrossum: I'll open an issue about finding a common protocol with attrs.
@ericvsmith OK, I fixed the problem and added corresponding tests.
Oops, I made a merge conflict (and a few bugs). Now it should be resolved and ready to merge.
This PR adds two keyword arguments, `copy_fields` and `nested`, to `asdict()` and `astuple()`. I think the first one should be on by default, so that simple mutable fields (like lists) will be copied by `asdict()` and `astuple()`.

As well, as discussed in https://github.com/ericvsmith/dataclasses/pull/26, this adds a `dict_factory` keyword argument, so that a user can use `OrderedDict` or another custom mapping.