ericvsmith / dataclasses


Improve asdict and astuple #52

Closed ilevkivskyi closed 7 years ago

ilevkivskyi commented 7 years ago

This PR adds two keyword arguments copy_fields and nested to asdict() and astuple(). I think the first one should be on by default, so that simple mutable fields (like lists) will be copied by asdict() and astuple().

Also, as discussed in https://github.com/ericvsmith/dataclasses/pull/26, this adds a dict_factory keyword argument so that a user can use OrderedDict or another custom mapping.
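
For illustration, a minimal sketch of how this might look in use, assuming the signature proposed in this PR (copy_fields and nested are proposal names from this branch; dict_factory is the argument carried over from #26):

from collections import OrderedDict
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class C:
    xs: List[int]

c = C([1, 2, 3])

# With copy_fields on by default, the list stored under 'xs' is a copy,
# so mutating the result does not touch c.xs.
d = asdict(c)
d['xs'].append(4)
assert c.xs == [1, 2, 3]

# dict_factory lets the caller choose the mapping type.
od = asdict(c, dict_factory=OrderedDict)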

ilevkivskyi commented 7 years ago

After some thinking, I also added tuple_factory so that:

ericvsmith commented 7 years ago

Thanks for the work! I'm still reviewing it.

But one quick note: Wouldn't a better name for copy_fields be shallow_copy?

ilevkivskyi commented 7 years ago

But one quick note: Wouldn't a better name for copy_fields be shallow_copy?

I now have an alternative idea: call it copy_function and make it accept a function (or None), so a user can opt-out of copying, or give copy.deepcopy or other custom function.
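
To make the idea concrete, here is what each choice would mean for a single field value (a sketch only; copy_function as an asdict() parameter is just this PR's proposal):

import copy

value = [[1, 2], [3, 4]]

no_copy = (lambda x: x)(value)   # copy_function=None: straight assignment
shallow = copy.copy(value)       # copy_function=copy.copy (the proposed default)
deep = copy.deepcopy(value)      # copy_function=copy.deepcopy

assert no_copy is value
assert shallow is not value and shallow[0] is value[0]
assert deep is not value and deep[0] is not value[0]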

ilevkivskyi commented 7 years ago

OK, pushed a new commit. Now a user can choose copy_function and dict_factory; the defaults are copy.copy and dict (and similarly for tuples).

ericvsmith commented 7 years ago

Sorry to take so long to review this. I'm not crazy about the copy_function parameter. Other than copy.copy and copy.deepcopy, what other parameters do you envision being used? If there are only a few options, let's just support those instead of over-generalizing.

Note that attrs makes do without a copy_function parameter: http://www.attrs.org/en/stable/api.html#attr.asdict. So taking that as an example, maybe we should omit it and add it later if the need arises.

And if we do keep it, I think I'd rather the default be None (or lambda x: x), to just support straight assignment. That way we don't have to import copy at all.

Thoughts?

ilevkivskyi commented 7 years ago

@ericvsmith Thanks for the review!

I'm not crazy about the copy_function parameter. Other than copy.copy and copy.deepcopy, what other parameters do you envision being used?

I have noticed that sometimes classes use custom copy methods. For example, in mypy several classes have copy_modified.

So taking that as an example, maybe we should omit it and add it later if the need arises.

I think it is important to cover mutable value types. I could imagine that someone will be surprised by something like this:

from dataclasses import dataclass, asdict
from typing import List

@dataclass
class C:
    x: List[int]

c = C([1, 2, 3])
c_dict = asdict(c)
c_dict['x'][1] = 42
assert c.x[1] == 2  # Fails

especially if there is no easy way to override this. To simplify docs I can move copy_function to the end of the signatures.

And if we do keep it, I think I'd rather the default be None (or lambda x: x), to just support straight assignment. That way we don't have to import copy at all.

I don't have a strong opinion here. If you think it is important to minimize the number of imports, then I will make None the default (which is equivalent to lambda x: x).

taleinat commented 7 years ago

Umm, do we actually need asdict() and astuple(), given the existence of vars()? Since all values will just be in __dict__ as normal, vars() should just work, as with normal classes. This might be fragile going forward, so maybe that's the reason?

>>> class CCC:
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

>>> z = CCC(1, 2, 3)
>>> vars(z)
{'a': 1, 'b': 2, 'c': 3}

namedtuple needs ._asdict() because it doesn't put things in __dict__:

>>> from collections import namedtuple
>>> x = namedtuple("AAA", "a b c")(1, 2, 3)
>>> x
AAA(a=1, b=2, c=3)
>>> vars(x)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    vars(x)
TypeError: vars() argument must have __dict__ attribute

Classes which set default values as class attributes and don't write them onto an instance also suffer:

>>> class BBB:
    a = 1
    b = 2
    c = 3

>>> y = BBB()
>>> vars(y)
{}
>>> y.__dict__
{}

taleinat commented 7 years ago

If we do keep asdict(), in my opinion it would be best to have it just do one thing and accept no parameters, like namedtuple._asdict().

I can't think of reasonable use-cases for astuple() that aren't covered by tuple(asdict(x).values()), assuming asdict() returns an OrderedDict. Since this also seems like something that would be used much less often, I'm leaning towards excluding it.

ericvsmith commented 7 years ago

For me, the two big features of asdict() are:

  1. Recursion: a dataclass containing a list of other dataclasses, for example, and being able to create nested dicts from that.

  2. The ability to specify a different dict factory, specifically collections.OrderedDict. It would be a shame if there were no way to create an OrderedDict from an ordered collection of fields (which is what a dataclass is).
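
A sketch of what those two features would look like together, assuming asdict() recurses into nested dataclasses (the eventual goal discussed later in this thread) and accepts a dict_factory argument:

from collections import OrderedDict
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Inner:
    i: int

@dataclass
class Outer:
    items: List[Inner]

o = Outer([Inner(0), Inner(1)])

# 1. Recursion: nested dataclasses become nested dicts.
assert asdict(o) == {'items': [{'i': 0}, {'i': 1}]}

# 2. A different dict factory, e.g. collections.OrderedDict.
od = asdict(o, dict_factory=OrderedDict)
assert isinstance(od, OrderedDict)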

taleinat commented 7 years ago

@ericvsmith, regarding those two use cases:

  1. That sounds like serialization logic, and I understand serialization is currently considered out of scope. For similar reasons, I don't think it needs to be baked into asdict() itself.

  2. I suggest that it always return an OrderedDict, which is itself a dict, just like namedtuple._asdict() does. In the rare cases that actually require the exact type dict, dict(asdict(x)) is simple and clear.

taleinat commented 7 years ago

Regarding the deep/shallow copy, I would also leave that out, i.e. have asdict() always just do a simple shallow copy. Getting a deep copy (or any other special logic) is easy to achieve externally, e.g. copy.deepcopy(asdict(x)), so I don't see a good reason to push this into the design.

gvanrossum commented 7 years ago

I don't see how copy.deepcopy(asdict(x)) could do what a recursive asdict(x) would do -- it would preserve the types of the values in the toplevel dict, not turn them into dicts recursively. A shallow asdict() sounds like just a different spelling for x.__dict__, which we don't really need.

I also don't really believe in the need for a dict factory: going forward, the dict type itself will be guaranteed to be ordered, although it won't have the full OrderedDict logic, and I expect the usefulness of OrderedDict will decline. So I wonder if we shouldn't just have a single always-recursive asdict().

And so what if that smells like serialization logic, even though serialization is out of scope? If serialization needs a recursive asdict(), it's better to provide one than to force people to invent their own.

This whole episode reminds me though -- while we take the stance that dataclasses are regular classes, we also have a way to recognize a dataclass (else the recursive form can't be implemented at all). This should IMO be a formal part of the protocol, and it would be nice if dataclasses and attrs could cooperate through this protocol. (I may be off base here, I haven't read the PEP draft in a long time. Which reminds me, we should get that PEP on python-dev and get it accepted.)

ericvsmith commented 7 years ago

@gvanrossum: If I understand it correctly, I think the point with deepcopy is that if you have code like:

from dataclasses import *
import copy

# create a parent/child relationship, child has a dict field
@dataclass
class Child:
    d: dict

@dataclass
class Parent:
    child: Child

orig = Parent(Child({1:2}))

# create a recursive copy, but just copy fields
recursed = asdict(orig, nested=True, copy_function=None)

# these two are the same: the dict in the recursed 'child' is the same as the original object
print(orig, id(orig.child.d))
print(recursed, id(recursed['child']['d']))

# now create a deepcopy of the recursed object
deep = copy.deepcopy(recursed)

# and its child's dict is now different
print(deep, id(deep['child']['d']))

This produces the output:

Parent(child=Child(d={1: 2})) 4293494464
{'child': {'d': {1: 2}}} 4293494464
{'child': {'d': {1: 2}}} 4293194784

The first line is the orig object, the second one is the recursed copy, and the last one is the deepcopy of recursed. The first two share the child "d" member, the last one has made a copy.

So, it seems to me that we should always just assign fields (no copying), and always just recurse, relying on copy.deepcopy() to do any needed copying.

I'm not sure how we'd share the concept of "this is a dataclass" with "this is an attrs object", but I'll have to think about it.

I have some polishing yet to do on the PEP. I know you'd like to get the reviewing done quickly. As soon as some time frees up from my real job I'll get the updated PEP posted. Hopefully shortly after Thanksgiving.

ericvsmith commented 7 years ago

Oops, @ilevkivskyi, I screwed up the merge by dropping this line from tst.py:

from typing import ClassVar, Any, List, Union

Unfortunately I don't know enough git/github to fix it. Can you put this line back? Sorry about that.

ilevkivskyi commented 7 years ago

@ericvsmith No problem, I will fix today or tomorrow morning.

ericvsmith commented 7 years ago

Another issue is just how "recursive" we want to be. Currently, the code in the PR walks into fields that are other dataclasses, but not, for example, fields that are lists/dicts/tuples/etc. that contain dataclasses.

from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Child:
    i: int

@dataclass
class Parent:
    children: List[Child]

o = Parent([Child(0), Child(1)])
print(asdict(o))
print(asdict(o, nested=True))

Both print statements produce {'children': [Child(i=0), Child(i=1)]}. I think we'd want the second one to produce {'children': [{'i': 0}, {'i': 1}]}.

But I'm not sure this is a tractable problem.
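
One way to make it tractable (a standalone sketch, not the PR's code): recurse through lists, tuples and dicts as well, converting any dataclass instance encountered along the way.

from dataclasses import dataclass, fields, is_dataclass
from typing import List

def nested_asdict(obj):
    # Dataclass instance: convert each field recursively.
    if is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: nested_asdict(getattr(obj, f.name)) for f in fields(obj)}
    # Lists and tuples: rebuild with converted elements (namedtuples and
    # other tuple subclasses would need extra care).
    if isinstance(obj, (list, tuple)):
        return type(obj)(nested_asdict(v) for v in obj)
    # Dicts: convert keys and values.
    if isinstance(obj, dict):
        return {nested_asdict(k): nested_asdict(v) for k, v in obj.items()}
    # Everything else is returned as-is.
    return obj

@dataclass
class Child:
    i: int

@dataclass
class Parent:
    children: List[Child]

o = Parent([Child(0), Child(1)])
assert nested_asdict(o) == {'children': [{'i': 0}, {'i': 1}]}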

gvanrossum commented 7 years ago

OK, I think I now understand the discussion of deepcopy, and I think there should not be a copy_function argument.

I also think there should not be a dict_factory argument -- the default dict is ordered and that's good enough.

I also want the recursive behavior to be the default and I also want it to recurse into lists, tuples and dicts. And yes, that's because that's the right thing for serialization. But why else would you use asdict() instead of __dict__ or vars()?

ericvsmith commented 7 years ago

I also think there should not be a dict_factory argument -- the default dict is ordered and that's good enough.

That's a stronger guarantee than I've seen discussed recently. But if that's something we can rely on, I can also use it in the dataclass decorator, too. It does mean we can't officially support 3.6, although it would actually work in CPython 3.6, at least (by relying on the implementation detail of ordered dict).

If we restrict recursing into just list, tuple, and dict, then it's simpler than supporting any container in general. I think that's fine, at least for the initial version. @ilevkivskyi: do you want to make these changes?

taleinat commented 7 years ago

@gvanrossum, @ericvsmith, at the risk of trying your patience I'll make a final attempt to persuade you that recursing should be avoided.

My main argument hasn't changed -- this is going down the path of serialization logic. @gvanrossum wrote:

If serialization needs a recursive asdict() it's better to provide one than to force people to invent their own.

I argue that those who want to serialize dataclass instances will not be well served by asdict(recursive=True).

In my experience serialization logic usually needs to be customized. The issue is that there are many small decisions which different use-cases will require different answers to, and many of these need to be an integral part of the serialization logic. Examples:

There are many more that we will run into, now and in the future, while trying to implement asdict(recursive=True). Unlike with namedtuple._asdict(), a dataclass can be any class, so the number of edge cases we'll run into is large.

As a reference I'll mention JSON serialization in Django, where the framework purposefully supplies a bare-bones DjangoJSONEncoder and encourages sub-classing it to meet the needs of different use-cases:

Be aware that not all Django output can be passed unmodified to json. For example, if you have some custom type in an object to be serialized, you’ll have to write a custom json encoder for it.

Indeed I have seen such customized JSON encoders in most Django projects and they were all uniquely different. My interpretation of this is that the Django developers realized that serialization needs vary too greatly to be supported by a single tool, so they defined an interface for customizable serializers.

Another relevant reference is pickle, which is similar in that it supports the wide variety of corner cases inherent in serializing and de-serializing general Python objects. My feeling is that since dataclasses are just Python classes with few limitations, the complexity of writing a general serializer for them will be similarly large.

ilevkivskyi commented 7 years ago

@ericvsmith @gvanrossum

I still think that we need the dict_factory argument. This is not only about dict vs OrderedDict; there are also popular third-party containers, for example SortedDict, that provide even stronger guarantees than OrderedDict. The same applies to tuple_factory, which can be used for conversion to, or compatibility with, legacy namedtuple APIs.
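
For example, a sketch of the two cases (assuming the factories are simply handed the ordered field values; sortedcontainers.SortedDict is shown only in a comment since it is a third-party package):

from collections import namedtuple
from dataclasses import dataclass, astuple

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)

# Compatibility with a legacy namedtuple-based API via tuple_factory.
LegacyPoint = namedtuple('LegacyPoint', ['x', 'y'])
lp = astuple(p, tuple_factory=lambda values: LegacyPoint(*values))
assert lp == LegacyPoint(1, 2)

# A third-party mapping via dict_factory would look similar, e.g.:
#   from sortedcontainers import SortedDict
#   sd = asdict(p, dict_factory=SortedDict)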

Also, I think we need copy_function; this is not only about copy vs deepcopy, and not only about recursion. Many classes have their own custom copy functions, as I already mentioned above. Another problem is that if we remove this argument, the default will not be satisfactory for many (see the long thread about mutable default values and whether to copy them).

The logic behind these two arguments is simple: without them we can satisfy 50% of users, with them we can satisfy 90%. The point is not about 2x more satisfied users, it is about 5x fewer unsatisfied users. For example, two of the bullets by @taleinat would be addressed by these arguments.

I agree that we need to look inside built-in containers for the recursive copy, and I agree with the other changes proposed so far. To summarize:

ilevkivskyi commented 7 years ago

Actually, after some more thinking and playing with the implementation, I agree that the copy_function and nested arguments are not needed. I left only one argument, dict_factory (and tuple_factory for astuple). As for a custom copy function, this can be handled by defining __copy__ in many cases, and for a shallow dict one can use vars(). For the factory, however, there are three arguments for keeping it:

Currently the logic is easy to explain in one sentence: make a deep copy, replacing dataclasses with dict_factory. The additional bonus is that the implementation is very simple. I have pushed it in a new commit. Please review.
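
A sketch of the two workarounds mentioned above: vars() for a shallow field dict, and __copy__ for classes that need custom copy logic.

import copy
from dataclasses import dataclass, field
from typing import List

@dataclass
class C:
    xs: List[int] = field(default_factory=list)

    def __copy__(self):
        # Custom copy logic; copy.copy() will use this hook.
        return C(list(self.xs))

c = C([1, 2, 3])

shallow = vars(c)        # plain field dict; values are shared with c
clone = copy.copy(c)     # goes through C.__copy__

assert shallow['xs'] is c.xs
assert clone.xs == c.xs and clone.xs is not c.xs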

gvanrossum commented 7 years ago

OK, I am going to stop arguing against dict_factory.

Do we care about cycles in the structure? E.g. parent -> child -> parent? Pickling supports this (and other sharing), but JSON does not. Since asdict() has nothing to do with pickling but is the starting point for JSON, I propose we needn't worry about it (it would also wreak havoc with other supplied functions such as __repr__ and __eq__). I do think we should be clear about this somewhere in the PEP.
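
To make the concern concrete, a small sketch (assuming the fully recursive asdict()) of why a cycle cannot be handled:

from dataclasses import dataclass, asdict
from typing import Any, Optional

@dataclass
class Node:
    parent: Optional['Node'] = None
    child: Any = None

parent = Node()
parent.child = Node(parent=parent)   # child points back to its parent

try:
    asdict(parent)                   # the recursion never terminates
except RecursionError:
    print('cycle detected the hard way')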

(One option I can think of would be to exclude the parent pointer by not making it a field. Another would be to have a way to exclude it from asdict() through a similar mechanism as used to exclude a field from __repr__, __init__ or __cmp__. Maybe we could use the repr=False flag to mean that the field should be excluded from the dict? Or add an extra dict=True flag to field?)

I'm still curious about the design of a protocol we could share with attrs.

ilevkivskyi commented 7 years ago

@ericvsmith Thanks for the review! I indeed forgot to add the check at the top; I will fix it soon.

@gvanrossum attrs has an optional filter argument for asdict(), with signature Callable[[Field], bool], that says which fields should be excluded. I was thinking about adding it, but I am still not sure how often it is needed.
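
For reference, roughly how that looks per the attrs documentation (note that attrs' filter actually receives both the attribute and its value; returning False drops the field):

import attr

@attr.s
class Node:
    name = attr.ib()
    parent = attr.ib(default=None)

n = Node('leaf', parent=Node('root'))
d = attr.asdict(n, filter=lambda a, v: a.name != 'parent')
assert d == {'name': 'leaf'}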

gvanrossum commented 7 years ago

I think we can leave the filter out. This code is not rocket science; someone who needs a slight variation can easily roll their own.

ericvsmith commented 7 years ago

I agree on making this as simple as possible, and leave it to others to create more elaborate serialization schemes.

@ilevkivskyi: I'll review the patch as soon as it's ready. Thanks for all of your work and patience.

@gvanrossum: I'll open an issue about finding a common protocol with attrs.

ilevkivskyi commented 7 years ago

@ericvsmith OK, I fixed the problem and added corresponding tests.

ilevkivskyi commented 7 years ago

Oops, I made a merge conflict (and a few bugs). Now it should be resolved and ready to merge.