biqqles / dataclassy

A fast and flexible reimplementation of data classes
https://pypi.org/project/dataclassy
Mozilla Public License 2.0
81 stars 9 forks source link

Internals are not considered in __eq__ #53

Closed thisisrandy closed 3 years ago

thisisrandy commented 3 years ago

Per the docs (emphasis mine),

If true (the default), generate an eq method that compares this data class to another of the same type as if they were tuples created by as_tuple.

However, it appears that internals are ignored, just as they are in __repr__ when hide_internals=True:

In [1]: from dataclassy import dataclass, as_tuple

In [2]: from typing import Optional

In [3]: @dataclass
   ...: class Foo():
   ...:     _private: Optional[int] = None
   ...: 

In [4]: Foo() == Foo(5)
Out[4]: True

In [5]: as_tuple(Foo()) == as_tuple(Foo(5))
Out[5]: False

In [6]: @dataclass
   ...: class Bar():
   ...:     public: Optional[int] = None
   ...: 

In [7]: Bar() == Bar(5)
Out[7]: False

This may have been intentional, but it is also surprising, given that

In [1]: from dataclasses import dataclass

In [2]: from typing import Optional

In [3]: @dataclass
   ...: class Foo():
   ...:     _private: Optional[int] = None
   ...: 

In [4]: Foo() == Foo(5)
Out[4]: False

and especially given that I can do this:

In [1]: from dataclassy import dataclass

In [2]: from typing import Optional

In [3]: @dataclass(hide_internals=False)
   ...: class Baz():
   ...:     _private: Optional[int] = None
   ...: 

In [4]: Baz() == Baz(5)
Out[4]: True

In [5]: repr(Baz()) == repr(Baz(5))
Out[5]: False

which just feels inconsistent to me (__repr__ may choose to hide some internal state, so (a == b) == False but (repr(a) == repr(b)) == True seems fine, but not the inverse, as above).

As for the why, it appears that at dataclass.py:147, inside DataClassMeta.__init__, you decline to pass internals to fields, so the default of False is used when creating the expression for __tuple__, which is of course what is used inside of __eq__.

        tuple_expr = ', '.join((*(f'self.{f}' for f in fields(cls)), ''))  # '' ensures closing comma
        cls.__tuple__ = property(eval(f'lambda self: ({tuple_expr})'))
def fields(dataclass: Type[DataClass], internals=False) -> Dict[str, Type]:
def __eq__(self: DataClass, other: DataClass):
    return type(self) is type(other) and self.__tuple__ == other.__tuple__

However, as_tuple has no concept of internals and in general functions quite differently.

I see two ways to fix this:

  1. Pass internals=False to fields at dataclass.py:147, or make it configurable with a decorator param, something like consider_internals_eq=True. I think this should be True by default so as to not be obscurely inconsistent with dataclasses. I don't think this should be tied to hide_internals, since one may not want internals returned from __repr__ but still want them considered in __eq__.
  2. Use as_tuple in __eq__. This seems right because it unifies two things that seem intuitively the same. However, your comments mention "efficient representation for internal methods" re: __tuple__. I'm not sure offhand how much more efficient evaluating a static expression is v. recursing through a structure is, but if the difference is significant, maybe unification isn't the way to go.

Please let me know which you prefer and why.

biqqles commented 3 years ago

I think the problem is simply that I forgot to set internals to True when generating __tuple__. Is there a situation you would ever not want internal fields to be included in comparisons?

For your question about the performance penalty of as_tuple, I've never measured it, but I'm sure it's significant. If you look at _recurse_structure, it does a lot of work in unpacking and copying arbitrarily nested structures, most of which is irrelevant or actually harmful for the purposes of comparison. The analogy of as_tuple to the comparison methods I guess is technically wrong, but I thought it was accurate enough, if I had remembered to set fields=True.

thisisrandy commented 3 years ago

I think the problem is simply that I forgot to set internals to True when generating tuple. Is there a situation you would ever not want internal fields to be included in comparisons?

I've definitely written classes where I keep e.g. derived state or things like timestamps that I might want to ignore vis-a-vis equality in internal fields, but I'm not sure that any of them were also dataclasses, i.e. fit the model (from PEP 557) of "classes which exist primarily to store values which are accessible by attribute lookup."

In lieu of a specific use case, I can see how making this configurable might be viewed as just cluttering the API. It sounds like this qualifies as a bug from your perspective, so I'd be just as happy to see it fixed by always passing internals=True to fields. However, I'll caution, as I did in #54, that it would be a breaking change, and that, intentionally or not, one or your tests actually relied on the bug. If you're confident that none of your users are relying on it (probably unknowable, I realize, but confident enough, in any event), then I'm happy to make the change and fix the test.

biqqles commented 3 years ago

intentionally or not, one or your tests actually relied on the bug

Where does this happen? I see you applied the new option to Epsilon so I assume it's to do with that.

thisisrandy commented 3 years ago
        @dataclass  # a complex (nested) class
        class Epsilon:
            g: Tuple[NT]
            h: List['Epsilon']
            _i: int = 0
        self.e = self.Epsilon((self.NT(1, 2, 3)), [self.Epsilon(4, 5, 6)])
    def test_values(self):
        """Test values()."""
        self.assertEqual(values(self.e), dict(g=self.NT(1, 2, 3), h=[self.Epsilon(4, 5)]))
        self.assertEqual(values(self.e, True), dict(g=self.NT(1, 2, 3), h=[self.Epsilon(4, 5)], _i=0))

Once Epsilon._i is considered, both assertions break.

biqqles commented 3 years ago

Ah OK, I see what you mean. This may be a stupid idea, but I kind of wonder if the comparison behaviour should be tied to hide_internals. When running that test I was confused for a minute because the reprs of those two objects are identical, even though they compare differently. "Forcing" the user to render visible the reason for two objects being different might be a good idea. As an added bonus, hide_internals is True by default, which would match the current behaviour, making it less of a breaking change.

thisisrandy commented 3 years ago

I originally said I thought they should not be tied together, but after getting briefly confused about why the test was failing earlier (because I initially missed the _i=6 in the nested Epsilon and it wasn't showing up in the repr), I'm warming up to the idea.

However, I don't think "hide" is the right verb for the semantics of not using internals in comparison. How about deprecating hide_internals and aliasing it to use_internals?

As an added bonus, hide_internals is True by default, which would match the current behavior, making it less of a breaking change.

One partial definition of "breaking change," in my view, is any API change that isn't strictly additive. Even if the default value of hide_internals doesn't change, the proposition here is changing its semantics, which, unfortunately, still qualifies as breaking :disappointed:. If we believe that internals should be considered in comparison by default, regardless of how that's configured, there's no way to avoid it breaking the one person out there who wrote assert Episilon(4, 5, 6) == Epsilon(4, 5) at the top of their munge_greeks method

biqqles commented 3 years ago

However, I don't think "hide" is the right verb for the semantics of not using internals in comparison.

Surprisingly, "hide" makes perfect sense to me here, in the sense of "to obscure from the outside world".

Maybe it is still not absolutely clear, but I can't see any verb being so brilliant that it needs no explanation. I dislike use as it's too vague.

One partial definition of "breaking change," in my view, is any API change that isn't strictly additive.

Yes, I was very loose with my usage of my term, what I really meant was "reducing the amount of code out there actually broken by this change." If we linked comparison etc. to hide_internals being false, it would only affect people who set it to true and had internal fields and didn't want them involved in the comparison. Which is still possibly some people but not as many as if we were to change the default behaviour (and this would really be a pain to diagnose if you don't follow dataclassy's release notes).

I think this should be True by default so as to not be obscurely inconsistent with dataclasses.

I actually thought of at some point changing hide_internals to be False by default for this reason, but this would be a separate breaking change.

thisisrandy commented 3 years ago

Surprisingly, "hide" makes perfect sense to me here, in the sense of "to obscure from the outside world".

I don't want to get into a capricious semantics argument here, but (ha) hidden or obscured things can still very much and often do exert effects. That said, I agree that "use" is also insufficient, and this is ultimately your package, so as long as the docs are explicit, I'm happy with whatever you choose.

As an added bonus, hide_internals is True by default, which would match the current behaviour, making it less of a breaking change.

I think I may have misinterpreted this. Were you saying that, as hide_internals is true by default, tying it to __tuple__ generation doesn't change the default behavior? You would of course be right, though that conflicts with the idea of considering internals in comparison by default, which is probably whence my confusion.

To sum up, I'm fine with tying this behavior to an existing config variable, provided its effects are well documented. I would like for it to compare internals by default, because that follows the behavior of the model (dataclasses), and because not comparing internals seems unlikely to have been a relied upon feature, but I understand if you want to introduce it as a non-breaking change.

biqqles commented 3 years ago

Were you saying that, as hide_internals is true by default, tying it to __tuple__ generation doesn't change the default behavior? You would of course be right, though that conflicts with the idea of considering internals in comparison by default, which is probably whence my confusion.

Yes. My rough plan based on this discussion is

1) Tie hide_internals to __tuple__ in v0.11 2) Likely in v0.12, change hide_internals to be False by default

I think breaking it down like this is useful because it's easier for someone to predict how the behaviour of their code will change.

thisisrandy commented 3 years ago

Sounds great, thanks. I made a PR for your convenience