ARM-software / devlib

Library for interaction with and instrumentation of remote devices.
Apache License 2.0

Surprising memoized behavior for mutable containers #341

Open douglas-raillard-arm opened 5 years ago

douglas-raillard-arm commented 5 years ago

The following piece of code will only print once, although foo is called with arguments that would not compare equal:

from devlib.utils.misc import memoized

@memoized
def foo(l):
    print('hello')

l = []
l.append(3)

foo(l)

l.pop()
l.append(4)

foo(l)

That will also only print once, although C and D are wildly different classes:

class C:
    def __init__(self):
        self.val1 = 42

    def foo(self):
        pass

class D:
    def __init__(self):
        self.val2 = list(range(0, 100))

    def bar(self):
        pass

foo(C())
foo(D())

The root cause is __get_memo_id, which only checks the first 32 bytes of the underlying representation of the PyObject, in addition to the id() of the object. IDs can be reallocated, and the same object will not always compare equal to a past state of itself (generally true for any mutable object). Similar issues will arise with classes implementing __eq__ without implementing __hash__, or with naive but nonetheless valid implementations like:

def __hash__(self): return 42

This is allowed by:

The only required property is that objects which compare equal have the same hash value https://docs.python.org/3/reference/datamodel.html#object.__hash__
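Such a constant __hash__ really is legal, and dict/set lookups still work correctly with it (just degraded, since every key lands in the same bucket and equality must disambiguate). A small illustration, using a hypothetical class K:

```python
class K:
    def __init__(self, v):
        self.v = v

    def __eq__(self, other):
        return isinstance(other, K) and self.v == other.v

    def __hash__(self):
        # Legal: the only requirement is that equal objects hash equal
        return 42

d = {K(1): 'a', K(2): 'b'}
```

Lookups like d[K(1)] still succeed, because equality distinguishes the keys within the shared bucket.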

Generally speaking, that issue cannot be solved, apart from only using memoized with hashable types (which rules out lists). A potential workaround would be to use the repr() of the object, hoping that it will reflect the full state of the object.
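A repr()-based key might look like this minimal sketch (repr_key is a hypothetical name; the approach is only sound when repr() captures the full state of every argument):

```python
def repr_key(*args, **kwargs):
    # Hypothetical cache key built from repr(); only valid when repr()
    # reflects the full state of each argument
    return repr(args) + repr(sorted(kwargs.items()))
```

Note the failure mode: default object reprs like `<C object at 0x...>` embed the address, so they inherit the very id-reuse problem this is meant to avoid.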

Another workaround would be to keep a reference to the parameters, so they are never destroyed, ensuring that objects which are not the same get a different ID. This would make memoized relatively useless for types like lists, where calls like foo([1,2]) would no longer be memoized.
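A minimal sketch of that keep-a-reference, identity-keyed variant (hypothetical code, not devlib's implementation) shows both effects: repeated calls with the same object hit the cache, while an equal but freshly built list misses it:

```python
import functools

def memoized_by_id(f):
    # Hypothetical identity-keyed memoizer: keys on id(), and pins the
    # arguments alive so their ids can never be reallocated
    cache = {}
    refs = []

    @functools.wraps(f)
    def wrapper(*args):
        key = tuple(id(a) for a in args)
        if key not in cache:
            refs.append(args)  # keep arguments alive for the process lifetime
            cache[key] = f(*args)
        return cache[key]
    return wrapper

calls = []

@memoized_by_id
def foo(l):
    calls.append(1)
    return len(l)

l = [1, 2]
foo(l)
foo(l)       # cache hit: same object, same id
foo([1, 2])  # cache miss: equal but distinct list, different id
```

This is functionally correct, but as noted it trades memory (every argument is kept forever) for that correctness.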

Note: Python 3 functools.lru_cache avoids that issue by not supporting non-hashable types. https://docs.python.org/3/library/functools.html#functools.lru_cache
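For comparison, lru_cache simply refuses unhashable arguments up front, which is how it sidesteps the whole problem:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def foo(l):
    return len(l)

foo((1, 2))      # fine: tuples are hashable
try:
    foo([1, 2])  # lists are not hashable
    raised = False
except TypeError:
    raised = True
```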

douglas-raillard-arm commented 5 years ago

The case where completely different objects are assumed to be the same could be improved by at least looking at how many bytes can be expected to differ:

import sys

# Empty class, with instances as small as they can usually be
class C:
    pass

c = C()

# Returns 56 on my 3.7 implementation, which is already larger than the
# 32 bytes used by __get_memo_id
sys.getsizeof(c)
setrofim commented 5 years ago

Yes, this is a known issue (read: "expected behaviour"). The reason the first 32 bytes are taken as part of __get_memo_id is to deal with the ID reallocation problem. Mutable objects are another problem altogether, as in that case it is actually the same object, rather than the ID being reused.

The 32 bytes value was chosen arbitrarily as a "reasonable" trade-off between map key size and likelihood of collisions. Again, the goal was detecting ID reuse, not changes to mutable objects.
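For illustration, an id-plus-leading-bytes key of this kind could be sketched with ctypes (a hypothetical reconstruction for CPython, where id() is the object's address; this is not the actual __get_memo_id):

```python
import ctypes

def memo_key(obj, nbytes=32):
    # Hypothetical id-plus-prefix key: the id alone becomes ambiguous once
    # an object dies and its address is reused, so the first bytes of the
    # object's in-memory representation are mixed in as a tie-breaker.
    size = min(nbytes, obj.__sizeof__())
    prefix = ctypes.string_at(id(obj), size)  # CPython: id() is the address
    return (id(obj), prefix)

a, b = [1, 2], [1, 2]
```

Two simultaneously live objects always have distinct ids, so the byte prefix only matters across reallocations.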

Generally speaking, that issue cannot be solved, apart from only using memoized with hashable types (which rules out lists).

Yes, however that will make memoized unusable in a lot of cases, and will require calling code to be a lot more awkward. This was considered, but was quickly rejected as impractical.

A potential workaround would be to use the repr() of the object, hoping that it will reflect the full state of the object.

This is how it was previously implemented (though you would again have to truncate at some point to avoid potentially keeping a large number of large strings perpetually in memory). In practice, repr() actually resulted in more collisions.

Another workaround would be to keep a reference to the parameters, so they are never destroyed, ensuring that objects which are not the same get a different ID.

As you say, that would make memoized a lot less useful, and again may result in having to keep a lot of large objects perpetually in memory.

There are only three possible actual solutions to this problem:

1. only allow hashable types as arguments for memoized functions
2. maintain a reference to every argument indefinitely
3. hash the memory of each non-hashable argument on every invocation
4. (bonus) modify Python itself to keep track of when a mutable object has been mutated past an arbitrary point, e.g. by adding an event/flag to PyObject.

None of these are workable, for reasons that are either mentioned above or are, hopefully, self-evident. Which means we're left with some non-perfect workaround. The current implementation of __get_memo_id is the best one I could come up with thus far, but I very much welcome suggestions for an improved version (or a different solution entirely).

We can, however, at least update the documentation for memoized to make this limitation more obvious.

douglas-raillard-arm commented 5 years ago

The 32 bytes value was chosen arbitrarily as a "reasonable" trade-off between map key size and likelihood of collisions. Again, the goal was detecting ID reuse, not changes to mutable objects.

As proposed in the 2nd comment, 32 bytes does not seem reasonable, since those bytes can be identical for objects that have nothing in common, as shown in my second example. This is probably largely due to the common fields at the beginning of most objects (at least 16 bytes for PyObject, AFAICS in [1]). A better heuristic could, as in my previous comment, use sys.getsizeof on an "empty" instance of a user-defined class:

>>> import sys
>>> sys.getsizeof(object()) # That seems to match CPython's PyObject definition 
16
>>> class C: pass
... 
>>> sys.getsizeof(C())
56

[1] https://github.com/python/cpython/blob/master/Include/object.h#L109

Out of all the solutions, maintaining a reference to the objects and indexing by ID would always yield a functionally correct result. Most use cases of memoized only deal with basic things like ints or small lists, so it would not waste that much memory, and would avoid the current hacks. The memory consumption issue is sidestepped in functools.lru_cache by caching a limited number of results by default. A large number like 500 would be enough to memoize 99% of the time, while avoiding out-of-memory issues for long-running processes. This could be documented along with the fact that memoized does not use an equality check but an identity check, which may result in "less memoization" than expected.

Yes, however that will make memoized unusable in a lot of cases, and will require calling code to be a lot more awkward. This was considered, but was quickly rejected as impractical.

One could argue that although using the ID allows memoized to be used with, for example, lists, it does not mean it will actually be useful. A call like foo([1,2]) will not be memoized if somebody makes the same call again, since the new list will probably get a different ID. AFAIK, the main non-hashable types are a few types like list and dict; "big" objects like large custom classes are hashable by default, so memoized would be usable with them anyway. This means using the ID only brings the ability to memoize calls where the parameters are likely to get new IDs every time (list comprehensions, literal list values, etc.). It is also not functionally correct for the very types it was introduced for: it is highly unlikely that the memoized function does not actually depend on the values inside the list, so if the list it was called on is modified, the function should be called again (this can lead to quite nasty bugs otherwise). At the same time, it prevents memoization in many useful cases, like foo(257) or foo((1, 2)):

# CPython has a pool of preallocated objects for integers <= 256, but
# beyond that it creates independent instances
def f1():
    return 257

def f2():
    return 257

# also works with plain a = 257; b = 257 in an interactive interpreter
assert f1() is not f2()

To summarize, the following behavior would be functionally correct while allowing flexible calls:

(bonus) Getting the "generalized ID" can be somehow achieved by calling recursively id() on objects returned by gc.get_referents(), as well as their type, and hashing all the values into a single ID number :-)
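That recursive scheme could be sketched as follows (generalized_id is a hypothetical name; this is an illustration of the idea, not a proposed patch):

```python
import gc

def generalized_id(obj, _seen=None):
    # Hypothetical "generalized ID": fold the id and type of obj, and of
    # everything it transitively references, into one number. A seen-set
    # cuts reference cycles.
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))
    acc = hash((id(obj), type(obj)))
    for ref in gc.get_referents(obj):
        acc = hash((acc, generalized_id(ref, _seen)))
    return acc

l = [1, 2]
before = generalized_id(l)
l[0] = 3  # replacing a referent changes the generalized ID...
after = generalized_id(l)
```

Unlike a plain id(), this changes when a contained object is replaced, though it still cannot see in-place mutation of a referent's own memory.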

douglas-raillard-arm commented 5 years ago

EDIT: The key should probably also include the argument types, unless we are confident that foo(1.0) will return the same result as foo(1). That comes from the fact that 1 == 1.0.
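A hypothetical type-aware key (typed_key below is an illustrative name) makes that distinction, since 1 == 1.0 and even hash(1) == hash(1.0) in CPython:

```python
def typed_key(*args):
    # Hypothetical cache key that pairs each argument with its type,
    # so foo(1) and foo(1.0) no longer collide
    return tuple((type(a), a) for a in args)
```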

setrofim commented 5 years ago

Only memoize calls involving hashable objects. [...] If that highly matters to the caller, hashable types like tuples() should be used where possible.

This is what I meant when I said code gets "awkward". Our code actually has a lot of instances where we have lists that aren't mutated, or aren't mutated after a certain point (pretty much all ordered sequences in WA configuration processing are lists, and once the config gets finalized they typically don't get touched). So the "edge" case where a memoized function gets called with the same non-hashable object which could have, but actually hasn't, changed is quite common. There are a number of reasons why lists are used instead of tuples, and simply converting everything to tuples isn't really an option. Doing the conversion just before calling a memoized function is in theory possible, but would be difficult to track (in that case, raising an error à la lru_cache would almost be preferable).

(bonus) Getting the "generalized ID" can be somehow achieved by calling recursively id() on objects returned by gc.get_referents(), as well as their type, and hashing all the values into a single ID number :-)

ooh, I do like that idea. It's worth playing around with.