fluentpython / example-code-2e

Example code for Fluent Python, 2nd edition (O'Reilly 2022)
https://amzn.to/3J48u2J
MIT License

Hash Mixing #1

Closed MattEding closed 4 years ago

MattEding commented 4 years ago

In the 1st edition you mention that you use XOR to mix hashes for Vector2d, since the Python documentation recommended that approach at the time. This is no longer the case; the documentation for __hash__ now states:

"It is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple"

So the following code snippet should be changed from:

def __hash__(self):
    return hash(self.x) ^ hash(self.y)

To:

def __hash__(self):
    return hash((self.x, self.y))
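
Presumably part of why the docs now recommend tuples: XOR is symmetric and self-canceling, so vectors with swapped or repeated components collide systematically, while tuple hashing is order-sensitive (shown on CPython; exact hash behavior is an implementation detail):

>>> hash(1.0) ^ hash(2.0) == hash(2.0) ^ hash(1.0)  # (1, 2) collides with (2, 1)
True
>>> hash(3.0) ^ hash(3.0)  # every (x, x) hashes to 0
0
>>> hash((1.0, 2.0)) == hash((2.0, 1.0))  # tuple hashing distinguishes them
False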
ramalho commented 4 years ago

Thank you so much, @MattEding! I made the fix you suggest in the Vector2d examples in Chapter 11 ("A Pythonic Object", which was Chapter 9 in the 1st edition). I need to consider what to do with the N-dimensional vector examples in later chapters. For them, I used a generator expression and functools.reduce with operator.xor over the elements, which computes the hash lazily, so it is memory efficient:

def __hash__(self):
    hashes = (hash(x) for x in self)
    return functools.reduce(operator.xor, hashes, 0)

It seems wasteful to build a tuple just to compute the hash of the N-dimensional Vector class. A vector may have many elements, all of which would be duplicated in the tuple only to be thrown away immediately.
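
For a concrete sense of that waste, here is a quick check (a sketch; exact sizes depend on platform and CPython version):

import sys
from array import array

arr = array('d', range(10**6))   # compact storage, ~8 bytes per double
tup = tuple(arr)                 # what hash(tuple(self)) would build
# the tuple holds one pointer per element, plus a ~24-byte float object
# per element, all allocated just to be hashed and then discarded
print(sys.getsizeof(arr), sys.getsizeof(tup))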

If you have a suggestion for those cases, I'd be glad to read it. Thank you!

MattEding commented 4 years ago

Glad to have provided helpful information! Fluent Python was my first computer science book and still holds the crown for being my favorite!

Regarding the N-dimensional Vector, below are some options I came up with:

# keeping it simple
def __hash__(self):
    return hash(tuple(self))

# this is the slowest variant, but best memory usage
def __hash__(self):
    return functools.reduce(operator.xor, (hash(x) for x in self), 0)

# this is significantly faster than the two above variants
def __hash__(self):
    return hash(memoryview(self._components).tobytes())

# since hash collisions are expected anyway, how about sampling 
# the data if it is above a threshold length?
def __hash__(self):
    stride = (len(self) // self._maxhashlen) or 1
    return hash(tuple(self._components[::stride]))

# caching for future calls
def __hash__(self):
    try:
        return self._hash
    except AttributeError:
        self._hash = hash(tuple(self))
        return self._hash

The last two versions can be mixed and matched with the tuple / memoryview / xor variants; one such combination is sketched below. Of course, these gymnastics to make hash work properly could lead to a discussion of why using non-hashable objects under the hood is problematic.
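
For instance, a cached memoryview variant might look like this (a sketch; assumes self._components is an array, as in the book's Vector):

def __hash__(self):
    try:
        return self._hash
    except AttributeError:
        # hash the raw buffer instead of building a throwaway tuple
        self._hash = hash(memoryview(self._components).tobytes())
        return self._hash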

To save you from the same dead end, here is my failed attempt at hashing the underlying buffer directly:

>>> from array import array
>>> arr = array('d', range(3))
>>> mv = memoryview(arr)
>>> hash(mv)
ValueError: cannot hash writable memoryview object
>>> hash(mv.toreadonly()) # python 3.8
ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'
>>> hash(mv.cast('b').toreadonly())
TypeError: unhashable type: 'array.array'
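
In hindsight the failures make sense: hashing a memoryview also requires the object it exports from to be hashable, and array.array is not. Copying the data out to bytes sidesteps that, which is what the memoryview variant above does:

>>> hash(mv.tobytes())  # works: bytes owns an immutable copy of the data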
MattEding commented 4 years ago

OK, here is a possible implementation for the N-dimensional Vector that hashes efficiently without any compromises.

Additionally, as per your request in the Soapbox "A safe format, with Enhanced Usability", I have included a __format__ implementation that limits output. I ended up using a private method of reprlib.aRepr to avoid the headache of reinventing the wheel. It looks unchanged across all version branches on CPython's GitHub, so I assume it is safe (enough) to use.
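
For reference, the public reprlib API already does this kind of truncation; the private _repr_iterable method just generalizes it with configurable brackets and length:

>>> import reprlib
>>> reprlib.repr(list(range(100)))  # aRepr.maxlist defaults to 6
'[0, 1, 2, 3, 4, 5, ...]'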

As a side note: math.hypot(*self) might look like a worse choice than math.sqrt(sum(x * x for x in self)), since it unpacks the components into a tuple instead of evaluating lazily, but empirically it is faster, even at large Vector sizes.
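
A rough way to check (a sketch; timings vary by machine, and math.hypot accepts more than two arguments only on Python 3.8+):

import math
import timeit

xs = tuple(float(i) for i in range(10_000))
t_hypot = timeit.timeit(lambda: math.hypot(*xs), number=1_000)
t_sqrt = timeit.timeit(lambda: math.sqrt(sum(x * x for x in xs)), number=1_000)
print(f'hypot: {t_hypot:.3f}s  sqrt+sum: {t_sqrt:.3f}s')

And now the implementation itself: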

from array import array
from reprlib import aRepr
import itertools
import math
import numbers
import re
import sys

class Vector:
    typecode = 'd'

    def __init__(self, components):
        self._bytes = array(self.typecode, components).tobytes()
        self._memv = memoryview(self._bytes).cast(self.typecode)

    def __iter__(self):
        return iter(self._memv)

    def __repr__(self):
        components = aRepr.repr_list(self, aRepr.maxlevel)
        # avoid hardcoded name
        class_name = type(self).__name__
        # f-string 3.6
        return f'{class_name}({components})'

    def __str__(self):
        return str(tuple(self))

    def __bytes__(self):
        return bytes([ord(self.typecode)]) + self._bytes

    def __eq__(self, other):
        return (len(self) == len(other) and
                all(a == b for a, b in zip(self, other)))

    def __hash__(self):
        # chose this over `hash(bytes(self))` since __bytes__
        # is very slow with concatenation
        return hash((self.typecode, self._bytes))

    def __abs__(self):
        # math.hypot 3.8
        return math.hypot(*self)

    def __bool__(self):
        return bool(abs(self))

    def __len__(self):
        return len(self._bytes) // array(self.typecode).itemsize

    def __getitem__(self, index):
        cls = type(self)
        if isinstance(index, slice):
            return cls(self._memv[index])
        elif isinstance(index, numbers.Integral):
            return self._memv[index]
        else:
            # f-string 3.6
            msg = f'{cls.__name__} indices must be integers'
            raise TypeError(msg)

    shortcut_names = 'xyzt'

    def __getattr__(self, name):
        cls = type(self)
        if len(name) == 1:
            pos = cls.shortcut_names.find(name)
            # delegate to self.__len__
            if 0 <= pos < len(self):
                # delegate to self.__getitem__
                return self[pos]
        # f-string 3.6
        msg = f'{cls.__name__!r} object has no attribute {name!r}'
        raise AttributeError(msg)

    def angle(self, n):
        # math.hypot 3.8
        r = math.hypot(*self[n:])
        a = math.atan2(r, self[n-1])
        if (n == len(self) - 1) and (self[-1] < 0):
            return math.pi * 2 - a
        else:
            return a

    def angles(self):
        return (self.angle(n) for n in range(1, len(self)))

    fmt_default_limit = 30

    # `format` ignores default values for `fmt_spec`
    # "The default format_spec is an empty string"
    # https://docs.python.org/3/library/functions.html#format
    def __format__(self, fmt_spec):
        endwith_star_digits = r'(?P<polar>h)?(?P<length>\*|\d*)$'
        regex = re.compile(endwith_star_digits)
        match = regex.search(fmt_spec)

        length = match.group('length') or self.fmt_default_limit
        maxiter = sys.maxsize if length == '*' else int(length)

        if match.group('polar'):
            left, right = '<>'
            it = itertools.chain([abs(self)], self.angles())
            it = itertools.islice(it, maxiter + 1)

            class SizedIteratorClosure:
                def __len__(_):
                    return len(self)

                def __iter__(_):
                    return it

            # can do `coords = tuple(it)` if copying data is acceptable
            coords = SizedIteratorClosure()
        else:
            left, right = '()'
            coords = self
        # a protected method, but "Python is a language for consenting adults."
        return aRepr._repr_iterable(coords, aRepr.maxlevel, left, right, maxiter)

    @classmethod
    def frombytes(cls, octets):
        typecode = chr(octets[0])
        memv = memoryview(octets[1:]).cast(typecode)
        return cls(memv)
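
For completeness, here is how the format spec above is meant to be read (a sketch inferred from the regex; outputs elided, since they depend on aRepr's settings):

v = Vector(range(100))
hash(v) == hash(Vector(range(100)))  # equal vectors hash equal
format(v)         # Cartesian '(...)', truncated after fmt_default_limit (30) items
format(v, '5')    # only the first 5 components
format(v, '*')    # no truncation
format(v, 'h3')   # polar '<magnitude, angle, ...>', truncated after 3 items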

To compare the memory usage of the bytes and memoryview approach against an array or other sequences:

def compare_sizeof(n, typecode='d'):
    arr = array(typecode, range(n))
    mv = memoryview(arr)
    byt = bytes(mv)
    tpl = tuple(mv)
    lst = list(mv)
    return {type(obj).__name__: sys.getsizeof(obj) for obj in [arr, mv, byt, tpl, lst]}

As Vector's size increases, this implementation actually uses less memory than the original.
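
For instance (no numbers quoted here, since byte counts vary by platform and Python version):

# container sizes for the same 10,000 doubles. Note that getsizeof counts
# only the container itself, so the tuple and list footprints are understated:
# each element is also a separate float object (~24 bytes in CPython).
print(compare_sizeof(10_000))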