Thank you so much, @MattEding! I made the fix you suggested in the Vector2d examples in Chapter 11 ("A Pythonic Object"; it was Chapter 9 in the 1st edition). I need to consider what to do with the N-dimensional Vector examples in later chapters. For them, I used a generator expression and functools.reduce with operator.xor on the elements, which computes the hash lazily, so it is memory efficient:
def __hash__(self):
    hashes = (hash(x) for x in self)
    return functools.reduce(operator.xor, hashes, 0)
It seems wasteful to build a tuple to compute the hash for the N-dimensional Vector class. It may have many elements, which will be duplicated in the tuple, only to be immediately thrown away.
If you have a suggestion for those cases, I'd be glad to read it. Thank you!
Glad to have provided helpful information! Fluent Python was my first computer science book and still holds the crown for being my favorite!
Regarding the N-dimensional Vector, below are some options I came up with:
# keeping it simple
def __hash__(self):
    return hash(tuple(self))

# this is the slowest variant, but best memory usage
def __hash__(self):
    return functools.reduce(operator.xor, (hash(x) for x in self), 0)

# this is significantly faster than the two above variants
def __hash__(self):
    return hash(memoryview(self._components).tobytes())

# since hash collisions are expected anyway, how about sampling
# the data if it is above a threshold length?
def __hash__(self):
    stride = (len(self) // self._maxhashlen) or 1
    return hash(tuple(self._components[::stride]))

# caching for future calls
def __hash__(self):
    try:
        return self._hash
    except AttributeError:
        self._hash = hash(tuple(self))
        return self._hash
Of course the last two versions can be mixed and matched with the tuple / memoryview / xor variants. These gymnastics to get hash to work properly could also lead to a discussion about why using non-hashable objects under the hood is problematic.
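For instance, here is a sketch of the caching variant combined with the lazy XOR variant (assuming functools and operator are imported as in the book's example):

def __hash__(self):
    # compute lazily with XOR on the first call, then cache the result
    try:
        return self._hash
    except AttributeError:
        hashes = (hash(x) for x in self)
        self._hash = functools.reduce(operator.xor, hashes, 0)
        return self._hash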
To keep you from hitting the same dead end, here is my failed attempt at using the underlying buffer:
>>> from array import array
>>> arr = array('d', range(3))
>>> mv = memoryview(arr)
>>> hash(mv)
ValueError: cannot hash writable memoryview object
>>> hash(mv.toreadonly()) # python 3.8
ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'
>>> hash(mv.cast('b').toreadonly())
TypeError: unhashable type: 'array.array'
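For contrast, copying into bytes does work, since bytes owns an immutable buffer and is hashable; that copy is essentially what the implementation below hashes (continuing the same session):

>>> hash(arr.tobytes())  # works: bytes is immutable and hashable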
OK, so here is a possible implementation for the N-dimensional Vector that can hash effectively without any compromises.
Additionally, as per your request in the Soapbox "A safe format, with Enhanced Usability", I have included an implementation that limits output. I ended up using a private method of reprlib.aRepr to avoid the headaches of reinventing the wheel. It looks like it has been unchanged in all version branches on CPython's GitHub, so I assume it is safe (enough) to use.
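For anyone unfamiliar with reprlib: aRepr is a module-level Repr instance, and its public repr_list method already does the truncation. For example:

>>> from reprlib import aRepr
>>> aRepr.repr_list(list(range(100)), aRepr.maxlevel)
'[0, 1, 2, 3, 4, 5, ...]'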
As a side note: math.hypot(*self) might seem a worse choice than math.sqrt(sum(x * x for x in self)), since it unpacks the components into a tuple rather than evaluating them lazily. But empirical evidence from trying large Vector sizes shows it is actually faster.
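A rough way to check that claim on your own machine (a sketch; math.hypot only accepts more than two arguments since Python 3.8, and timings vary by interpreter and input size):

import math
import timeit

components = list(range(10_000))
t_hypot = timeit.timeit(lambda: math.hypot(*components), number=1_000)
t_sqrt = timeit.timeit(
    lambda: math.sqrt(sum(x * x for x in components)), number=1_000)
print(f'hypot: {t_hypot:.3f}s  sqrt+sum: {t_sqrt:.3f}s')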
from array import array
from reprlib import aRepr
import itertools
import math
import numbers
import re
import sys


class Vector:
    typecode = 'd'

    def __init__(self, components):
        self._bytes = array(self.typecode, components).tobytes()
        self._memv = memoryview(self._bytes).cast(self.typecode)

    def __iter__(self):
        return iter(self._memv)

    def __repr__(self):
        components = aRepr.repr_list(self, aRepr.maxlevel)
        # avoid hardcoded name
        class_name = type(self).__name__
        # f-string 3.6
        return f'{class_name}({components})'

    def __str__(self):
        return str(tuple(self))

    def __bytes__(self):
        return bytes([ord(self.typecode)]) + self._bytes

    def __eq__(self, other):
        return (len(self) == len(other) and
                all(a == b for a, b in zip(self, other)))

    def __hash__(self):
        # chose this over `hash(bytes(self))` since __bytes__
        # is very slow with concatenation
        return hash((self.typecode, self._bytes))

    def __abs__(self):
        # math.hypot 3.8
        return math.hypot(*self)

    def __bool__(self):
        return bool(abs(self))

    def __len__(self):
        return len(self._bytes) // array(self.typecode).itemsize

    def __getitem__(self, index):
        cls = type(self)
        if isinstance(index, slice):
            return cls(self._memv[index])
        elif isinstance(index, numbers.Integral):
            return self._memv[index]
        else:
            # f-string 3.6
            msg = f'{cls.__name__} indices must be integers'
            raise TypeError(msg)

    shortcut_names = 'xyzt'

    def __getattr__(self, name):
        cls = type(self)
        if len(name) == 1:
            pos = cls.shortcut_names.find(name)
            # delegate to self.__len__
            if 0 <= pos < len(self):
                # delegate to self.__getitem__
                return self[pos]
        # f-string 3.6
        msg = f'{cls.__name__!r} object has no attribute {name!r}'
        raise AttributeError(msg)

    def angle(self, n):
        # math.hypot 3.8
        r = math.hypot(*self[n:])
        a = math.atan2(r, self[n-1])
        if (n == len(self) - 1) and (self[-1] < 0):
            return math.pi * 2 - a
        else:
            return a

    def angles(self):
        return (self.angle(n) for n in range(1, len(self)))

    fmt_default_limit = 30

    # `format` ignores default values for `fmt_spec`
    # "The default format_spec is an empty string"
    # https://docs.python.org/3/library/functions.html#format
    def __format__(self, fmt_spec):
        endswith_star_digits = r'(?P<polar>h)?(?P<length>\*|\d*)$'
        regex = re.compile(endswith_star_digits)
        match = regex.search(fmt_spec)
        length = match.group('length') or self.fmt_default_limit
        maxiter = sys.maxsize if length == '*' else int(length)
        if match.group('polar'):
            left, right = '<>'
            it = itertools.chain([abs(self)], self.angles())
            # islice's stop must not exceed sys.maxsize, so pass None for '*'
            stop = None if maxiter == sys.maxsize else maxiter + 1
            it = itertools.islice(it, stop)

            class SizedIteratorClosure:
                def __len__(_):
                    return len(self)

                def __iter__(_):
                    return it

            # can do `coords = tuple(it)` if copying data is acceptable
            coords = SizedIteratorClosure()
        else:
            left, right = '()'
            coords = self
        # a protected method, but "Python is a language for consenting adults."
        return aRepr._repr_iterable(coords, aRepr.maxlevel, left, right, maxiter)

    @classmethod
    def frombytes(cls, octets):
        typecode = chr(octets[0])
        memv = memoryview(octets[1:]).cast(typecode)
        return cls(memv)
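A quick smoke test of the class as defined above (hash values vary between runs, so only equality is checked):

>>> v = Vector(range(100))
>>> repr(v)
'Vector([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, ...])'
>>> v.x, v[1]
(0.0, 1.0)
>>> format(v, '2')
'(0.0, 1.0, ...)'
>>> v[:2] == Vector([0.0, 1.0])
True
>>> hash(v) == hash(Vector(range(100)))
True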
To compare the memory usage of using bytes and memoryview instead of an array or other sequences:
def compare_sizeof(n, typecode='d'):
    arr = array(typecode, range(n))
    mv = memoryview(arr)
    byt = bytes(mv)
    tpl = tuple(mv)
    lst = list(mv)
    return {type(obj).__name__: sys.getsizeof(obj)
            for obj in [arr, mv, byt, tpl, lst]}
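For instance (exact figures depend on platform; also note that sys.getsizeof is shallow, so the tuple and list totals do not even count the float objects they point to):

sizes = compare_sizeof(1_000_000)
# on a 64-bit CPython: array and bytes store ~8 bytes per element,
# while tuple and list store an 8-byte pointer per element *plus*
# a separate float object each, which getsizeof does not include
print(sizes)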
As Vector's size increases, this implementation actually uses less memory than the original.
In the 1st edition you mention that you use XOR to mix hashes for Vector2d since the Python documentation recommended that approach at the time. This is no longer the case; the __hash__ entry in the data model docs now advises mixing the hash values of the components "by packing them into a tuple and hashing the tuple". So the Vector2d hash snippet should be changed from the XOR mix to a tuple hash.
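Assuming the snippet in question matches the 1st edition's Vector2d example, that change would be from:

def __hash__(self):
    return hash(self.x) ^ hash(self.y)

to:

def __hash__(self):
    return hash((self.x, self.y))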