I found that det_hash can sometimes hash equal objects differently. This was causing me a lot of grief when trying to manually initialize a step and retrieve its cached results. I feel that the default hashing in Tango should (at least) respect __eq__ for built-in types and fall back to pickle serialization for more complex and custom data structures.
Examples
import sys

from tango.common.det_hash import det_hash

x = ["bert", "bert"]
y = ["bert", "bert "[:-1]]
assert x == y
assert det_hash(x) == det_hash(y)  # AssertionError

# Also, suppose you invoke this script as `python script.py bert bert`:
assert x == sys.argv[1:]
assert det_hash(x) == det_hash(sys.argv[1:])  # AssertionError

# I understand that this is because the object hierarchies may be different
# for `x` and `y`:
x_addr = [id(i) for i in x]
y_addr = [id(i) for i in y]
assert x_addr[0] == x_addr[1]
assert y_addr[0] == y_addr[1]  # AssertionError

# Therefore, `x` and `y` pickle differently:
import dill

assert dill.dumps(x) == dill.dumps(y)  # AssertionError
Solutions
In Python, object hashes (hash() or __hash__) are indeed expected to respect equality, i.e.
x == y implies hash(x) == hash(y)
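For built-in immutable types this contract already holds across distinct objects within a single interpreter run; a quick illustration:

```python
s = "bert "
a = "bert"
b = s[:-1]  # an equal string built at runtime (typically a distinct object)

assert a == b
assert hash(a) == hash(b)

# The same holds for immutable containers of hashables:
t1 = ("bert", 1)
t2 = (b, 1)
assert t1 == t2
assert hash(t1) == hash(t2)
```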
However, there are a few problems with relying on __hash__:
mutable data types are not supposed to implement __hash__
not all custom classes implement __hash__ consistently with __eq__ (the default hashes by identity)
hash randomization (PYTHONHASHSEED) can only be toggled when the interpreter starts, so hash() of str and bytes is not stable across runs
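All three problems can be seen directly from the interpreter; here is a small demonstration (the PYTHONHASHSEED check spawns subprocesses, since randomization can only be set at startup):

```python
import os
import subprocess
import sys

# (1) Mutable built-ins are deliberately unhashable:
try:
    hash(["bert"])
    raise AssertionError("expected TypeError")
except TypeError as e:
    print(e)  # unhashable type: 'list'

# (2) A class that doesn't override __hash__ falls back to identity-based
# hashing, so equal-by-content instances hash differently:
class Config:
    def __init__(self, name):
        self.name = name

c1, c2 = Config("bert"), Config("bert")
# both objects are alive, so their identity-based hashes differ
assert hash(c1) != hash(c2)

# (3) str/bytes hash randomization is fixed at interpreter startup via
# PYTHONHASHSEED and cannot be changed afterwards:
def hash_of_bert(seed: str) -> str:
    env = {**os.environ, "PYTHONHASHSEED": seed}
    return subprocess.run(
        [sys.executable, "-c", "print(hash('bert'))"],
        capture_output=True, text=True, env=env,
    ).stdout.strip()

assert hash_of_bert("0") == hash_of_bert("0")  # same seed: reproducible
assert hash_of_bert("1") != hash_of_bert("2")  # different seeds: (almost surely) different
```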
I propose the following implementation:
1. At hash time in Tango, we just want the object's instantaneous hash and do not actually care whether objects are immutable. So we can manually hash the mutable built-in types. Also, since mutable types may be nested, we can hash recursively.
2. We can simply fall back to pickle serialization for objects that are not hashable. I am assuming that collisions between __hash__-based and pickle-based hashes are negligible.
3. Two suggestions for making the hashing deterministic:
3.1. We could write a simple C extension that temporarily disables hash randomization during Tango's hashing function. (Note: I'm not personally familiar with CPython internals; I'm just assuming this can be changed at runtime.) Then we would get deterministic hashing for all types that implement __hash__.
3.2. Alternatively, we could write a custom function with deterministic hashing logic for all built-in Python types. This would be recursive (like rec_hash below) to handle nested data structures, but custom classes that do implement __hash__ would not be supported.
Together (with 3.1) that's:
import collections.abc
import hashlib
import io
from typing import Any

import base58

def rec_hash(o: Any) -> int:
    if isinstance(o, (str, bytes)):
        # strings are Sequences too, but must be hashed atomically
        # (otherwise we'd recurse forever on their 1-character elements)
        return hash(o)
    elif isinstance(o, collections.abc.Sequence):  # tuples, lists, ranges
        return hash(tuple(rec_hash(x) for x in o))
    elif isinstance(o, collections.abc.Set):  # sets, frozensets
        # set elements are guaranteed to be hashable
        return hash(frozenset(o))
    elif isinstance(o, collections.abc.Mapping):  # dicts
        # dict keys are guaranteed to be hashable; sort so the result
        # does not depend on insertion order
        return hash(tuple(sorted(hash((k, rec_hash(v))) for k, v in o.items())))
    elif isinstance(o, collections.abc.Hashable):
        # nested members may be un-hashable, which could raise TypeError
        return hash(o)
    raise TypeError(f"unhashable type: '{type(o).__name__}'")

def det_hash(o: Any) -> str:
    try:
        with hash_seed(0):  # TODO: need to implement this
            h = rec_hash(o)
        # mask to a non-negative int, since hash() can return negatives
        return base58.b58encode_int(h & (2**64 - 1)).decode()
    except TypeError:
        pass
    # Fall back to pickling (using Tango's existing _DetHashPickler)
    m = hashlib.blake2b()
    with io.BytesIO() as buffer:
        pickler = _DetHashPickler(buffer)
        pickler.dump(o)
        m.update(buffer.getbuffer())
    return base58.b58encode(m.digest()).decode()
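For comparison, here is a minimal sketch of what option 3.2 could look like: deterministic hashing of built-in types via blake2b, with no dependence on PYTHONHASHSEED. The name det_hash_builtin is hypothetical, and the type tags and digest framing are just one possible encoding:

```python
import hashlib
from typing import Any

def det_hash_builtin(o: Any) -> str:
    """Deterministic, PYTHONHASHSEED-independent hash for built-in types."""
    m = hashlib.blake2b()
    # Tag each value with its type so that e.g. 1, 1.0, and "1" hash differently.
    if o is None:
        m.update(b"none")
    elif isinstance(o, bool):  # check before int: bool is a subclass of int
        m.update(b"bool:" + str(o).encode())
    elif isinstance(o, (int, float, complex)):
        m.update(f"num:{o!r}".encode())
    elif isinstance(o, str):
        m.update(b"str:" + o.encode())
    elif isinstance(o, bytes):
        m.update(b"bytes:" + o)
    elif isinstance(o, (list, tuple)):
        # order-sensitive: feed in each element's fixed-length digest
        # (lists and tuples share a tag here, for simplicity)
        m.update(b"seq:")
        for x in o:
            m.update(det_hash_builtin(x).encode())
    elif isinstance(o, (set, frozenset)):
        # order-independent: sort the element digests first
        m.update(b"set:")
        for d in sorted(det_hash_builtin(x) for x in o):
            m.update(d.encode())
    elif isinstance(o, dict):
        # order-independent in the keys, like dict equality
        m.update(b"dict:")
        for kd, vd in sorted((det_hash_builtin(k), det_hash_builtin(v)) for k, v in o.items()):
            m.update((kd + vd).encode())
    else:
        raise TypeError(f"unsupported type: '{type(o).__name__}'")
    return m.hexdigest()

# The motivating examples now hash equally:
x = ["bert", "bert"]
s = "bert "
y = ["bert", s[:-1]]
assert det_hash_builtin(x) == det_hash_builtin(y)
```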
Please let me know what you think, thanks!
Versions
Python 3.9.16

ai2-tango==1.2.1
base58==2.1.1
black==23.7.0
boto3==1.28.54
botocore==1.31.54
cached-path==1.4.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
click-help-colors==0.9.2
dill==0.3.7
filelock==3.12.4
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.37
glob2==0.7
google-api-core==2.12.0
google-auth==2.23.0
google-cloud-core==2.3.3
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.60.0
huggingface-hub==0.16.4
idna==3.4
jmespath==1.0.1
markdown-it-py==3.0.0
mdurl==0.1.2
more-itertools==9.1.0
packaging==23.1
petname==2.6
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.16.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
rich==13.5.3
rjsonnet==0.5.3
rsa==4.9
s3transfer==0.6.2
six==1.16.0
smmap==5.0.1
sqlitedict==2.1.0
tqdm==4.66.1
typing_extensions==4.8.0
urllib3==1.26.16
xxhash==3.3.0