allenai / tango

Organize your experiments into discrete steps that can be cached and reused throughout the lifetime of your research project.
https://ai2-tango.readthedocs.io/
Apache License 2.0

`det_hash` is not consistent with `__eq__` #602

Open apoorvkh opened 1 year ago

apoorvkh commented 1 year ago

🐛 Describe the bug

Problem

I found that det_hash can sometimes hash equal objects differently. This was causing me a lot of grief when trying to manually initialize a step and retrieve its cached results. I feel that the default hashing in Tango should (at least) respect __eq__ for built-in types and fall back to Pickle serialization for more complex and custom data structures.

Examples

import sys

from tango.common.det_hash import det_hash

x = ["bert", "bert"]
y = ["bert", "bert "[:-1]]
assert x == y
assert det_hash(x) == det_hash(y)  # AssertionError

# Also, suppose you invoke this script as `python script.py bert bert`
assert x == sys.argv[1:]
assert det_hash(x) == det_hash(sys.argv[1:])  # AssertionError

# I understand that this is because the object hierarchies may be different for `x` and `y`.

x_addr = [id(i) for i in x]
y_addr = [id(i) for i in y]
assert x_addr[0] == x_addr[1]
assert y_addr[0] == y_addr[1]  # AssertionError

# Therefore, `x` and `y` pickle differently.

import dill
assert dill.dumps(x) == dill.dumps(y)  # AssertionError

Solutions

In Python, object hashes (hash() / __hash__) are indeed expected to respect equality, i.e.

x == y implies hash(x) == hash(y)
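For example, the built-in hash already satisfies this for the strings from the first example, as long as we stay inside a single interpreter session:

x = "bert"
y = "bert "[:-1]
assert x == y
assert hash(x) == hash(y)  # equal strings always hash equally within one process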

However, there are a few problems with using __hash__:

  1. mutable data types are not supposed to implement __hash__

  2. not all classes will implement __hash__

  3. hash randomization for str and bytes (PYTHONHASHSEED) can only be set when the interpreter starts, not toggled at runtime (see the snippet after this list)
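To make point 3 concrete, here is a small check (hash_in_new_interpreter is just a throwaway helper for this illustration): str hashes differ across interpreter processes by default, and fixing them requires setting PYTHONHASHSEED before the process starts.

import os
import subprocess
import sys

def hash_in_new_interpreter(env=None):
    # Spawn a fresh interpreter and report what it computes for hash("bert").
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash("bert"))'],
        capture_output=True, text=True, env=env,
    )
    return int(result.stdout)

# With hash randomization enabled (the default), two runs will almost always disagree.
print(hash_in_new_interpreter(), hash_in_new_interpreter())

# PYTHONHASHSEED makes str/bytes hashes reproducible, but only because it is read
# at interpreter startup; it cannot be changed from inside a running process.
env = {**os.environ, "PYTHONHASHSEED": "0"}
assert hash_in_new_interpreter(env) == hash_in_new_interpreter(env)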

I propose the following implementation:

  1. At hash time in Tango, we just want the object's instantaneous hash and do not actually care whether objects are immutable. So we can manually hash the mutable built-in types. Also, since mutable types may be nested, we can hash recursively.

  2. We can simply fall back to pickle serialization for objects that are not hashable. I am assuming that collisions between the __hash__-based hashes and the pickle-based digests are negligible.

  3. Two suggestions:

    3.1. We could write a simple C extension to temporarily disable hash randomization (i.e., force deterministic hashing) inside Tango's hashing function. (Note: I'm not personally familiar with CPython internals; I'm just assuming the hash seed can be changed at runtime.) Then we would have deterministic hashing for all types that implement __hash__.

    3.2. Alternatively, we could write a custom function with deterministic hashing logic for all built-in Python types. This would be recursive (like rec_hash below) to handle nested data structures. But custom classes that do implement __hash__ would not be supported.

Putting that together (using option 3.1), that's:

import collections.abc
import hashlib
import io
from typing import Any

import base58


def rec_hash(o: Any) -> int:
    if isinstance(o, (str, bytes)):
        # handle strings/bytes before the Sequence check, since recursing into
        # their elements would never terminate
        return hash(o)
    elif isinstance(o, collections.abc.Sequence):  # tuples, lists, ranges
        return hash(tuple(rec_hash(x) for x in o))
    elif isinstance(o, set):
        # set elements are guaranteed to be hashable
        return hash(frozenset(o))
    elif isinstance(o, dict):
        # dict keys are guaranteed to be hashable
        return hash(tuple(sorted(hash((k, rec_hash(v))) for k, v in o.items())))
    elif isinstance(o, collections.abc.Hashable):
        # nested types may be un-hashable, which raises TypeError
        return hash(o)
    raise TypeError(f"unhashable type: '{type(o).__name__}'")


def det_hash(o: Any) -> str:
    try:
        with hash_seed(0):  # TODO: need to implement this
            h = rec_hash(o)
        # hash() may be negative; map it to an unsigned value before encoding
        return base58.b58encode_int(h & 0xFFFFFFFFFFFFFFFF).decode()
    except TypeError:
        pass

    # Fall back to pickling, as Tango's current det_hash does
    m = hashlib.blake2b()
    with io.BytesIO() as buffer:
        pickler = _DetHashPickler(buffer)
        pickler.dump(o)
        m.update(buffer.getbuffer())
        return base58.b58encode(m.digest()).decode()
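And for comparison, a rough sketch of what option 3.2 might look like (det_builtin_hash is just a placeholder name): a recursive digest over built-in types only, computed with hashlib so it never touches the interpreter's randomized hash() at all. It keys scalars on their type, so e.g. 1 and 1.0 would hash differently; that could be refined.

import hashlib
from typing import Any


def det_builtin_hash(o: Any) -> bytes:
    """Deterministic digest for (possibly nested) built-in types, independent of PYTHONHASHSEED."""
    h = hashlib.blake2b()
    if isinstance(o, (str, bytes, int, float, bool, type(None))):
        # scalars: digest the type name plus repr, which is stable across runs
        h.update(f"{type(o).__name__}:{o!r}".encode())
    elif isinstance(o, (list, tuple)):
        h.update(type(o).__name__.encode())
        for x in o:
            h.update(det_builtin_hash(x))
    elif isinstance(o, (set, frozenset)):
        # order-independent: sort the element digests before combining
        h.update(b"set")
        for d in sorted(det_builtin_hash(x) for x in o):
            h.update(d)
    elif isinstance(o, dict):
        # order-independent over (key, value) digests
        h.update(b"dict")
        for d in sorted(det_builtin_hash(k) + det_builtin_hash(v) for k, v in o.items()):
            h.update(d)
    else:
        raise TypeError(f"unsupported type: '{type(o).__name__}'")
    return h.digest()


# The earlier failing example would now hash consistently:
assert det_builtin_hash(["bert", "bert"]) == det_builtin_hash(["bert", "bert "[:-1]])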

Please let me know what you think, thanks!

Versions

Python 3.9.16

ai2-tango==1.2.1 base58==2.1.1 black==23.7.0 boto3==1.28.54 botocore==1.31.54 cached-path==1.4.0 cachetools==5.3.1 certifi==2023.7.22 charset-normalizer==3.2.0 click==8.1.7 click-help-colors==0.9.2 dill==0.3.7 filelock==3.12.4 fsspec==2023.9.2 gitdb==4.0.10 GitPython==3.1.37 glob2==0.7 google-api-core==2.12.0 google-auth==2.23.0 google-cloud-core==2.3.3 google-cloud-storage==2.11.0 google-crc32c==1.5.0 google-resumable-media==2.6.0 googleapis-common-protos==1.60.0 huggingface-hub==0.16.4 idna==3.4 jmespath==1.0.1 markdown-it-py==3.0.0 mdurl==0.1.2 more-itertools==9.1.0 packaging==23.1 petname==2.6 protobuf==4.24.3 pyasn1==0.5.0 pyasn1-modules==0.3.0 Pygments==2.16.1 python-dateutil==2.8.2 pytz==2023.3.post1 PyYAML==6.0.1 requests==2.31.0 rich==13.5.3 rjsonnet==0.5.3 rsa==4.9 s3transfer==0.6.2 six==1.16.0 smmap==5.0.1 sqlitedict==2.1.0 tqdm==4.66.1 typing_extensions==4.8.0 urllib3==1.26.16 xxhash==3.3.0