Erotemic / ubelt

A Python utility library with a stdlib like feel and extra batteries. Paths, Progress, Dicts, Downloads, Caching, Hashing: ubelt makes it easy!
Apache License 2.0

test_numpy_object_array fails: TypeError: directly hashing ndarrays with dtype=object is unstable #149

Closed: yurivict closed this issue 1 year ago

yurivict commented 1 year ago

Describe the bug


――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― test_numpy_object_array ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

    def test_numpy_object_array():
        """
        _HASHABLE_EXTENSIONS = ub.util_hash._HASHABLE_EXTENSIONS
        """
        if np is None:
            pytest.skip('requires numpy')
        # An object array should have the same repr as a list of a tuple of data
        data = np.array([1, 2, 3], dtype=object)
>       objhash = ub.hash_data(data)

tests/test_hash.py:245: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ubelt/util_hash.py:1107: in hash_data
    _update_hasher(hasher, data, types=types, extensions=extensions)
ubelt/util_hash.py:953: in _update_hasher
    prefix, hashable = _convert_to_hashable(data, types,
ubelt/util_hash.py:875: in _convert_to_hashable
    prefix, hashable = hash_func(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data = array([1, 2, 3], dtype=object)

    @self.register(np.ndarray)
    def _convert_numpy_array(data):
        """
        Example:
            >>> import ubelt as ub
            >>> if not ub.modname_to_modpath('numpy'):
            ...     raise pytest.skip()
            >>> import numpy as np
            >>> data_f32 = np.zeros((3, 3, 3), dtype=np.float64)
            >>> data_i64 = np.zeros((3, 3, 3), dtype=np.int64)
            >>> data_i32 = np.zeros((3, 3, 3), dtype=np.int32)
            >>> hash_f64 = _hashable_sequence(data_f32, types=True)
            >>> hash_i64 = _hashable_sequence(data_i64, types=True)
            >>> hash_i32 = _hashable_sequence(data_i64, types=True)
            >>> assert hash_i64 != hash_f64
            >>> assert hash_i64 != hash_i32
        """
        if data.dtype.kind == 'O':
            msg = 'directly hashing ndarrays with dtype=object is unstable'
>           raise TypeError(msg)
E           TypeError: directly hashing ndarrays with dtype=object is unstable

ubelt/util_hash.py:546: TypeError

Version: ubelt 1.3.0, Python 3.9, FreeBSD 13.2

Erotemic commented 1 year ago

Thanks for the report!

This is indeed a bug, and a difficult one to spot. As the error message says, numpy object arrays cannot be hashed directly; what ubelt should do instead is interpret the array as an iterable and effectively treat it like a list. The bug is that if the first call you make to hash_data is with an object array, the extension that registers how to handle numpy arrays has not been initialized yet, so the iterable check does not recognize the array. I've made a patch for 1.3.1 that ensures the extensions are initialized before this iterable check happens: https://github.com/Erotemic/ubelt/pull/148
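
To illustrate the ordering problem described above, here is a minimal standard-library sketch. It is not ubelt's actual implementation: `FakeObjectArray`, `LazyExtensions`, `ensure_ready`, the `init_first` flag, and this toy `hash_data` are all hypothetical names invented for the example. It only assumes what the comment states, namely that the "treat it like an iterable" rule is registered lazily and that the iterable check can run before that registration.

```python
import hashlib


class FakeObjectArray:
    """Hypothetical stand-in for a numpy array with dtype=object."""

    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        return iter(self.items)


class LazyExtensions:
    """Hypothetical registry whose rules are only set up on first use."""

    def __init__(self):
        self.iterable_checks = []
        self._ready = False

    def ensure_ready(self):
        # Register the "object arrays are iterables" rule exactly once.
        if not self._ready:
            self.iterable_checks.append(
                lambda obj: isinstance(obj, FakeObjectArray))
            self._ready = True

    def lookup(self, data):
        # Lazy initialization happens here, which is too late for the
        # iterable check performed by the caller.
        self.ensure_ready()
        if isinstance(data, FakeObjectArray):
            raise TypeError('directly hashing object arrays is unstable')
        if isinstance(data, int):
            return b'INT', str(data).encode()
        raise TypeError('no hash rule for {!r}'.format(type(data)))


def _update(hasher, data, ext):
    # The iterable check consults rules that may not be registered yet.
    is_iterable = isinstance(data, (list, tuple)) or any(
        check(data) for check in ext.iterable_checks)
    if is_iterable:
        hasher.update(b'[')
        for item in data:
            _update(hasher, item, ext)
            hasher.update(b',')
        hasher.update(b']')
    else:
        prefix, payload = ext.lookup(data)
        hasher.update(prefix + payload)


def hash_data(data, ext, init_first):
    # init_first=True models the fix: initialize before the iterable check.
    if init_first:
        ext.ensure_ready()
    hasher = hashlib.sha256()
    _update(hasher, data, ext)
    return hasher.hexdigest()
```

In this sketch, calling `hash_data(FakeObjectArray([1, 2, 3]), LazyExtensions(), init_first=False)` raises `TypeError` on a fresh registry, while a second call on the same registry succeeds because the failed lookup already triggered initialization. That mirrors why the failure only shows up when the object array is the first thing hashed. With `init_first=True`, the first call succeeds and yields the same digest as hashing the list `[1, 2, 3]`.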