audeering / audformat

Format to store media files and annotations
https://audeering.github.io/audformat/

Provide table dataframe hashing in public API? #447

Open hagenw opened 1 week ago

hagenw commented 1 week ago

In https://github.com/audeering/audformat/pull/419 we introduced storing tables as parquet files, and created a custom hashing method for them, as audformat.utils.hash() does not consider column names and does not provide consistent hashes across different pandas versions.
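The column-name limitation can be illustrated with pandas.util.hash_pandas_object(), which (to my understanding) is what audformat.utils.hash() builds on; this is a hedged sketch, not audformat code:

```python
import pandas as pd

# Two dataframes with identical values but different column names
df1 = pd.DataFrame({"a": [1, 2, 3]})
df2 = pd.DataFrame({"b": [1, 2, 3]})

# hash_pandas_object() hashes row values (and index) only,
# so the column name difference is invisible to it
h1 = pd.util.hash_pandas_object(df1)
h2 = pd.util.hash_pandas_object(df2)
assert h1.equals(h2)  # same hashes despite different column names
```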

The hashing steps, which could be added to a public function, are:

https://github.com/audeering/audformat/blob/c1328079be136a1801759891d851ca03953f4dad/audformat/core/table.py#L1137-L1142

The only downside might be that we already have audformat.utils.hash(), which at the moment is meant for index, series, and dataframe objects. In our own tools, we only use the hashing of an index. So the question arises how we should name the new hash function, and how it should be positioned with regard to audformat.utils.hash().

hagenw commented 1 week ago

@ChristianGeng any thoughts on this?

ChristianGeng commented 1 week ago

> The hashing steps, which could be added to a public function, are:

My understanding of "public function" is that it does not sit in utils but in the table module, which is what an audformat user is supposed to program against.

For me the main difference lies in the fact that audformat.utils.hash() pertains to pandas objects, whereas the table hash is a property of an audformat table. Speaking of properties: why not define it as a property in the Base table class?

So would it not be enough to mention in the docstring of audformat.utils.hash() that it is meant for pandas objects? This is already clear from the type hint, but one could possibly make it more foolproof.
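The property suggestion could be sketched like this (hypothetical; the class name and the hashing body are placeholders, not audformat's actual implementation):

```python
import hashlib

import pandas as pd


class Table:
    """Minimal stand-in for audformat's table base class (hypothetical)."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    @property
    def hash(self) -> str:
        # Placeholder hashing body;
        # the actual proposal hashes the pyarrow schema and column values
        return hashlib.md5(self.df.to_csv().encode()).hexdigest()


table = Table(pd.DataFrame({"a": [1, 2, 3]}))
assert len(table.hash) == 32  # MD5 hexdigest
assert table.hash == table.hash  # deterministic
```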

hagenw commented 1 week ago

> The hashing steps, which could be added to a public function, are:

> My understanding of "public function" is that it does not sit in utils but in the table module, which is what an audformat user is supposed to program against.

Public means that it is part of the public API (e.g. available outside of the audformat.core namespace, and included in the documentation). It doesn't matter whether the new function is stored under audformat.utils or at another place (of course we should pick a meaningful place).

hagenw commented 1 week ago

> For me the main difference lies in the fact that audformat.utils.hash() pertains to pandas objects, whereas the table hash is a property of an audformat table. Speaking of properties: why not define it as a property in the Base table class?

The new hashing function is not bound to the table, but would also work on arbitrary dataframes, and does not need to access any other part of an audformat.Table object:

import hashlib

import pandas as pd
import pyarrow as pa


def hash_dataframe(df: pd.DataFrame) -> str:
    r"""MD5 hash of dataframe based on its content, index, and columns."""
    table = pa.Table.from_pandas(df.reset_index(), preserve_index=False)
    md5 = hashlib.md5()
    md5.update(_schema_hash(table))
    md5.update(_dataframe_hash(df))
    return md5.hexdigest()

def _dataframe_hash(df: pd.DataFrame) -> bytes:
    """Hash a dataframe.

    The hash value takes into account:

    * index of dataframe
    * values of the dataframe
    * order of dataframe rows

    It does not consider:

    * column names of dataframe
    * dtypes of dataframe

    Args:
        df: dataframe

    Returns:
        MD5 hash in bytes

    """
    md5 = hashlib.md5()
    for _, y in df.reset_index().items():
        # Convert every column to a numpy array,
        # and hash its string representation
        if y.dtype == "Int64":
            # Enforce consistent conversion to numpy.array
            # for integers across different pandas versions
            # (since pandas 2.2.x, Int64 is converted to float if it contains <NA>)
            y = y.astype("float")
        md5.update(bytes(str(y.to_numpy()), "utf-8"))
    return md5.digest()

def _schema_hash(table: pa.Table) -> bytes:
    r"""Hash pyarrow table schema.

    Args:
        table: pyarrow table

    Returns:
        MD5 hash in bytes

    """
    schema_str = table.schema.to_string(
        # schema.metadata contains pandas related information,
        # and the used pyarrow and pandas version,
        # and needs to be excluded
        show_field_metadata=False,
        show_schema_metadata=False,
    )
    return hashlib.md5(schema_str.encode()).digest()

This could maybe be simplified to:

def dataframe_hash(df: pd.DataFrame) -> str:
    """Hash a dataframe.

    The hash value takes into account:

    * index of dataframe
    * values of the dataframe
    * order of dataframe rows
    * column names of dataframe
    * dtypes of dataframe

    Args:
        df: dataframe

    Returns:
        MD5 hash in hexadecimal notation

    """
    md5 = hashlib.md5()
    # Handle column names and dtypes
    table = pa.Table.from_pandas(df.reset_index(), preserve_index=False)
    schema_str = table.schema.to_string(
        # schema.metadata contains pandas related information,
        # and the used pyarrow and pandas version,
        # and needs to be excluded
        show_field_metadata=False,
        show_schema_metadata=False,
    )
    md5.update(schema_str.encode())
    # Handle index, values, and row order
    for _, y in df.reset_index().items():
        # Convert every column to a numpy array,
        # and hash its string representation
        if y.dtype == "Int64":
            # Enforce consistent conversion to numpy.array
            # for integers across different pandas versions
            # (since pandas 2.2.x, Int64 is converted to float if it contains <NA>)
            y = y.astype("float")
        md5.update(bytes(str(y.to_numpy()), "utf-8"))
    return md5.hexdigest()

ChristianGeng commented 1 week ago

Understood. I now see that it is necessary to expose hash_dataframe in addition to audformat.utils.hash(). And as it can be used outside of the context of a table, this creates confusion, as both would accept a pd.DataFrame, so utils would also be a natural location to implement it.

The above implementation would create a new function hash_dataframe that would coexist with audformat.utils.hash(), which would be ugly. Would it make sense to add a new keyword argument to audformat.utils.hash() where you can request a kind (or flavor) of hashing in the public API that defaults to the "old" behaviour?

E.g.


def hash(obj, behavior="old"):
    match behavior:
        case "old":
            return return_as_before(obj)  # current behaviour
        case "new":
            return dataframe_hash(obj)
        case _:
            raise ValueError("only the old and new behavior are allowed")

On the other hand, the old version accepts all kinds of pandas objects, not only dataframes. I find it hard to get this right and to actually decide.

I hope that I now get the question right :-)

hagenw commented 1 week ago

Yes, you got the question right.

Adding a new keyword to audformat.utils.hash() sounds like a good idea to me. Then we could even change the default of that keyword at some point in the future. audformat.utils.hash() is meant to work only on pandas index, series, and dataframe objects. All of those could be converted internally to a dataframe, so they should all work with the new implementation as well.
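The internal conversion mentioned above could look roughly like this (a sketch, not audformat code; the helper name is made up):

```python
import pandas as pd


def as_dataframe(obj) -> pd.DataFrame:
    # Normalize index, series, and dataframe to a dataframe
    # before applying the new hashing
    if isinstance(obj, pd.Index):
        return obj.to_frame(index=False)
    if isinstance(obj, pd.Series):
        return obj.to_frame()
    return obj


assert isinstance(as_dataframe(pd.Index([1, 2])), pd.DataFrame)
assert isinstance(as_dataframe(pd.Series([1, 2])), pd.DataFrame)
assert isinstance(as_dataframe(pd.DataFrame({"a": [1]})), pd.DataFrame)
```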