audeering / audformat

Format to store media files and annotations
https://audeering.github.io/audformat/

audformat.utils.hash() changes with different pandas versions #433

Closed hagenw closed 3 months ago

hagenw commented 3 months ago

Assume the following dataframe:

import audformat
import pandas as pd

index = pd.Index([0, 1, 2], dtype="Int64", name="speaker")
df = pd.DataFrame([34, 56, 13], index=index, columns=["age"])

With pandas 2.1.0 we get:

>>> audformat.utils.hash(df)
'4310450105456762724'

With pandas 2.2.2 we get:

>>> audformat.utils.hash(df)
'6045940399858359459'

We had a similar problem before, see https://github.com/pandas-dev/pandas/issues/55452. Revisiting the example presented there:

index = pd.Index(['f1', 'f2'], dtype='string', name='file')

I can reproduce the results with pandas 2.1.0:

>>> pd.util.hash_pandas_object(index).sum()
14215128657272711653
>>> pd.util.hash_pandas_object(index).astype("int64").sum()
-4231615416436839963

With pandas 2.2.2 we get the same results:

>>> pd.util.hash_pandas_object(index).sum()
14215128657272711653
>>> pd.util.hash_pandas_object(index).astype("int64").sum()
-4231615416436839963

But when applying it to the dataframe from above, the results differ between the two versions:

pandas 2.1.0:

>>> pd.util.hash_pandas_object(df).astype("int64")
speaker
0   -4094276622505563603
1     626491167018614217
2    7778235560943712110
dtype: int64

>>> pd.util.hash_pandas_object(df).astype("int64").sum()
4310450105456762724

pandas 2.2.2:

>>> pd.util.hash_pandas_object(df).astype("int64")
speaker
0   -5443393224012433233
1   -5373147090767336438
2   -1584263359071422486
dtype: int64

>>> pd.util.hash_pandas_object(df).astype("int64").sum()
6045940399858359459
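
The numbers above are internally consistent, which can be checked without pandas. The following is a minimal sketch that uses only the values quoted in this comment; `to_int64` is a hypothetical helper that emulates signed 64-bit wraparound in plain Python, as a stand-in for what `.astype("int64").sum()` appears to do here.

```python
MASK64 = (1 << 64) - 1


def to_int64(value: int) -> int:
    """Wrap an arbitrary Python int into the signed 64-bit range."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


# String index example: the uint64 sum and the int64 sum
# are the same number modulo 2**64.
assert to_int64(14215128657272711653) == -4231615416436839963

# Per-row hashes of the dataframe, copied from the outputs above.
rows_2_1_0 = [-4094276622505563603, 626491167018614217, 7778235560943712110]
rows_2_2_2 = [-5443393224012433233, -5373147090767336438, -1584263359071422486]

sum_2_1_0 = to_int64(sum(rows_2_1_0))
sum_2_2_2 = to_int64(sum(rows_2_2_2))

print(sum_2_1_0)  # 4310450105456762724
print(sum_2_2_2)  # 6045940399858359459
```

Both sums match the strings returned by `audformat.utils.hash(df)` at the top, so the difference seems to come from the per-row values of `pd.util.hash_pandas_object()` and not from the summation.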

hagenw commented 3 months ago

I reported it at https://github.com/pandas-dev/pandas/issues/58999

hagenw commented 3 months ago

The underlying calculation was changed in pandas 2.2.0 and this will not be fixed in pandas. So I don't think we can do much about it here.
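
One thing a caller could do is detect when two hashes were computed on different sides of that change. This is only a sketch: `release_tuple` and `same_hash_regime` are hypothetical helpers, not part of audformat or pandas, and the 2.2.0 boundary is taken from this thread and is only known to matter for objects like the `Int64` example above.

```python
from typing import Tuple

# Assumption from this thread: the calculation changed in pandas 2.2.0.
PANDAS_HASH_CHANGE = (2, 2, 0)


def release_tuple(version: str) -> Tuple[int, int, int]:
    """Parse the leading numeric 'X.Y.Z' part of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def same_hash_regime(version_a: str, version_b: str) -> bool:
    """True if both versions lie on the same side of the 2.2.0 change."""
    a_new = release_tuple(version_a) >= PANDAS_HASH_CHANGE
    b_new = release_tuple(version_b) >= PANDAS_HASH_CHANGE
    return a_new == b_new


print(same_hash_regime("2.1.0", "2.2.2"))  # False
print(same_hash_regime("2.2.0", "2.2.2"))  # True
```

In practice the version strings would come from `pd.__version__` at the time each hash was computed.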

The good news is that it will most likely not affect the caching in audinterface, as the results do not change for a filewise or segmented index:

>>> index = audformat.segmented_index(["f1"], [0], [1])
>>> audformat.utils.hash(index)
'5191304663967199877'

But maybe we should add a test for it in audinterface.