audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/

Ensure correct dtypes in dependency table #416

Closed · hagenw closed this 1 month ago

hagenw commented 1 month ago

Closes #414

Before, the dtypes of a dependency table loaded from a cache file could depend on the audb version that was used to store the cache file. This caused issues, as Dependencies.__eq__() requires that the dtypes match as well.

In #414, I proposed to relax Dependencies.__eq__() so that it does not compare the dtypes. But I think it is better to ensure that a loaded dependency table always has the same dtypes. I updated the code here to set the correct dtypes when loading a dependency table from cache.
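As a minimal sketch of why the dtypes matter (assuming, as the wording above suggests, that Dependencies.__eq__() compares the underlying data frames with pandas.DataFrame.equals(), which is dtype-sensitive):

    import pandas as pd

    # Two frames with identical values but different dtypes are not equal.
    # This is why a cached dependency table whose dtypes depend on the audb
    # version that wrote it can break equality checks.
    a = pd.DataFrame({"channels": [2]}, dtype="int32")
    b = pd.DataFrame({"channels": [2]}, dtype="int64")
    print(a.equals(b))  # False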

To achieve this, I introduced a new static method audb.Dependencies._set_dtypes(), which is reused in four different places.
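A minimal sketch of what such a method could look like; the column names and dtypes below are illustrative placeholders, not the actual values from the define module, and the real implementation in this pull request may differ:

    import pandas as pd

    # Illustrative subset of column names and dtypes; the real values are
    # defined centrally (compare define.DEPEND_FIELD_NAMES and
    # define.DEPEND_FIELD_DTYPES mentioned below)
    DEPEND_FIELD_NAMES = ["archive", "bit_depth", "channels", "sampling_rate", "version"]
    DEPEND_FIELD_DTYPES = ["string", "int32", "int32", "int32", "string"]


    def _set_dtypes(df: pd.DataFrame) -> pd.DataFrame:
        r"""Cast a dependency table to the expected column dtypes."""
        mapping = {
            column: dtype
            for column, dtype in zip(DEPEND_FIELD_NAMES, DEPEND_FIELD_DTYPES)
        }
        return df.astype(mapping)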

hagenw commented 1 month ago

> Data-structure-wise: currently we have two separate list-likes, define.DEPEND_FIELD_NAMES and define.DEPEND_FIELD_DTYPES, that have no inner relationship between columns and types. pyarrow uses a list of 2-tuples, polars an ordered dict, to map columns to types. These representations have, imo, a clearer mapping between column name and column type. Do we have a compelling reason for not having this association built into the typing? In fact, the implementation of _set_dtypes then creates this mapping (in a dict-like fashion):
>
>     mapping = {column: dtype for column, dtype in zip(columns, dtypes)}

Thanks for this feedback. I totally agree with your observation and would also propose to update our current structure. As this involves a bit more work and is not directly related to #414, which we are trying to solve here, I would propose to do it in another pull request. I have created https://github.com/audeering/audb/issues/419 to track this.
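For illustration only (a sketch, not code from this repository): a pyarrow schema keeps the column-name-to-type association in one place, and the dict-like mapping from the quoted comment falls out of it directly.

    import pyarrow as pa

    # A pyarrow schema stores column name and type together as pairs,
    # instead of two parallel lists that must be zipped afterwards
    schema = pa.schema(
        [
            ("archive", pa.string()),
            ("bit_depth", pa.int32()),
            ("channels", pa.int32()),
        ]
    )

    # The dict-like mapping from the quoted comment falls out directly
    mapping = {field.name: field.type for field in schema}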

hagenw commented 1 month ago

> I would in principle approve if there wasn't a failing test

That's interesting. Can you provide a bit more feedback, e.g. which Python version did you use, what is the output when running the test with -vv, and does the test also fail on the main branch, or only here?

Update: please also report the pandas version you are using; maybe they changed how a dataframe is represented as a string?

ChristianGeng commented 1 month ago

> > I would in principle approve if there wasn't a failing test
>
> That's interesting. Can you provide a bit more feedback, e.g. which Python version did you use, what is the output when running the test with -vv, and does the test also fail on the main branch, or only here?
>
> Update: please also report the pandas version you are using; maybe they changed how a dataframe is represented as a string?

Environment is this:

Using -vv only reports the same diagnostics as above, so I have captured the values at the time the equality is asserted.

expected_str

'               archive  bit_depth  channels  ... sampling_rate  type version\ndb.files.csv  archive1          0         0  ...             0     0   1.0.0\nfile.wav      archive2         16         2  ...         16000     1   1.0.0\n\n[2 rows x 10 columns]'

str(deps)

'               archive  bit_depth  channels                          checksum  duration format  removed  sampling_rate  type version\ndb.files.csv  archive1          0         0  7c1f6b568f7221ab968a705fd5e7477b       0.0    csv        0              0     0   1.0.0\nfile.wav      archive2         16         2  917338b854ad9c72f76bc9a68818dcd8      1.23    wav        0          16000     1   1.0.0'

In other words, the difference seems to be the [2 rows x 10 columns] at the end. When I remove this line from the expected string, the test goes back to green.

hagenw commented 1 month ago

That's interesting.

One way to generate this difference in output is to use pandas.DataFrame.to_string() explicitly:

>>> print(deps._df)
               archive  bit_depth  channels  ... sampling_rate  type version
db.files.csv  archive1          0         0  ...             0     0   1.0.0
file.wav      archive2         16         2  ...         16000     1   1.0.0

[2 rows x 10 columns]

>>> print(deps._df.to_string())
               archive  bit_depth  channels                          checksum  duration format  removed  sampling_rate  type version
db.files.csv  archive1          0         0  7c1f6b568f7221ab968a705fd5e7477b      0.00    csv        0              0     0   1.0.0
file.wav      archive2         16         2  917338b854ad9c72f76bc9a68818dcd8      1.23    wav        0          16000     1   1.0.0

So, for whatever reason, str(df) defaults to df.to_string() for you, but not for me.
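A possible explanation (not confirmed in this thread, just a sketch) is that the truncated representation and its [2 rows x 10 columns] footer depend on pandas display options, which can differ between environments:

    import pandas as pd

    # Whether str(df) truncates columns and appends the "[2 rows x 10 columns]"
    # footer depends on display options such as display.max_columns
    df = pd.DataFrame({f"col{i}": [0, 1] for i in range(10)})

    with pd.option_context("display.max_columns", 5):
        print(df)  # truncated with "..." and a "[2 rows x 10 columns]" footer

    with pd.option_context("display.max_columns", None):
        print(df)  # all columns shown, no footer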

We could change our current implementation of https://github.com/audeering/audb/blob/44df511ff8709b505e4cc3cad44ac74349bb8567/audb/core/dependencies.py#L135-L136 to

    def __str__(self) -> str:  # noqa: D105
        return self._df.to_string()

But I think it is better to just relax the test slightly.

hagenw commented 1 month ago

I created https://github.com/audeering/audb/pull/423 to tackle the problem of the failing test.

I would propose that you continue here with the review of the actual changes.

ChristianGeng commented 1 month ago

> I created #423 to tackle the problem of the failing test.
>
> I would propose that you continue here with the review of the actual changes.

There isn't anything to continue. The design-related feedback that was, in my opinion, most important has resulted in a follow-up issue. With a reorganized scheme, one could cast an entire data frame to that scheme.

I am happy with the code per se and I see its necessity. The testing problems I had have also been addressed. You can consider this approved.