Closed hagenw closed 1 month ago
Datastructure-wise: Currently we have two separate list-likes
define.DEPEND_FIELD_NAMES
anddefine.DEPEND_FIELD_DTYPES
that have no inner relationship between columns and types. Pyarrow uses a list of 2-tuples, polars an ordered dict to map columns to types. These representations have imo a clearer mapping between column name and column type. Do we have a compelling reason for not having this association built into the typing? In fact, the implementation of_set_dtypes
then creates this mapping (in a dict like fashion):
mapping = {column: dtype for column, dtype in zip(columns, dtypes)}
Thanks for this feedback. I totally agree with your observation, and would also propose to update our current structure. As this involves a little bit more work, and is not directly related with #414, which we are trying to solve here, I would propose to do it in another pull request, and have created https://github.com/audeering/audb/issues/419 to track this.
I would in principle approve if there wasn't a failing test
That's interesting. Can you provide a little bit more feedback, e.g. which Python version did you use; show the output when running the test with -vv
; does the test also fail for the main branch, or only here?
Update: please also report your used pandas
version, maybe they changed how a dataframe is represented as string?
I would in principle approve if there wasn't a failing test
That's interesting. Can you provide a little bit more feedback, e.g. which Python version did you use; show the output when running the test with
-vv
; does the test also fail for the main branch, or only here?Update: please also report your used
pandas
version, maybe they changed how a dataframe is represented as string?
Environment is this:
Using the -vv
only reports the same as above diagnostically..
So I have captured the values at the time when identity is asserted.
expected_str
' archive bit_depth channels ... sampling_rate type version\ndb.files.csv archive1 0 0 ... 0 0 1.0.0\nfile.wav archive2 16 2 ... 16000 1 1.0.0\n\n[2 rows x 10 columns]'
str(deps)
str(deps)
' archive bit_depth channels checksum duration format removed sampling_rate type version\ndb.files.csv archive1 0 0 7c1f6b568f7221ab968a705fd5e7477b 0.0 csv 0 0 0 1.0.0\nfile.wav archive2 16 2 917338b854ad9c72f76bc9a68818dcd8 1.23 wav 0 16000 1 1.0.0'
In other words, the difference seems to be the [2 rows x 10 columns]
at the end. When I remove this line in expeced, test goes back to green.
That's interesting.
One way to generate this difference in output is to use pandas.Dataframe.to_string()
explicitely:
>>> print(deps._df)
archive bit_depth channels ... sampling_rate type version
db.files.csv archive1 0 0 ... 0 0 1.0.0
file.wav archive2 16 2 ... 16000 1 1.0.0
[2 rows x 10 columns]
>>> print(deps._df.to_string())
archive bit_depth channels checksum duration format removed sampling_rate type version
db.files.csv archive1 0 0 7c1f6b568f7221ab968a705fd5e7477b 0.00 csv 0 0 0 1.0.0
file.wav archive2 16 2 917338b854ad9c72f76bc9a68818dcd8 1.23 wav 0 16000 1 1.0.0
So, for whatever reason, for you str(df)
defaults to df.to_string()
, but not for me.
We could change our current implementation of https://github.com/audeering/audb/blob/44df511ff8709b505e4cc3cad44ac74349bb8567/audb/core/dependencies.py#L135-L136 to
def __str__(self) -> str: # noqa: D105
return str(self._df.to_string())
But I think it is better to just relax the test slightly.
I created https://github.com/audeering/audb/pull/423 to tackle the problem of the failing test.
I would propose you can continue here with the review of the actual changes.
I created #423 to tackle the problem of the failing test.
I would propose you can continue here with the review of the actual changes.
There isn't anything to continue. The design-related stuff that was imo most important has resulted in a follow up. With a reorganized scheme, one could cast an entire data frame to that scheme.
With the code per se I am happy and I see its necessity. Testing problems I had have also been addressed. You can see this as approved.
Closes #414
Before, it could happen that the dtypes of a dependency table loaded from cache file, depend on the
audb
version used to store the cache file. This resulted in issues, asDependencies.__eq__()
require that the dtypes match as well.In #414, I proposed to relax
Dependencies.__eq__()
to not compare the dtypes. But I think it is better to ensure, that the loaded dependency table has always the same dtypes. I updated the code here, to set the correct dtypes, when loading a dependency table from cache.To achieve this, I introduced a new static method
audb.Dependencies._set_dtypes()
, which is reused at 4 different positions.