audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/

Updating and publishing databases without `parquet` fails with 1.7.2 #414

Closed maxschmitt closed 1 month ago

maxschmitt commented 1 month ago

I am using Python 3.10, audb==1.7.2.

Traceback (most recent call last):
  File "/home/user/mydb/2.0.0/publish.py", line 15, in <module>
    audb.publish(
  File "/home/user/mydb/2.0.0/myenv/lib/python3.10/site-packages/audb/core/publish.py", line 698, in publish
    raise RuntimeError(
RuntimeError: You want to depend on '1.0.0' of mydb, but the dependency file 'db.parquet' in /home/user/mydb/build does not match the dependency file for the requested version in the repository. Did you forgot to call 'audb.load_to(/home/user/mydb/build, mydb, version='1.0.0') or modified the file manually?

Switching back to audb==1.6.5, where no db.parquet file is generated for the updated database, made it work.

The corresponding database has only misc_table (in case this is relevant).

schruefer commented 1 month ago

I have the same problem with Python 3.8 and audb versions 1.7.1 and 1.7.2:

Traceback (most recent call last):
  File "/home/user/projects/mydb/3.3.0/publish.py", line 11, in <module>
    audb.publish(
  File "/home/user/projects/mydb/.venv/lib/python3.8/site-packages/audb/core/publish.py", line 698, in publish
    raise RuntimeError(
RuntimeError: You want to depend on '3.2.0' of mydb, but the dependency file 'db.parquet' in ./build does not match the dependency file for the requested version in the repository. Did you forgot to call 'audb.load_to(./build, mydb, version='3.2.0') or modified the file manually?

When I try using audb==1.6.5 I get the backend error:

  File "/home/user/projects/mydb/.venv/lib/python3.8/site-packages/audbackend/core/backend/base.py", line 407, in exists
    raise RuntimeError(backend_not_opened_error)
RuntimeError: Call 'Backend.open()' to establish a connection to the repository first.

hagenw commented 1 month ago

Thanks for reporting, I will have a look at it.

If you want to use audb==1.6.5 as a workaround, you need to ensure you use audbackend<2.0.0 as well:

$ pip install "audb==1.6.5"
$ pip install "audbackend<2.0.0"

hagenw commented 1 month ago

A minimal example to reproduce this error:

create.py

import audb
import audeer

build_dir = "./build"
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

db = audb.load_to(build_dir, "emodb", version="1.4.1", only_metadata=True)
db.description = "new"
db.save(build_dir)

publish.py

import audb
import audeer

build_dir = audeer.path("./build")

repo = "repo"
host = audeer.path("./host")
audeer.rmdir(host)
audeer.mkdir(host, repo)

repository = audb.Repository(repo, host, "file-system")
audb.publish(build_dir, "1.5.0", repository, previous_version="1.4.1")

Then we get:

$ python create.py
$ python publish.py
Traceback (most recent call last):
  File "/home/hwierstorf/tmp/audb-update-bug/publish.py", line 13, in <module>
    audb.publish(build_dir, "1.5.0", repository, previous_version="1.4.1")
  File "/home/hwierstorf/.envs/audb-update-bug/lib/python3.10/site-packages/audb/core/publish.py", line 698, in publish
    raise RuntimeError(
RuntimeError: You want to depend on '1.4.1' of emodb, but the dependency file 'db.parquet' in /home/hwierstorf/tmp/audb-update-bug/build does not match the dependency file for the requested version in the repository. Did you forgot to call 'audb.load_to(/home/hwierstorf/tmp/audb-update-bug/build, emodb, version='1.4.1') or modified the file manually?

If I delete the cache before trying to publish the new version it works:

$ rm -rf ~/audb/emodb/1.4.1/
$ python create.py
$ python publish.py
$ tree host/repo/
host/repo/
└── emodb
    └── 1.5.0
        ├── db.parquet
        └── db.yaml

2 directories, 2 files

So, the error seems to be related to https://github.com/audeering/audb/issues/402

hagenw commented 1 month ago

The problem arises from our implementation of audb.Dependencies.__eq__(), which compares the dataframes of the dependency tables:

https://github.com/audeering/audb/blob/44df511ff8709b505e4cc3cad44ac74349bb8567/audb/core/dependencies.py#L107-L117

The original dependency table of version 1.4.1 was stored with audb<1.7, so when it is loaded its string dtype is string[python], whereas the new dependency table has string[pyarrow]. Asserting that the two dataframes are equal then fails with:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="archive") are different

Attribute "dtype" are different
[left]:  string[python]
[right]: string[pyarrow]

So, I guess we should update the implementation of audb.Dependencies.__eq__() to ensure backward compatibility.
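
The mismatch can be reproduced with plain pandas, independent of audb. The following is a minimal sketch, not the audb code: the column values are made-up placeholders, and when pyarrow is not installed it falls back to the object dtype as a stand-in for string[pyarrow] (an assumption made only so the sketch runs everywhere).

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401  (needed for the pyarrow-backed string dtype)
    new_dtype = pd.StringDtype("pyarrow")
except ImportError:
    # Stand-in dtype so the sketch also runs without pyarrow
    new_dtype = "object"

# Placeholder entries, not taken from a real dependency table
archives = ["archive-1", "archive-2"]

# "old" mimics a table stored with audb<1.7, "new" one created with audb>=1.7
old = pd.DataFrame({"archive": archives}).astype(pd.StringDtype("python"))
new = pd.DataFrame({"archive": archives}).astype(new_dtype)

# DataFrame.equals() requires matching dtypes, not only matching values
strict = old.equals(new)
same_values = old["archive"].tolist() == new["archive"].tolist()

print(old.dtypes["archive"], new.dtypes["archive"])
print(f"equals(): {strict}, same values: {same_values}")
```

Both frames hold identical strings, yet equals() reports a difference because the column dtypes differ.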

hagenw commented 1 month ago

One solution might be to ignore the dtypes of the dataframes, e.g. changing

         return self._df.equals(other._df)

to

         return self._df.equals(other._df.astype(self._df.dtypes))

maxschmitt commented 1 month ago

Maybe a warning could be printed stating that the dataframes do not have the same dtypes, probably because the previous version was created with audb<1.7, and the comparison that ignores dtypes could then be used as a fallback.
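
A rough sketch of what such a fallback could look like, written against plain pandas dataframes rather than audb.Dependencies. The function name equals_with_fallback and the warning text are hypothetical, and the object dtype is used here as a stand-in for string[pyarrow] so the example does not require pyarrow.

```python
import warnings

import pandas as pd


def equals_with_fallback(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Compare strictly first, then retry with dtypes aligned to `left`."""
    if left.equals(right):
        return True
    try:
        # Cast the columns of `right` to the dtypes used in `left`
        aligned = right.astype(left.dtypes)
    except (KeyError, TypeError, ValueError):
        # Columns differ or values cannot be cast: the tables are not equal
        return False
    if left.equals(aligned):
        warnings.warn(
            "Dataframes differ only in their dtypes; "
            "the previous version was probably created with audb<1.7.",
            UserWarning,
        )
        return True
    return False


# Placeholder data: same values, different dtypes
old = pd.DataFrame({"archive": ["a1", "a2"]}).astype(pd.StringDtype("python"))
new = pd.DataFrame({"archive": ["a1", "a2"]}).astype("object")
changed = pd.DataFrame({"archive": ["a1", "a3"]}).astype("object")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = equals_with_fallback(old, new)
    different = equals_with_fallback(old, changed)
```

In this sketch the warning is emitted only when the strict comparison fails and the dtype-aligned one succeeds; a real difference in values still makes the comparison return False.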

hagenw commented 1 month ago

As the goal is backward compatibility with existing cache files, I think we don't need to show a warning. The user cannot directly change the dtypes of the entries in the dependency table anyway, as the file is always created by audb, so the user shouldn't need to worry about it.