audeering / audbackend

Manage file storage on different backends
https://audeering.github.io/audbackend/
Other
3 stars 0 forks source link

Introduce optional validation step #190

Closed frankenjoe closed 7 months ago

frankenjoe commented 7 months ago

Closes #189

Adds validate argument to the following functions:

If set to True a final check is performed to assert that the new file has the expected checksum. If the check fails, an InterruptedError is raised. This comes with additional costs calculating the checksum of the original and new file. In case of move_file() the file is copied and only removed if the check was successful.

DOC

Here are the new docstrings of backend.Base:

image

image

image

image

image

image

Similar changes were made to interface.Unversioned and interface.Versioned

frankenjoe commented 7 months ago

Our tests are not working with pytest==8.1.0 which was released on the weekend.

hagenw commented 7 months ago

Looks to me that not our tests, but doctestplus is not working with the new pytest version:

pluggy._manager.PluginValidationError: Plugin 'doctestplus' for hook 'pytest_collect_file'

So maybe we wait a few days?

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.0%. Comparing base (11e4cda) to head (5c1efb9). Report is 1 commits behind head on dev.

Additional details and impacted files | [Files](https://app.codecov.io/gh/audeering/audbackend/pull/190?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering) | Coverage Δ | | |---|---|---| | [audbackend/core/backend/base.py](https://app.codecov.io/gh/audeering/audbackend/pull/190?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering#diff-YXVkYmFja2VuZC9jb3JlL2JhY2tlbmQvYmFzZS5weQ==) | `100.0% <100.0%> (ø)` | | | [audbackend/core/interface/unversioned.py](https://app.codecov.io/gh/audeering/audbackend/pull/190?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering#diff-YXVkYmFja2VuZC9jb3JlL2ludGVyZmFjZS91bnZlcnNpb25lZC5weQ==) | `100.0% <100.0%> (ø)` | | | [audbackend/core/interface/versioned.py](https://app.codecov.io/gh/audeering/audbackend/pull/190?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering#diff-YXVkYmFja2VuZC9jb3JlL2ludGVyZmFjZS92ZXJzaW9uZWQucHk=) | `100.0% <ø> (ø)` | |
frankenjoe commented 7 months ago

So maybe we wait a few days?

We can wait with a release until it's fixed, in the meanwhile we can work with #192

frankenjoe commented 7 months ago

Just out of curiosity, can it happen from time to time for some backends that a file gets corrupted that you need this feature?

Don't know, depends on the backend I guess :)

But giving it a second thought - the main issue are exceptions raised on the backend. When we download a file we are relatively safe because we only move it to the final destination when the download is finished. But with uploads we can end up with incomplete files. However, the strategy we implement here may not apply if an error is raised before the validation check. So maybe we have to consider this? E.g. we could suppress backend errors if validate=True. If something goes wrong, the original error is not raised, but we will afterward get a InterruptedError and the corrupted file is removed. Maybe we can also catch the original error and print it to give the user a hint what went wrong.

hagenw commented 7 months ago

If something goes wrong, the original error is not raised, but we will afterward get a InterruptedError and the corrupted file is removed

"something wrong" might also mean that we loose connection to the server, which mean we cannot remove the file. I think in that case the backend needs to support an auto-clean up, e.g. we provide the MD5 sum before starting the upload. When we end (or loose) our connection the backend server than needs to check if the MD5 sum matches and removes the file if not. As I don't think that most backends support something like this, I don't think this is a possible solution.

So I think, I would not automatically suppress the backend error if validate=True. If something goes wrong on the backend, the user will have to act anyway.

frankenjoe commented 7 months ago

In principle we could also work with a temporary directory on the backend, which we clean up from time to time. But that really only makes sense if the backend provides a native move function. So also not really a generic solution.

frankenjoe commented 7 months ago

So I think, I would not automatically suppress the backend error if validate=True. If something goes wrong on the backend, the user will have to act anyway.

I agree.

The use-case I have in mind is to delete a local copy of file if and only if the file was successfully uploaded to the backend. This can now be achieved by:

try:
    backend.put_file(src_path, dst_path, validate=True)
    os.remove(src_path)   # if we arrive here we can safely remove src_path
except Exception:
    pass

Now this is even safe if the backend does not raise an error in case something goes wrong during the upload.

hagenw commented 7 months ago

OK, looks like a valid example to me. This means we are ready here and don't need further changes to the code, correct?

frankenjoe commented 7 months ago

I would say yes.