UAL-RE / ReBACH

Python-based tool to enable data preservation to a cloud-hosted storage solution
MIT License

Upload bag with changed curation files #61

Open zoidy opened 1 year ago



Description

Currently, if a preservation bag already exists on preservation storage, only changes in the Figshare data/metadata cause ReBACH to detect that the bag being processed differs from the corresponding bag on preservation storage (via the hash in the bag name).

This means that if any part of the bagged content that does not come from the Figshare side changes (e.g., curation metadata), ReBACH will show a message saying that the bag being created is a duplicate of an existing bag and will not upload it to preservation storage. This is sometimes undesirable, since curation files may be added or updated later. However, replacing the existing bag whenever curation data changes is not always desirable either, since the change could be the result of an error.
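To illustrate the behavior described above, here is a minimal sketch of name-based duplicate detection. It is not ReBACH's actual code: the function names, the bag name layout, and the choice of SHA-256 over a JSON dump are all assumptions made for illustration. The only point it shows is that a hash computed from Figshare-side metadata alone cannot change when curation files change.

```python
import hashlib
import json


def bag_name(article_id: int, version: int, figshare_metadata: dict) -> str:
    """Build a bag name from Figshare-side metadata only.

    Hypothetical naming scheme (assumption): <article id>_v<version>_<hash>.
    Curation files are not an input, so they cannot affect the name.
    """
    canonical = json.dumps(figshare_metadata, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    return f"{article_id}_v{version:02d}_{digest}"


def is_duplicate(name: str, existing_names: set) -> bool:
    """Treat a bag as a duplicate when its name already exists in storage."""
    return name in existing_names


# Same Figshare metadata before and after a curation file is updated:
metadata = {"title": "Dataset", "files": ["a.csv"]}
stored = {bag_name(123, 1, metadata)}

# The curation change is invisible to the name, so the bag is skipped.
print(is_duplicate(bag_name(123, 1, metadata), stored))
```

Under these assumptions, any change that lives outside `figshare_metadata` leaves the name unchanged, which is why the updated bag is reported as a duplicate.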

Suggested Implementation

Implement in two phases:

  1. Add a check to see whether the bag to be uploaded is a different size than the one in preservation storage when the hash in the bag name is the same. Display a warning if the sizes differ (to allow checking the logs).

  2. Add a config and/or command-line flag to enable overwriting existing bags with the same name.
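The two phases above could be combined into a single decision step, sketched below. This is only an illustration under stated assumptions: `decide_upload`, its parameters, and the returned action strings are hypothetical names, and the sketch assumes that both bag sizes can be obtained before upload.

```python
from typing import Optional


def decide_upload(
    name_exists: bool,
    local_size: int,
    remote_size: Optional[int],
    overwrite: bool,
) -> str:
    """Return the action to take for a bag (all names are hypothetical).

    name_exists: a bag with the same name (same hash) is already stored.
    local_size:  size in bytes of the bag about to be uploaded.
    remote_size: size in bytes of the stored bag, or None if unknown.
    overwrite:   value of the config/command-line overwrite flag (phase 2).
    """
    if not name_exists:
        return "upload"      # new bag, nothing to compare against
    if overwrite:
        return "overwrite"   # phase 2: explicitly allowed to replace
    if remote_size is not None and local_size != remote_size:
        return "warn"        # phase 1: same hash, different size
    return "skip"            # true duplicate as far as we can tell


# Same name, different size, overwrite flag not set:
print(decide_upload(True, 2048, 1024, overwrite=False))
```

Checking the overwrite flag before the size comparison means that an explicit request to overwrite never depends on the size information being available.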

Edit: phase 1 isn't possible because Dart handles both bag creation and upload, so there is no easy way to check the bag size before it is uploaded. Therefore, the only way to upload updated curation files is to overwrite the bag without the check (overwriting is already possible by setting the flag in the bagger config).