artefactual-sdps / preprocessing-base

Enduro preprocessing child workflow base repository

Problem: fatal error when SIP is a BagIt bag #10

Open djjuhasz opened 4 months ago

djjuhasz commented 4 months ago

If the SIP delivered to Enduro is a BagIt bag, the preprocessing workflow fails with a fatal error at the "CreateBagActivity":

CreateBagActivity: create bag: mkdir /home/preprocessing/shared/enduro3594443426/extract3168784964/ZippedBag/data: file exists

To Reproduce

Steps to reproduce the behavior:

  1. Run Enduro with preprocessing-base enabled for preprocessing
  2. Upload a zipped bag to the Enduro MinIO "sips" bucket
  3. The above error occurs

Expected behavior

In real-world implementations of preprocessing, the SIP delivered by Enduro will be modified before being bagged and sent back to Enduro. In that case, the bag will need to be updated, or its contents "unbagged", to prevent errors when validating the bag payload against its manifest.

If the "CreateBagActivity" receives a BagIt bag as input, it should return the path of the bag, without altering the bag. Preprocessing will then deliver the unaltered bag to Enduro for further processing.

Additional context

See https://github.com/artefactual-sdps/enduro/issues/805 for more information about the exchange of bags between preprocessing and Enduro.

jraddaoui commented 4 months ago

I guess making the CreateBagActivity a no-op when it receives a bag won't hurt, but I was thinking of addressing this issue with documentation only. I see this template repository as an example that is never meant to be run as part of a real workflow.

djjuhasz commented 4 months ago

I tested https://github.com/LibraryOfCongress/bagit-python to see what it does when asked to bag a bag, and it just "double bags" the contents -- everything in the original bag (including manifests, metadata files, and the data directory) is put in a "data" directory, and then it generates new manifests for everything. I don't know that I want to implement the same behaviour for the CreateBagActivity, but I think we should do better than the current error.

Another option is just to return a better error message, like "create bag: /path/to/dir is already a bag" or something similar.