artefactual-sdps / enduro

Designed to automate the processing of transfers in multiple Archivematica pipelines.
https://enduro.readthedocs.io/
Apache License 2.0

Feature: send a BagIt bag to Archivematica for preservation #805

Open jraddaoui opened 6 months ago

jraddaoui commented 6 months ago

Is your feature request related to a problem? Please describe.

Currently, all transfers started in Archivematica use the zipfile transfer type:

https://github.com/artefactual-sdps/enduro/blob/main/internal/am/start_transfer.go#L49

This is not an issue in the current implementation, where the transfer is always bundled as a ZIP file. However, it limits the extensibility of the workflow. Consider the SFA fork, where the transfer is transformed into a zipped bag in the pre-processing activities:

https://github.com/artefactual-sdps/enduro-sfa/pull/4/files#diff-ae98fc39bbc9e053ec8d1d2ed56184cd9ba7ea280d3e72975617da81c3cfadd3

Describe the solution you'd like

Provide a configuration setting like the one used for the processing configuration:

https://github.com/artefactual-sdps/enduro/blob/main/enduro.toml#L99
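
As a rough illustration of what such a setting could look like on the Go side, here is a minimal sketch. The `Config` struct, the `TransferType` field, and `ResolveTransferType` are hypothetical names, not existing Enduro code, and the allowed values are limited to the two transfer types mentioned in this issue.

```go
package main

import "fmt"

// Config is a hypothetical slice of the Archivematica settings.
// TransferType is an assumed option name, not an existing Enduro setting.
type Config struct {
	TransferType string
}

// allowedTransferTypes only lists the two values mentioned in this issue.
var allowedTransferTypes = map[string]bool{
	"zipfile":    true,
	"zipped bag": true,
}

// ResolveTransferType returns the configured transfer type, falling back to
// "zipfile" (the current behaviour) when the setting is empty.
func ResolveTransferType(c Config) (string, error) {
	if c.TransferType == "" {
		return "zipfile", nil
	}
	if !allowedTransferTypes[c.TransferType] {
		return "", fmt.Errorf("unsupported transfer type %q", c.TransferType)
	}
	return c.TransferType, nil
}

func main() {
	transferType, err := ResolveTransferType(Config{TransferType: "zipped bag"})
	fmt.Println(transferType, err)
}
```

Defaulting to "zipfile" would keep existing deployments unchanged unless the setting is provided.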

Describe alternatives you've considered

Allow changing the transfer type value in the workflow. If we use child workflows to manage that extensibility, another option would be to indicate the transfer type in the child workflow result.

djjuhasz commented 6 months ago

@jraddaoui I agree it would be better to allow different transfer types to be sent to Archivematica, but in the current processing workflow the bundle activity will convert an incoming Bag transfer into a standard transfer which is then zipped and sent to AM (or a3m). Allowing a Bag transfer to be sent to Archivematica will require removing the bundle activity from the AM workflow or updating it to support multiple output transfer types.

djjuhasz commented 6 months ago

Note: the conversion of Bags -> standard transfer is a decision that was made for the a3m preservation engine, and I decided to retain this convention when adding Archivematica as a preservation engine option.

jraddaoui commented 6 months ago

I'll create another issue about the bundle activity. This is all looking ahead to having an extensible pre-processing option, and it will help if we have a child workflow for those activities later on. Then we should discuss where the bundle activity should be located (if it's needed at all); looking at the conceptual design, bundling seems like a responsibility of pre-processing. In the SFA fork we are skipping the bundle activity right now.

djjuhasz commented 6 months ago

@jraddaoui okay, but I don't see any point in making the AM transfer type configurable without addressing the bundle activity: Enduro will always deliver a zipped standard transfer to AM. In the SFA case you've already modified the Enduro code, so just changing the transfer type in the code is a simpler solution than adding a config variable.

sallain commented 3 months ago

Note from today's meeting: @djjuhasz, @jraddaoui, and @sallain to review this issue and decide what pieces of work need to be completed to support SFA and MoMA.

djjuhasz commented 3 months ago

I have a proposal for how to handle the SIP format delivered to the preservation system by Enduro. My proposal is based on the supposition that a BagIt Bag is the best SIP format for Enduro to send to the preservation system, but recognizes that a3m currently can't process Bagged SIPs.

I believe a BagIt Bag should be the preferred SIP format because:

  1. It's an open standard, and Artefactual prefers implementing open standards where possible.
  2. There is existing tooling to create and validate Bags, which includes validating file checksums and the package contents vs. the manifest. Using the existing Bag tools saves us the work of implementing and maintaining our own SIP creation and validation tools.

My proposed solution for the Enduro SIP type

  1. Remove the Bundle activity from Enduro, and make repackaging the SIP (when necessary) a concern of Preprocessing. I think it makes sense for any code that modifies the structure or contents of a transfer to be implemented in Preprocessing.
  2. Specify that Preprocessing must deliver a BagIt Bag to Enduro upon successful completion. This allows us to run a Bag validator to confirm that the SIP produced by Preprocessing meets the expectations of Enduro (and the preservation system).
  3. Move the unbag function from the Bundle activity to a standalone "Unbag Activity", which is only run in the a3m workflow. The Unbag Activity will convert the BagIt SIP from Preprocessing to an Archivematica standard transfer. If a3m implements Bag processing in the future, the Unbag Activity can be removed from Enduro.
  4. In the Archivematica preservation workflow, always send a Bagged SIP to Archivematica for preservation. Update the start transfer API request "Type" value to "zipped bag".
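
The branching in the four steps above can be sketched as plain Go. All names here (`Engine`, `Plan`, `PlanPreservation`, and the step labels) are hypothetical and only illustrate the proposal; they are not Enduro's workflow code.

```go
package main

import (
	"errors"
	"fmt"
)

// Engine identifies the preservation system (hypothetical type).
type Engine string

const (
	EngineA3M Engine = "a3m"
	EngineAM  Engine = "am"
)

// Plan lists the steps Enduro would run after Preprocessing, and the
// start transfer "Type" value when the engine is Archivematica.
type Plan struct {
	Steps        []string
	TransferType string
}

// PlanPreservation rejects SIPs that are not valid Bags, unbags only for
// a3m, and sends a zipped Bag to Archivematica.
func PlanPreservation(engine Engine, bagIsValid bool) (Plan, error) {
	if !bagIsValid {
		return Plan{}, errors.New("preprocessing must deliver a valid BagIt Bag")
	}
	switch engine {
	case EngineA3M:
		// a3m cannot process Bagged SIPs yet, so convert to a standard transfer.
		return Plan{Steps: []string{"validate bag", "unbag", "send to a3m"}}, nil
	case EngineAM:
		return Plan{
			Steps:        []string{"validate bag", "start transfer"},
			TransferType: "zipped bag",
		}, nil
	default:
		return Plan{}, fmt.Errorf("unknown preservation engine %q", engine)
	}
}

func main() {
	plan, err := PlanPreservation(EngineAM, true)
	fmt.Println(plan.Steps, plan.TransferType, err)
}
```

If a3m gains Bag support later, only the `EngineA3M` branch would need to drop its "unbag" step.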

@sallain @jraddaoui what do you think? If you have a counter-proposal or any suggested modifications to my proposal, I'd love to hear your ideas.

sallain commented 3 months ago

I think that this is a good idea for the following reasons:

I also completely agree that this should all occur in pre-processing.

Here are a few things to consider:

I'm sure that there are other considerations as well, but for the most part I think that this is a solid proposal.

Diogenesoftoronto commented 3 months ago

> I'm sure that there are other considerations as well, but for the most part I think that this is a solid proposal.

I would like to outline one of the considerations that is missing here: our current way of validating bags uses a very early, and not well tested, BagIt library in Go. See https://github.com/nyudlts/go-bagit and https://github.com/nyudlts/go-bagit/issues/7#issuecomment-1613190552. It would require some work to make this a fully featured and spec-compliant bag validator.

djjuhasz commented 3 months ago

@sallain I agree that we should avoid rebagging a transfer that is submitted as a Bag and that adding Bag processing to a3m ASAP would avoid having to unbag the bag we just bagged. :P

@Diogenesoftoronto yes, good points about the https://github.com/nyudlts/go-bagit library. I was assuming we would use https://github.com/LibraryOfCongress/bagit-python for Bag validation, but it being a Python tool definitely makes it more challenging to integrate than a native Go library. It also looks like bagit-python is not being actively maintained, and requires Python 2 which was sunset in January 2020.

sallain commented 3 months ago

I was discussing this with @fiver-watson and he pointed out that there may be circumstances where a user submits a bag, but other activities in the pre-processing application make the original bag invalid (e.g. transforming or adding metadata files), meaning that the bag WOULD have to be rebagged. Just something to consider.

sallain commented 2 months ago

We spent some time last week hashing out a workflow diagram. This is the result. It can be found on the Implementation Services team Miro board.

[image: workflow diagram]

djjuhasz commented 2 months ago

@sallain the workflow diagram looks good to me. :+1: