artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Feature: Validate BagIt bags #947

Closed djjuhasz closed 1 month ago

djjuhasz commented 4 months ago

When Enduro receives a BagIt bag as a SIP or from pre-processing, it currently converts the bag to an Archivematica Standard Transfer and sends this to preservation as the PIP (preservation information package). If the bag was corrupted in delivery to Enduro, the PIP is still sent to preservation, which wastes system resources and network traffic.

Describe the solution you'd like

It would be better to validate bags in Enduro to catch errors such as file corruption early and avoid unnecessary system resource or network usage. Bags should be validated just before they are zipped and delivered to preservation (see diagram below), so that validation also covers any changes made during pre-processing. If bag validation fails, halt the Enduro workflow and send the SIP to a "failed SIP" location.

[Image: workflow diagram showing where the proposed bag validation step occurs]
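
As a rough illustration of the proposed step (not Enduro's actual implementation), the sketch below checks a bag's payload against its `manifest-sha256.txt` and moves the bag to a failed location when any check fails. The function names `validate_bag_fixity` and `validate_or_quarantine`, and the `failed_dir` parameter, are hypothetical. It assumes an unzipped bag directory with a SHA-256 payload manifest, and it covers only a subset of what full BagIt validation requires (tag manifests and `bag-info.txt` checks such as Payload-Oxum are skipped).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large payload files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_bag_fixity(bag_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not (bag_dir / "bagit.txt").is_file():
        problems.append("missing bagit.txt")
    manifest = bag_dir / "manifest-sha256.txt"
    if not manifest.is_file():
        problems.append("missing manifest-sha256.txt")
        return problems

    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split(maxsplit=1)  # "<checksum> <relative path>"
        if len(parts) != 2:
            problems.append(f"malformed manifest line: {line}")
            continue
        expected, rel_path = parts
        listed.add(rel_path)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing payload file: {rel_path}")
        elif sha256_of(target) != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")

    # Payload files that the manifest does not list also make the bag invalid.
    data_dir = bag_dir / "data"
    if data_dir.is_dir():
        for path in sorted(data_dir.rglob("*")):
            rel_path = path.relative_to(bag_dir).as_posix()
            if path.is_file() and rel_path not in listed:
                problems.append(f"file not in manifest: {rel_path}")
    return problems


def validate_or_quarantine(bag_dir: Path, failed_dir: Path) -> bool:
    """Hypothetical workflow step: True to continue, False to halt."""
    if validate_bag_fixity(bag_dir):
        # Assumption: the "failed SIP" location is a plain directory.
        failed_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(bag_dir), str(failed_dir / bag_dir.name))
        return False
    return True
```

A workflow would call `validate_or_quarantine` immediately before zipping the bag, and stop without contacting preservation when it returns `False`. Running the check at that point is what lets a single validation cover both received bags and bags produced by pre-processing.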

Describe alternatives you've considered

  1. Validate bags in preservation. As noted, this wastes system and network resources unnecessarily.
  2. Validate SIPs that are bags upon receipt. This would require a separate bag validation step for any bags produced by pre-processing.

Additional context

See issue #805 for more discussion of the proposed Enduro BagIt workflow.

sallain commented 1 month ago

In Enduro, we can see that the bag is being created and then validated. It's hard to look at the bag at this point, since it moves rapidly into Archivematica's processing queue, but looking at the Archivematica tasks we can see that Archivematica treats the transfer as a bag, unzips it, and verifies it. This two-part check gives me enough confidence to call this closed.