artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Problem: PREMIS files are not validated #951

Open sallain opened 4 months ago

sallain commented 4 months ago

Is your feature request related to a problem? Please describe.

Whether generated by Enduro (through a child workflow) or included in a SIP, PREMIS XML files should be validated before the package is sent to preservation. Archivematica/a3m can parse a PREMIS file to add the file's events to the AIP METS, but this happens quite late in the AM/a3m workflow. Validating the PREMIS file beforehand should help avoid errors at that late stage.

Describe the solution you'd like

Add a new activity to validate the premis.xml file against the PREMIS v3 schema before sending to AM/a3m, ensuring that it's well-formed and valid.

PREMIS files generated by Enduro child workflows should always be validated. A PREMIS file included in a transfer may have been validated in advance, so validating it again might not be strictly necessary. A reasonable approach might be to validate any PREMIS file in the SIP's metadata directory, regardless of origin, as this is the file that will be picked up by Archivematica/a3m.
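As a rough illustration of the "well-formed" half of this check, here is a minimal Go sketch that uses only the standard library. The function names `checkWellFormed` and `checkString` are hypothetical and not part of Enduro. Go's `encoding/xml` does not validate against an XSD and is more lenient than a strict XML parser in some cases, so validation against the PREMIS v3 schema would still need a separate tool or library.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// checkWellFormed reads r token by token and returns the first syntax
// error reported by the decoder. It checks well-formedness only; it does
// NOT validate the document against the PREMIS XSD.
func checkWellFormed(r io.Reader) error {
	dec := xml.NewDecoder(r)
	sawElement := false
	for {
		tok, err := dec.Token()
		if errors.Is(err, io.EOF) {
			if !sawElement {
				return errors.New("no root element found")
			}
			return nil
		}
		if err != nil {
			return fmt.Errorf("not well-formed: %w", err)
		}
		if _, ok := tok.(xml.StartElement); ok {
			sawElement = true
		}
	}
}

// checkString is a convenience wrapper for in-memory documents.
func checkString(doc string) error {
	return checkWellFormed(strings.NewReader(doc))
}

func main() {
	// A tiny stand-in document; a real premis.xml would contain objects,
	// events and agents.
	doc := `<premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0"></premis:premis>`
	if err := checkString(doc); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("well-formed")
}
```

A check like this is cheap enough to run on every PREMIS file found in the metadata directory before the heavier schema validation step.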

Describe alternatives you've considered

None

Additional context

sallain commented 4 months ago

premis-annotated-multi.zip

sallain commented 4 months ago

Note: I've only listed validating against the schema as a first iteration. Other checks might include:

fiver-watson commented 3 months ago

Note also that this will be used repeatedly by any Enduro user performing custom ingest activities that might generate PREMIS, and by anyone submitting their own PREMIS files with a SIP. For this reason, it should ideally be implemented as a reusable Temporal activity rather than a client-specific child workflow.

fiver-watson commented 3 months ago

@mcantelon also, as discussed in the meeting today:

Let's make this a general "Validate XML" task for its first pass, one that accepts two inputs: the file to validate and the schema file to validate it against.
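One possible shape for such a task, sketched in Go under stated assumptions: the names `Params`, `Result`, `Validator`, `ValidatorFunc`, `staticValidator` and `ValidateXML` are all hypothetical, the Temporal SDK registration is left out, and the schema check is delegated to the `xmllint` command-line tool, which is assumed to be installed. This is not necessarily how the linked PR implements it.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// Params and Result are hypothetical input and output types for the task.
type Params struct {
	XMLPath string // file to validate, e.g. metadata/premis.xml
	XSDPath string // schema file to validate against
}

type Result struct {
	Failures []string // empty when the file is valid
}

// Validator abstracts the schema check so the task can be tested
// without an external tool.
type Validator interface {
	Validate(ctx context.Context, xmlPath, xsdPath string) (string, error)
}

// ValidatorFunc adapts a plain function to the Validator interface.
type ValidatorFunc func(ctx context.Context, xmlPath, xsdPath string) (string, error)

func (f ValidatorFunc) Validate(ctx context.Context, xmlPath, xsdPath string) (string, error) {
	return f(ctx, xmlPath, xsdPath)
}

// staticValidator returns a stub that always yields the given output.
func staticValidator(output string, err error) Validator {
	return ValidatorFunc(func(context.Context, string, string) (string, error) {
		return output, err
	})
}

// xmllintValidator shells out to xmllint (assumed to be on PATH).
type xmllintValidator struct{}

func (xmllintValidator) Validate(ctx context.Context, xmlPath, xsdPath string) (string, error) {
	cmd := exec.CommandContext(ctx, "xmllint", "--noout", "--schema", xsdPath, xmlPath)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// xmllint ran and exited non-zero: report its stderr as failures.
		// This sketch does not distinguish between exit codes.
		return stderr.String(), nil
	}
	return "", err // nil on success, or an error if xmllint could not start
}

// ValidateXML is the task itself: it takes a file and a schema as inputs.
type ValidateXML struct{ V Validator }

func (a ValidateXML) Execute(ctx context.Context, p Params) (*Result, error) {
	if p.XMLPath == "" || p.XSDPath == "" {
		return nil, errors.New("both XMLPath and XSDPath are required")
	}
	out, err := a.V.Validate(ctx, p.XMLPath, p.XSDPath)
	if err != nil {
		return nil, fmt.Errorf("validate XML: %w", err)
	}
	res := &Result{}
	for _, line := range strings.Split(out, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			res.Failures = append(res.Failures, line)
		}
	}
	return res, nil
}

func main() {
	// Hypothetical paths; replace with a real PREMIS file and schema.
	task := ValidateXML{V: xmllintValidator{}}
	res, err := task.Execute(context.Background(), Params{
		XMLPath: "metadata/premis.xml",
		XSDPath: "premis.xsd",
	})
	if err != nil {
		fmt.Println("could not run validation:", err)
		return
	}
	fmt.Println("failures:", res.Failures)
}
```

Returning validation failures in the result, rather than as an error, lets the calling workflow decide whether an invalid file should fail the ingest or only be reported. Keeping the schema path as an input is what makes the task reusable beyond PREMIS.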

mcantelon commented 2 months ago

PR for CR: https://github.com/artefactual-sdps/temporal-activities/pull/21

jraddaoui commented 2 months ago

There are some comments about this issue in https://github.com/artefactual-sdps/preprocessing-sfa/issues/22#issuecomment-2223129249.