artefactual-sdps / preprocessing-sfa

preprocessing-sfa is an Enduro preprocessing workflow for SFA SIPs

Feature: combine METS and metadata files for delivery to AIS #77

Closed: sallain closed this issue 1 week ago

sallain commented 2 weeks ago

Is your feature request related to a problem? Please describe.

DPS must deliver both the METS file and the metadata.xml/UpdatedAreldaMetadata.xml file to the AIS during the post-preservation workflow. However, AIS only expects one file.

Describe the solution you'd like

Combine the METS and the metadata.xml/UpdatedAreldaMetadata.xml files together into one metadata file. For migration files (files identified as DigitizedAIP or BornDigitalAIP), UpdatedAreldaMetadata.xml should be used.
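The selection rule above could be sketched roughly as follows. This is only an illustration in Python, not code from the repository: the function name `select_metadata_name` and the idea of passing the type as a plain string are assumptions. Only the two AIP type names and the two file names come from this issue.

```python
# Hypothetical helper; only the type names and file names come from the issue.
MIGRATION_TYPES = {"DigitizedAIP", "BornDigitalAIP"}


def select_metadata_name(package_type: str) -> str:
    """Return the name of the Arelda metadata file to combine with the METS."""
    if package_type in MIGRATION_TYPES:
        # Migration files carry the updated Arelda metadata.
        return "UpdatedAreldaMetadata.xml"
    return "metadata.xml"
```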

The newly created file should be named with the prefix AIS_ followed by the accession number, which can be found in the metadata.xml (or UpdatedAreldaMetadata.xml, but should be the same value) under <ablieferungsnummer>. There should only be one ablieferungsnummer per metadata file. The number is formatted as 2002/05; the / should be replaced with an _. The final file name will be AIS_2002_05.
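As a rough sketch of the naming rule described above (Python, for illustration only): the function name `ais_file_name` is hypothetical, and matching the element by local name regardless of namespace is an assumption about how the metadata file is structured.

```python
import xml.etree.ElementTree as ET


def ais_file_name(metadata_xml: str) -> str:
    """Build the AIS file name from the single <ablieferungsnummer> value."""
    root = ET.fromstring(metadata_xml)
    # Compare local names only, so a namespaced element also matches.
    numbers = [
        (el.text or "").strip()
        for el in root.iter()
        if isinstance(el.tag, str)
        and el.tag.rsplit("}", 1)[-1] == "ablieferungsnummer"
    ]
    if len(numbers) != 1 or not numbers[0]:
        raise ValueError(
            f"expected exactly one ablieferungsnummer, found {len(numbers)}"
        )
    # "2002/05" becomes "AIS_2002_05".
    return "AIS_" + numbers[0].replace("/", "_")
```

For example, a metadata document containing `<ablieferungsnummer>2002/05</ablieferungsnummer>` would give `AIS_2002_05`.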

Within the file, SFA would like the contents of metadata.xml/UpdatedAreldaMetadata.xml first, since it contains the higher hierarchies, and then the METS. The contents of the two files should probably be tagged in some way but I think it can be pretty simple - perhaps just indicating the source file.
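One simple way to tag the contents by source file might look like the sketch below (Python, illustration only). The wrapper element names `aisMetadata` and `source` and the `file` attribute are made up for this example and are not something SFA or AIS has specified. The sketch also assumes the inputs have no DOCTYPE and no byte order mark.

```python
import re
from xml.sax.saxutils import quoteattr

# Matches a leading XML declaration such as <?xml version="1.0"?>.
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def tag_source(name: str, xml_text: str) -> str:
    """Wrap one document in a hypothetical <source file="..."> element."""
    # Drop the declaration, which is only allowed at the very start of a file.
    body = _XML_DECL.sub("", xml_text, count=1).strip()
    return f"<source file={quoteattr(name)}>\n{body}\n</source>"


def combine_tagged(arelda_name: str, arelda_xml: str,
                   mets_name: str, mets_xml: str) -> str:
    """Return one document with the Arelda metadata first, then the METS."""
    parts = [tag_source(arelda_name, arelda_xml), tag_source(mets_name, mets_xml)]
    return "<aisMetadata>\n" + "\n".join(parts) + "\n</aisMetadata>\n"
```

Wrapping each part this way keeps the result a single-rooted XML document while leaving the two source documents untouched inside their wrappers.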

Describe alternatives you've considered

None

Additional context

There's a very real chance that, when operating at scale, the resulting file will be too big for AIS to handle; it might make sense to then limit which fields from each file we're combining into this new file. But we'll tackle that if/when it happens.

djjuhasz commented 2 weeks ago

@sallain I originally planned to try and merge the SFA Arelda metadata into the METS XML as a proper XML document with one root node and proper namespacing. I see now that SFA would like the Arelda metadata first in the document, and I've also realized that adding the Arelda XML inside the METS XML is going to be quite a bit of work. So, I've settled for now on just concatenating the two XML files with the Arelda first and the METS second. It's a work in progress (still needs testing) but I think the concatenation code should work now: https://github.com/artefactual-sdps/preprocessing-sfa/tree/dev/issue-77-combine-ais-metadata
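For reference, the plain concatenation approach described in this comment can be sketched like this. It is a Python illustration only, not the code on the linked branch, and the function and parameter names are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_files(arelda_path, mets_path, dest_path) -> Path:
    """Write the Arelda metadata followed by the METS into one file."""
    with open(dest_path, "wb") as dest:
        # Order matters: Arelda first, METS second.
        for src_path in (arelda_path, mets_path):
            with open(src_path, "rb") as src:
                # Stream in chunks so large files are not loaded into memory.
                shutil.copyfileobj(src, dest)
    return Path(dest_path)
```

Note that byte-for-byte concatenation leaves two root elements (and possibly two XML declarations) in the output, so the result is not a single well-formed XML document.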

djjuhasz commented 1 week ago

Attached is a zipped AIS package created by Enduro with the combined AIS metadata file: search-md_little_digitized_sip-15da98b9-5953-46dd-8dc9-2b31ee544bff.zip

Note that the current name of the AIS metadata file is "AIS_1974_47_3578513" with no file extension. From the description above I think that's what the filename should be, but let me know if I should add an extension (e.g. ".xml").

sallain commented 3 days ago

Results as expected!