Use our existing tools for bagging and bag validation
This is already done as part of https://github.com/artefactual-sdps/preprocessing-sfa/pull/11. However, their repackage_sip.py script was also trying to validate the file checksums generated on Bag creation against those included in the metadata XML files.
They were generating the Bag manually, writing the manifest with the checksums from the metadata file and then validating the final Bag, which allowed them to calculate the checksums only once and validate them at the same time. Now we'll be generating the Bag automatically in the child workflow and validating it in Enduro, so we will be ignoring those metadata checksums.
To avoid generating the checksums twice, @fiver-watson suggested validating them after Bag creation, comparing the checksums from the Bag manifest against those in metadata.xml (or UpdatedAreldaMetadata.xml). I added a TODO comment at the end of the workflow code to do that.
That implies knowing the transformations made to the SIP during the workflow so we can match the file paths from both sources, which should be easy to do. However, it adds a validation activity at the end of the workflow, after transformation and Bag creation, and I wonder if that is a good choice, considering all the other validation tasks will happen before and will show up in the UI in that order. They are using MD5 checksums, which are not so expensive to calculate (compared to SHA algorithms), so I wonder if validating them before transformation makes more sense, even if that requires calculating them twice. @fiver-watson, @sallain thoughts?
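For reference, a minimal sketch of what that late validation activity could look like in Go. It assumes we already have a map of file paths to MD5 checksums parsed from metadata.xml/UpdatedAreldaMetadata.xml, and that the paths have been mapped through whatever transformations the workflow applied; both of those inputs are hypothetical here, only the BagIt manifest format is standard:

```go
package checksums

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// manifestChecksums reads a BagIt payload manifest (manifest-md5.txt),
// where each line is "<md5>  <relative path>", into a path -> checksum map.
func manifestChecksums(path string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	sums := make(map[string]string)
	s := bufio.NewScanner(f)
	for s.Scan() {
		parts := strings.Fields(s.Text())
		if len(parts) < 2 {
			continue
		}
		sums[strings.Join(parts[1:], " ")] = parts[0]
	}
	return sums, s.Err()
}

// verifyAgainstMetadata compares the Bag manifest checksums with those
// extracted from the metadata XML (how that map is built is out of scope here).
func verifyAgainstMetadata(manifest, metadata map[string]string) error {
	for path, want := range metadata {
		got, ok := manifest[path]
		if !ok {
			return fmt.Errorf("%s: not found in Bag manifest", path)
		}
		if got != want {
			return fmt.Errorf("%s: checksum mismatch (bag: %s, metadata: %s)", path, got, want)
		}
	}
	return nil
}
```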
@jraddaoui I would be okay with doing it that way. However, another factor for consideration... (cc @sallain to make sure i am getting this right):
During our most recent metadata WG meeting, SFA shared what they would ideally like to see us do with the metadata.xml / UpdatedArelda file - i.e. extract the useful information and turn it into other AM inputs to be included in the SIP, as well as create a version to send to the AIS (details on that to come - waiting on a copy of their current XSD and some example input / output files). See slide 2 in this deck from SFA:
If we proceed with this work, we will need to find and extract all the MD5 checksums early in the workflow anyway, before anything is sent to Archivematica or bagged etc. If that's the case, then it makes sense to me to use the checksum file we generated to quickly validate them as well, since we would now have a file relating each checksum to an object (per Archivematica's requirements for the checksum file).
Haven't done all the analysis yet on these potential new ingest extraction tasks, but wanted to mention this early in case it influences how we approach this issue!
What's the point of that checksums file if we are passing a Bag? I think the Bag will be validated twice already (Enduro and AM), including the manifest files with the checksums.
A) They want to keep track of the original checksums for chain of custody
B) They want the original checksums sent to the AIS
AFAIK the bagging is our process, not something they care about. They want the same checksums that were generated during the digitization process to be maintained.
I'm not sure if there's a way to use the existing MD5 checksums in Bag creation, but if we are going to bag the content anyway, that could be an option for this particular case too.
> They want to keep track of the original checksums for chain of custody

We are keeping the original metadata XML file.

> They want the original checksums sent to the AIS

We/They could use the ones in the metadata XML file too, or the ones from the Bag.
lol, just passing on what i heard - let's discuss during sync
When we talked to SFA during that metadata WG meeting, I forgot that Archivematica will be receiving zipped bags and therefore the bag's payload manifest will contain checksums, meaning that the checksum.md5 file is not needed. What SFA needs is for the incoming checksums to be validated and to be able to pass those same checksum values to AIS. This is what I propose:
Once Archivematica receives the bag, it will run a fixity check on the bag checksums, which Enduro will have already confirmed match the metadata.xml/UpdatedAreldaMetadata.xml checksums.
Archivematica will generate two PREMIS events: a message digest calculation event and a fixity check event. The first is Archivematica's independent checksum generation; the second is the comparison between that and the bag checksum. Finally, at the end of its workflow Archivematica will validate the bag as a whole, including the checksums, before sending to storage.
This actually feels pretty solid to me - checksums are validated at upload, bag generation, when sent to Archivematica, and just before storage.
As to which source Enduro then uses to pass the checksums to AIS, it doesn't really matter as long as the same algorithm is used throughout (SFA is an md5 shop) - the checksums all match, so whether AIS picks it up from the METS, the metadata file, or another source should be irrelevant.
Any concerns/is there anything I've missed?
For the schema validation I thought we could try the following approach:
https://github.com/lestrrat-go/libxml2#xsd-validation
This package is a Go interface binding to the libxml2 C library. Even though it should be considered alpha, it may be good enough for our needs.
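If we go that route, a minimal sketch based on the XSD validation example in that package's README (the file names are placeholders, not the actual SIP paths):

```go
package main

import (
	"log"
	"os"

	"github.com/lestrrat-go/libxml2"
	"github.com/lestrrat-go/libxml2/xsd"
)

func main() {
	// Placeholder paths for illustration only.
	schemaBytes, err := os.ReadFile("UpdatedArelda.xsd")
	if err != nil {
		log.Fatal(err)
	}
	schema, err := xsd.Parse(schemaBytes)
	if err != nil {
		log.Fatal(err)
	}
	defer schema.Free()

	docBytes, err := os.ReadFile("UpdatedAreldaMetadata.xml")
	if err != nil {
		log.Fatal(err)
	}
	doc, err := libxml2.Parse(docBytes)
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Free()

	if err := schema.Validate(doc); err != nil {
		// The returned error aggregates the individual validation failures.
		for _, e := range err.(xsd.SchemaValidationError).Errors() {
			log.Printf("validation error: %s", e)
		}
		os.Exit(1)
	}
	log.Println("document is valid")
}
```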
Other options:
We discussed this at some point but adding a note here for reference. I had a chat with @sevein about using C bindings in Go (CGO); he mentioned that it has several implications:

- It requires `gcc`, slowing down compilation time (it can be cached, but we'll need to consider that in Tilt and other envs).
- Some `os` functions end up using CGO instead of native Go (although it can be configured).

Considering all that, we may be better off as we are, with the benefits of running a subprocess (he mentioned the memory issues in AM running lxml/libxml2). Additionally, he suggested that we could experiment with creating a WebAssembly module and running it with wazero.
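For comparison, the subprocess approach could be as simple as shelling out to xmllint (shipped with libxml2); a sketch, assuming xmllint is on the PATH and with placeholder file names:

```go
package main

import (
	"fmt"
	"os/exec"
)

// validateWithXmllint validates an XML document against an XSD schema by
// running xmllint as a subprocess, avoiding CGO entirely.
func validateWithXmllint(xsdPath, xmlPath string) error {
	cmd := exec.Command("xmllint", "--noout", "--schema", xsdPath, xmlPath)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("schema validation failed: %w\n%s", err, out)
	}
	return nil
}

func main() {
	// Placeholder file names for illustration only.
	if err := validateWithXmllint("UpdatedArelda.xsd", "UpdatedAreldaMetadata.xml"); err != nil {
		fmt.Println(err)
	}
}
```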
Moved to #39 and #40.
Is your feature request related to a problem? Please describe.
Pilot project work made use of Python scripts provided by SFA. In PoC#1 we did a bit of cleanup, adding some golang wrappers and handling. However, as SFA's needs evolve, the original scripts need some updating. Rather than updating the Python and having to maintain that dependency, we should rewrite them in Go so that they are aligned with Enduro's core language.
A second aspect of the validation scripts that we should amend is the use of locally stored XSDs. This was acceptable when we were only validating one type of SIP, but now we must support multiple transfer types, which may comply with different versions of the XSDs. We should use the XSDs included in each SIP to ensure that the correct XSD version is used.
Describe the solution you'd like
This is a catch-all card for updating the scripts in general. The following list is a start:
More tasks can be added to this card as needed!
Describe alternatives you've considered
None
Additional context
Add any other context or screenshots about the feature request here.