Feature: Improve validation scripts to use Go and package XSDs

sallain commented 5 months ago

Is your feature request related to a problem? Please describe.

Pilot project work made use of Python scripts provided by SFA. In PoC#1 we did a bit of cleanup, adding some golang wrappers and handling. However, as SFA's needs evolve, the original scripts need some updating. Rather than updating the Python and having to maintain that dependency, we should rewrite them in Go so that they are aligned with Enduro's core language.

A second aspect of the validation scripts that we should amend is the use of locally stored XSDs. This was acceptable when we were only validating one type of SIP, but now we must support multiple transfer types which may comply with a different version of the XSDs. We should use the XSDs that are included in each SIP in order to ensure that the correct XSD version is used.

Describe the solution you'd like

This is a catch-all card for updating the scripts in general. The following list is a start:

[ ] Rewrite the overall validation to use golang
[ ] Use our existing tools for bagging and bag validation
[ ] Use whatever XSDs are included in the transfer for validation, instead of a local copy
[ ] Generalize the implementation so that task outcomes and details can be shown in the Enduro UI, per this separate dev card for PoC#2

More tasks can be added to this card as needed!

Describe alternatives you've considered

None

Additional context

Add any other context or screenshots about the feature request here.

jraddaoui commented 5 months ago

Use our existing tools for bagging and bag validation

This is already done as part of https://github.com/artefactual-sdps/preprocessing-sfa/pull/11. However, their repackage_sip.py script was also trying to validate the file checksums generated on Bag creation against those included in the metadata XML files.

They were generating the Bag manually, writing the manifest with the checksums from the metadata file and then validating the final Bag, which allowed to calculate the checksums only once and validate them at the same time. Now, we'll be generating the Bag automatically in the child workflow and validating it in Enduro, so we will be ignoring those metadata checksums.

To avoid generating the checksums twice, @fiver-watson suggested to validate them after Bag creation, using the generated checksums from the Bag manifest and validate them against those in metadata.xml (or UpdatedAreldaMetadata.cml), I added a TODO comment at the end of the workflow code to do that.

That will imply knowing the transformation made to the SIP during the workflow to match the file paths from both sources, which should be easy to do, but it adds a validation activity at the end of the workflow after transformation and Bag creation and I wonder if that is a good choice, considering all validation tasks will happen before and will show up in the UI in that order. They are using MD5 checksums, which are not so expensive to calculate (compared to sha algorithms), so I wonder if validating them before transformation makes more sense, even if that requires calculating them twice. @fiver-watson, @sallain thoughts?

fiver-watson commented 5 months ago

@jraddaoui I would be okay with doing it that way. However, another factor for consideration... (cc @sallain to make sure i am getting this right):

During our most recent metadata WG meeting, SFA shared what they would ideally like to see us do with the metadata.xml / UpdatedArelda file - i.e. extract the userful information and turn it into other AM inputs to be included in the SIP, as well as create a version to send to the AIS (details on that to come - waiting on a copy of their current XSD and some example input / output files). See slide 2 in this deck from SFA:

https://docs.google.com/presentation/d/1kkxvjLJ8OwdLo8FujHJQRNZ5eGOgOC6l/edit#slide=id.p2

If we proceed with this work, we will need to find and extract all the MD5 checksums early in the workflow anyway, before anything is sent to Archivematica or bagged etc. If that's the case, then it makes sense to me to use the checksum file we generated to quickly validate them as well, since we would now have a file relating each checksum to an object (per Archivematica's requirements for the checksum file).

Haven't done all the analysis yet on these potential new ingest extraction tasks, but wanted to mention this early in case it influences how we approach this issue!

jraddaoui commented 5 months ago

What's the point of that checksums file if we are passing a Bag? I think the Bag will be validated twice already (Enduro and AM), including the manifest files with the checksums.

fiver-watson commented 5 months ago

A) They want to keep track of the original checksums for chain of custody B) They want the original checksums sent to the AIS

AFAIK the bagging is our process, not something they care about. They want the same checksums that were generated during the digitization process to be maintained.

I'm not sure if there's a way to use the existing MD5 checksums in bag creation, but for this particular case if we are going to bag the content, that would be an option too.

jraddaoui commented 5 months ago

They want to keep track of the original checksums for chain of custody

We are keeping the original metadata XML file.

They want the original checksums sent to the AIS

We/They could use the ones in the metadata XML file too, or the ones from the Bag.

fiver-watson commented 5 months ago

lol, just passing on what i heard - let's discuss during sync

sallain commented 5 months ago

When we talked to SFA during that metadata WG meeting, I forgot that Archivematica will be receiving zipped bags and therefore the bag's payload manifest will contain checksums, meaning that the checksum.md5 file is not needed. What SFA needs is for the incoming checksums to be validated and to be able to pass those same checksum values to AIS. This is what I propose:

Validate the incoming checksums from metadata.xml/UpdatedAreldaMetadata.xml before bag creation
Generate the bag
Confirm that the incoming checksums match the newly-created bag checksums

Once Archivematica receives the bag, it will run a fixity check on the bag checksums, which Enduro will have already confirmed match the metadata.xml/UpdatedAreldaMetadata.xml checksums.

Archivematica will generate two PREMIS events: a message digest calculation event and a fixity check event. The first is Archivematica's independent checksum generation; the second is the comparison between that and the bag checksum. Finally, at the end of its workflow Archivematica will validate the bag as a whole, including the checksums, before sending to storage.

This actually feels pretty solid to me - checksums are validated at upload, bag generation, when sent to Archivematica, and just before storage.

As to which source Enduro then uses to pass the checksums to AIS, it doesn't really matter as long as the same algorithm is used throughout (SFA is an md5 shop) - the checksums all match, so whether AIS picks it up from the METS, the metadata file, or another source should be irrelevant.

Any concerns/is there anything I've missed?

jraddaoui commented 5 months ago

For the schema validation I thought we could try the following approach:

https://github.com/lestrrat-go/libxml2#xsd-validation

This package is a Go interface binding the libxml2 C library. Even if it should be considered alpha, it may be good enough for our needs.

Other options:

Create our own wrapper on top of xmllint.
https://github.com/terminalstatic/go-xsd-validate
https://github.com/metaleap/go-xsd
https://github.com/krolaw/xsd

jraddaoui commented 4 months ago

We discussed this at some point but adding a note here for reference. I had a chat with @sevein about using C bindings in Go (CGO), he mentioned that it has several implications:

Requires gcc, slowing down compilation time (it can be cached, but we'll need to consider that in Tilt and other envs).
It affects portability as binaries are no longer fully static and depend on external libraries, making them non-self-contained (we could embed those libs, but that may not be easy).
Alters the standard library build, potentially impacting specific functionalities (os functions using CGO instead of native Go, although it can be configured).
Introduces security risks with C code, such as memory management issues and overflows, which are not common in pure Go.

Considering all that we may be better as we are, with the benefits of running a subprocess (he mentioned the memory issues in AM running lxml/libxml2). Additionally, he suggested that we could experiment creating a WebAssembly module and running it with wazero.

jraddaoui commented 3 months ago

Moved to #39 and #40.

artefactual-sdps / preprocessing-sfa

Feature: Improve validation scripts to use Go and package XSDs #22