Closed: sallain closed this issue 1 week ago
@sallain I've added a "verify manifest" activity to the preprocessing workflow in PR #38. I invented validation messages based on the other validation activities, but you may want to change the text.
Successful validation
Name: "Verify SIP manifest", Message:
SIP contents match manifest
Validation failure message
Name: "Verify SIP manifest", Message:
Content error: SIP contents do not match "UpdatedAreldaMetadata.xml":
Missing file: d_0000001/00000001.jp2
Unexpected file: d_0000001/extra_file.txt
Note that the file paths are relative, but I can change them to absolute paths or add more of the relative path segments (e.g. "content/content/d_0000001/...")
@sallain P.S. I sandwiched "verify manifest" between the "validate structure" and "validate file format" activities — let me know if I should move it.
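As an illustration only, the two messages above could be produced by a plain set comparison of relative paths. This is a minimal sketch, not the code from PR #38; the function name `manifest_report` and its parameters are hypothetical.

```python
def manifest_report(manifest_name, manifest_paths, sip_paths):
    """Compare the paths listed in a manifest with the paths found in a SIP.

    Hypothetical helper: returns the success message when both sets match,
    otherwise a multi-line failure message in the format quoted above.
    """
    listed = set(manifest_paths)
    found = set(sip_paths)

    # Files the manifest lists but the SIP does not contain.
    lines = [f"Missing file: {p}" for p in sorted(listed - found)]
    # Files the SIP contains but the manifest does not list.
    lines += [f"Unexpected file: {p}" for p in sorted(found - listed)]

    if not lines:
        return "SIP contents match manifest"
    header = f'Content error: SIP contents do not match "{manifest_name}":'
    return "\n".join([header, *lines])
```

Sorting the differences keeps the message stable between runs, which makes the output easier to compare during QA.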
Per our discussion, the check should also confirm that files located in the header directory are present.
The new "Verify SIP Manifest" check seems to be happening too early - it is failing on valid samples, because as other checks progress, Enduro creates a metadata directory and a PREMIS XML file to start capturing events... only this PREMIS file is not in the manifest, and therefore all samples fail.
Attaching a sample that does not include a metadata directory or a premis.xml file. It nevertheless fails on the "Verify SIP manifest" activity with the following error:
Content error: SIP contents do not match "metadata.xml": Unexpected file: metadata/premis.xml
Test sample attached: little_vecteur_sip.zip
Even more confusing - the little_vecteur_aip.zip PASSES, though I don't understand why the behavior is any different here.
I think we should verify the SIP manifest before we create the premis.xml file - the manifest only lists the original contents of the package. The difference between the vecteur sip and aip is probably due to the difference in structure between the two SIP types, and how the respective manifests list the package contents.
Seems to be working better now - though there are some ongoing issues with PREMIS writes to the AM METS, this issue is no longer a blocker.
The one edge case that passed manifest validation: adding an extra directory (but no extra files).
In the content directory, I added an additional d_0000002 folder next to the actual d_0000001 directory with the files. I expected this to trigger a manifest error, but it didn't.
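One plausible explanation, assuming the check compares file paths only (I have not confirmed how the activity walks the SIP): an empty directory contributes no file paths, so the set of files is identical with and without it. The sketch below reproduces that on a temporary directory. The helpers `list_files`, `list_empty_dirs` and `demo` are hypothetical names.

```python
import tempfile
from pathlib import Path


def list_files(root):
    """Return the relative paths of all regular files under root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def list_empty_dirs(root):
    """Return the relative paths of directories that contain nothing."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_dir() and not any(p.iterdir())
    }


def demo():
    """Build a tiny SIP-like tree, then add an empty d_0000002 directory."""
    with tempfile.TemporaryDirectory() as tmp:
        content = Path(tmp) / "content"
        (content / "d_0000001").mkdir(parents=True)
        (content / "d_0000001" / "00000001.jp2").write_bytes(b"")
        before = list_files(tmp)

        (content / "d_0000002").mkdir()  # the extra, empty directory
        after = list_files(tmp)
        return before, after, list_empty_dirs(tmp)
```

Here `before` and `after` are equal, so a file-based comparison cannot see the extra directory. Something like `list_empty_dirs` would be needed if empty directories should be reported.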
As I understand it, checking the manifest is ensuring that:
By those criteria, QA has passed successfully.
Open question for the team, however: SHOULD we be adding extra checks for unexpected directories, even if they are empty???
@fiver-watson I think it's up to SFA to decide if unexpected or missing directories (with no contents) should halt ingest.
Is your feature request related to a problem? Please describe.
The metadata.xml file (for Digital born SIPs and Digitized SIPs) or UpdatedAreldaMetadata.xml file (for Digitized AIPs) contains a list of all objects in the content directory of the SIP. It is possible that the metadata file and the actual objects in the SIP might not match up. We should confirm that all of the objects listed in the metadata exist in the SIP, and vice versa.
Describe the solution you'd like
There are a few different ways to complete this check. One option is to use the checksums provided in the metadata file. In a separate issue (#22), I proposed that we should validate the incoming checksums from metadata.xml or UpdatedAreldaMetadata.xml before bag creation. We could extend this to accomplish two validations: ensuring that all of the items in the metadata file exist, and ensuring that they are uncorrupted.
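A rough sketch of the combined validation, assuming the manifest has already been parsed into a mapping of relative path to algorithm name and expected digest. The function name `verify_entries`, the mapping shape and the "Checksum mismatch" wording are all hypothetical, not taken from the existing validation activities.

```python
import hashlib
from pathlib import Path


def verify_entries(sip_root, entries):
    """Check that every listed file exists and matches its checksum.

    entries maps a relative path to (algorithm, expected hex digest).
    Returns a list of problem messages; an empty list means all passed.
    """
    problems = []
    for rel_path, (algorithm, expected) in sorted(entries.items()):
        path = Path(sip_root) / rel_path
        if not path.is_file():
            problems.append(f"Missing file: {rel_path}")
            continue
        # Reads the whole file into memory; large files would need to be
        # hashed in chunks instead.
        digest = hashlib.new(algorithm.lower(), path.read_bytes()).hexdigest()
        if digest != expected.lower():
            problems.append(f"Checksum mismatch: {rel_path}")
    return problems
```

This covers only the manifest-to-SIP direction. Files present in the SIP but absent from the manifest would still need a separate walk of the package.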
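This check should go both ways: every object listed in the metadata file should exist in the SIP, and every object in the SIP should be listed in the metadata file.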
Describe alternatives you've considered
If we wanted to separate the checksum check from the object check, we could just use the file listing within metadata.xml or UpdatedAreldaMetadata.xml. Each file in the SIP is listed in <ordner><content><datei>, with the full name including extension in the <name> element.
Additional context
Add any other context or screenshots about the feature request here.
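To illustrate the file-listing alternative described above, here is a minimal sketch that collects relative paths from nested <ordner> and <datei> elements. The sample XML is invented for the example. It assumes each folder's name sits in a <name> child of its <ordner> and that there is no XML namespace. Real metadata files are likely namespaced, so the tag lookups would need adjusting. `listed_files` and `SAMPLE` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real metadata.xml layout may differ.
SAMPLE = """<paket>
  <ordner>
    <name>content</name>
    <ordner>
      <name>d_0000001</name>
      <datei><name>00000001.jp2</name></datei>
      <datei><name>00000002.jp2</name></datei>
    </ordner>
  </ordner>
</paket>"""


def listed_files(xml_text):
    """Return the relative path of every <datei> in the listing."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(folder, prefix):
        # Path segments down to and including this folder.
        here = prefix + [folder.findtext("name")]
        for datei in folder.findall("datei"):
            paths.add("/".join(here + [datei.findtext("name")]))
        for sub in folder.findall("ordner"):
            walk(sub, here)

    for folder in root.findall("ordner"):
        walk(folder, [])
    return paths
```

The resulting set can then be compared against the files actually present in the package, in both directions.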