artefactual-sdps / preprocessing-sfa

preprocessing-sfa is an Enduro preprocessing workflow for SFA SIPs

Feature: ensure that each file listed in metadata.xml actually exists in the SIP #35

Closed by sallain 1 week ago

sallain commented 3 months ago

Is your feature request related to a problem? Please describe.

The metadata.xml file (for Digital born SIPs and Digitized SIPs) or UpdatedAreldaMetadata.xml file (for Digitized AIPs) contains a list of all objects in the content directory of the SIP. It is possible that the metadata file and the actual objects in the SIP might not match up. We should confirm that all of the objects listed in the metadata exist in the SIP, and vice versa.

Describe the solution you'd like

There are a few different ways to complete this check. One option is to use the checksums provided in the metadata file. In a separate issue (#22), I proposed that we should validate the incoming checksums from metadata.xml or UpdatedAreldaMetadata.xml before bag creation. We could extend this to accomplish two validations: ensuring that all of the items in the metadata file exist, and ensuring that they are uncorrupted.
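A minimal sketch of the combined check described above, assuming the expected checksums have already been read from the metadata file into a dictionary. The function name `verify_checksums` and the shape of `expected` are hypothetical and are not taken from the project's code:

```python
import hashlib
from pathlib import Path


def verify_checksums(content_dir, expected):
    """Check that every listed file exists and matches its checksum.

    expected maps a relative path to (algorithm, hex digest), e.g.
    {"d_0000001/00000001.jp2": ("MD5", "dc29...")}  (assumed shape).
    Returns two sorted lists: (missing, mismatched).
    """
    missing, mismatched = [], []
    for rel_path, (algorithm, digest) in sorted(expected.items()):
        path = Path(content_dir) / rel_path
        if not path.is_file():
            missing.append(rel_path)
            continue
        # Assumes the algorithm name is one hashlib recognises once
        # lower-cased and stripped of hyphens (MD5, SHA-1, SHA-256, ...).
        hasher = hashlib.new(algorithm.lower().replace("-", ""))
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != digest.lower():
            mismatched.append(rel_path)
    return missing, mismatched
```

A file that is listed but absent lands in `missing`, and a file that is present but altered lands in `mismatched`, so one pass covers both validations. Files on disk that the metadata does not list would still need a separate comparison.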

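This check should go both ways: every file listed in the metadata should exist in the SIP, and every file in the SIP should be listed in the metadata.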

Describe alternatives you've considered

If we wanted to separate the checksum check from the object check, we could just use the file listing within metadata.xml or UpdatedAreldaMetadata.xml. Each file in the SIP is listed in <ordner><content><datei>, with the full name including extension in the <name> element.

<ordner>
    <name>content</name>
    <originalName>content</originalName>
    <ordner>
        <name>d_0000001</name>
        <originalName>d_0000001</originalName>
        <datei id="_zodSTSD0nv05CpOp6JoV3X">
            <name>00000001.jp2</name>
            <originalName>00000001.jp2</originalName>
            <pruefalgorithmus>MD5</pruefalgorithmus>
            <pruefsumme>dc29291d0e2a18363d0efd2ec2fe81c9</pruefsumme>
        </datei>
        <datei id="_rlPKJX9ZcAl4ooc4IfoIkM">
            <name>00000002.jp2</name>
            <originalName>00000002.jp2</originalName>
            <pruefalgorithmus>MD5</pruefalgorithmus>
            <pruefsumme>9093907ec32f06fe595e0f14982c4bf0</pruefsumme>
        </datei>
    </ordner>
</ordner>
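As a rough illustration of reading that listing, the sketch below walks nested `<ordner>` elements and collects each `<datei>` with its path relative to the content directory. It assumes the fragment above is parsed on its own, with the content `<ordner>` as the root and no XML namespace. A real metadata.xml may declare a namespace and wrap this structure in other elements, which would need extra handling. The function name `list_manifest_files` is hypothetical:

```python
import xml.etree.ElementTree as ET


def list_manifest_files(xml_text):
    """Return {relative_path: (algorithm, checksum)} for every <datei>.

    Assumes the root element is the top-level content <ordner> and that
    element names are not namespaced (an assumption, not a project fact).
    """
    root = ET.fromstring(xml_text)
    files = {}

    def walk(ordner, prefix):
        # Files directly inside this folder.
        for datei in ordner.findall("datei"):
            name = datei.findtext("name")
            files[prefix + name] = (
                datei.findtext("pruefalgorithmus"),
                datei.findtext("pruefsumme"),
            )
        # Recurse into sub-folders, extending the relative path.
        for child in ordner.findall("ordner"):
            walk(child, prefix + child.findtext("name") + "/")

    walk(root, "")
    return files
```

Applied to the sample above, the keys would be `d_0000001/00000001.jp2` and `d_0000001/00000002.jp2`, each paired with its `MD5` algorithm and checksum value.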

jraddaoui commented 3 months ago

Related: https://github.com/artefactual-sdps/preprocessing-sfa/issues/22#issuecomment-2145339181

djjuhasz commented 3 months ago

@sallain I've added a "verify manifest" activity to the preprocessing workflow in PR #38. I invented validation messages based on the other validation activities, but you may want to change the text.

Successful validation

Name: "Verify SIP manifest", Message:

SIP contents match manifest

Validation failure message

Name: "Verify SIP manifest", Message:

Content error: SIP contents do not match "UpdatedAreldaMetadata.xml":
Missing file: d_0000001/00000001.jp2
Unexpected file: d_0000001/extra_file.txt

Note that the file paths are relative, but I can change them to absolute paths or add more of the relative path segments (e.g. "content/content/d_0000001/...").
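For illustration only, a two-way comparison that yields messages in the format above could be sketched as follows. This is not the code from PR #38. The function names are hypothetical, and the manifest paths are assumed to be already extracted and relative to the same directory that is walked on disk:

```python
import os


def compare_manifest(content_dir, manifest_paths):
    """Return (missing, unexpected) relative file paths, both sorted."""
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(content_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), content_dir)
            on_disk.add(rel.replace(os.sep, "/"))
    listed = set(manifest_paths)
    # Listed but absent on disk, then present on disk but not listed.
    return sorted(listed - on_disk), sorted(on_disk - listed)


def format_failure(manifest_name, missing, unexpected):
    """Build a failure message modelled on the example in this comment."""
    lines = [f'Content error: SIP contents do not match "{manifest_name}":']
    lines += [f"Missing file: {path}" for path in missing]
    lines += [f"Unexpected file: {path}" for path in unexpected]
    return "\n".join(lines)
```

Because only files are collected from `os.walk`, an empty directory on disk produces no entry in either list.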

djjuhasz commented 3 months ago

@sallain P.S. I sandwiched "verify manifest" between the "validate structure" and "validate file format" activities — let me know if I should move it.

sallain commented 2 months ago

Per our discussion, the check should also confirm that files located in the header directory are present.

fiver-watson commented 2 months ago

The new "Verify SIP Manifest" check seems to be running too late: it fails on valid samples because, as other checks progress, Enduro creates a metadata directory and a PREMIS XML file to start capturing events. That PREMIS file is not in the manifest, so all samples fail.

Attaching a sample that does not include a metadata directory or a premis.xml file. It nevertheless fails on the "Verify SIP manifest" activity with the following error:

Content error: SIP contents do not match "metadata.xml": Unexpected file: metadata/premis.xml

fiver-watson commented 2 months ago

Test sample attached: little_vecteur_sip.zip

fiver-watson commented 2 months ago

Even more confusing - the little_vecteur_aip.zip PASSES, though I don't understand why the behavior is any different here.

djjuhasz commented 2 months ago

I think we should verify the SIP manifest before we create the premis.xml file - the manifest only lists the original contents of the package. The difference between the vecteur sip and aip is probably due to the difference in structure between the two SIP types, and how the respective manifests list the package contents.

fiver-watson commented 2 months ago

Seems to be working better now. There are still some ongoing issues with PREMIS writes to the AM METS, but this issue is no longer a blocker.

The one edge case that passed manifest validation: adding an extra directory (but no extra files).

In the content directory, I added an additional d_0000002 folder next to the actual d_0000001 directory with the files. I expected this to trigger a manifest error, but it didn't.

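As I understand it, checking the manifest ensures that every file listed in the manifest is present in the SIP, and that the SIP contains no files that are not listed in the manifest.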

By those criteria, QA has passed successfully.

Open question for the team, however: SHOULD we be adding extra checks for unexpected directories, even if they are empty???

djjuhasz commented 2 months ago

@fiver-watson I think it's up to SFA to decide if unexpected or missing directories (with no contents) should halt ingest.
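If SFA does decide that directories should be checked, a directory-level comparison might look like the sketch below. It assumes the expected directory paths (for example, one per nested `<ordner>` element) have already been collected from the manifest. `compare_directories` is a hypothetical name, not an existing function in the workflow:

```python
import os


def compare_directories(content_dir, manifest_dirs):
    """Return (missing, unexpected) relative directory paths, both sorted.

    Unlike a file-only comparison, this also notices empty directories.
    """
    on_disk = set()
    for dirpath, dirnames, _filenames in os.walk(content_dir):
        for name in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, name), content_dir)
            on_disk.add(rel.replace(os.sep, "/"))
    listed = set(manifest_dirs)
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

With the edge case described above, an empty `d_0000002` folder next to `d_0000001` would appear in the second list, while a file-only check would pass.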