archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: There is an upstream issue with PIM validation of PREMIS 3 #655

Open ross-spencer opened 5 years ago

ross-spencer commented 5 years ago

Expected behaviour

PIM validation works for PREMIS 3.

Current behaviour

I have logged this issue with the people at FCLA.

Steps to reproduce

Given an Archivematica qa/1.x/1.10 METS file:

You can download this schematron.

And run it with this script:

# -*- coding: utf-8 -*-

"""Test module to understand PREMIS schematron from FCLA."""

from __future__ import print_function
from lxml import etree, isoschematron

SCHEMATRON = "pim.stron"
METS = "mets.xml"

def _get_schematron():
    """Return a schematron object."""
    # Binary mode lets lxml honour the XML encoding declaration itself.
    with open(SCHEMATRON, "rb") as f:
        sct_doc = etree.parse(f)
    return isoschematron.Schematron(sct_doc, store_report=True)

def report_failures(doc):
    """Validate the XML and return the result."""
    schematron = _get_schematron()
    result = schematron.validate(doc)
    report = schematron.validation_report
    return result, report

def get_failures(report):
    """Look for failures in a schematron output.

    Looking for examples like the following:

    <svrl:failed-assert
        test="count(//mets:xmlData/pre:object) +
        count(//mets:xmlData/pre:agent) +
        count(//mets:xmlData/pre:rights) +
        count(//mets:xmlData/pre:event) &gt; 0 or
        count(//mets:xmlData/pre:premis) = 1"
        location="/*[local-name()='mets' and
        namespace-uri()='http://www.loc.gov/METS/']">
      <svrl:text>
        There must be PREMIS elements inside the METS container.
      </svrl:text>
    </svrl:failed-assert>

    """
    failures = report.xpath(
        "//svrl:failed-assert",
        namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"}
    )
    out = ""
    for res in failures:
        out = "{}{}\n".format(
            out,
            res.find(
                "svrl:text",
                namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"}
            ).text.strip(),
        )
    return out

# Binary mode lets lxml honour the XML encoding declaration itself.
with open(METS, "rb") as valid:
    doc = etree.parse(valid)

result, report = report_failures(doc)
if report:
    print("METS, valid:", result, get_failures(report).strip())

And you will see the error: METS, valid: False There must be PREMIS elements inside the METS container.
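When debugging a failure like this, it can help to see which assertion fired and where, not only its message. The following is a minimal sketch, not part of the original script: `describe_failures` is a hypothetical helper, and the sample report is the failed assertion quoted in the script's docstring, wrapped in an assumed `svrl:schematron-output` root so that it parses on its own. It uses the standard library's ElementTree to stay self-contained.

```python
import xml.etree.ElementTree as ET

SVRL_URI = "http://purl.oclc.org/dsdl/svrl"
SVRL_NS = {"svrl": SVRL_URI}

# Sample SVRL report: the failed assertion quoted in the docstring above,
# wrapped in a root element (an assumption) so it is well-formed by itself.
SAMPLE_REPORT = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:failed-assert
      test="count(//mets:xmlData/pre:object) +
      count(//mets:xmlData/pre:agent) +
      count(//mets:xmlData/pre:rights) +
      count(//mets:xmlData/pre:event) &gt; 0 or
      count(//mets:xmlData/pre:premis) = 1"
      location="/*[local-name()='mets' and
      namespace-uri()='http://www.loc.gov/METS/']">
    <svrl:text>
      There must be PREMIS elements inside the METS container.
    </svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>"""


def describe_failures(report_root):
    """Return message, test and location for every failed assertion."""
    details = []
    for failed in report_root.iter("{%s}failed-assert" % SVRL_URI):
        text = failed.find("svrl:text", SVRL_NS)
        message = text.text if text is not None and text.text else ""
        details.append({
            # Collapse runs of whitespace so each field prints on one line.
            "message": " ".join(message.split()),
            "test": " ".join(failed.get("test", "").split()),
            "location": " ".join(failed.get("location", "").split()),
        })
    return details


details = describe_failures(ET.fromstring(SAMPLE_REPORT))
for item in details:
    print(item["message"])
    print("  test:", item["test"])
    print("  location:", item["location"])
```

With the script above, the same helper should accept the lxml report via `describe_failures(report.getroot())`, since lxml elements offer the same `iter`, `find` and `get` methods.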

Initially I thought a specific type of input was causing this, and I couldn't quite see the pattern. Given more time to understand the issue, it stands to reason that any METS file output by Archivematica right now triggers it.
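One plausible explanation, which this issue does not confirm, is that the schematron binds its `pre` prefix to the PREMIS 2 namespace while the METS carries PREMIS 3 elements. The sketch below re-evaluates the failing assertion against a hypothetical, heavily trimmed METS fragment with the prefix bound to each namespace in turn. The fragment, the helper name `premis_assert_holds`, and the assumption about the schematron's binding are illustrative only; the two namespace URIs are the published PREMIS 2 and PREMIS 3 ones.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_V2_NS = "info:lc/xmlns/premis-v2"
PREMIS_V3_NS = "http://www.loc.gov/premis/v3"

# Hypothetical METS fragment for illustration; real files are far larger.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:premis="http://www.loc.gov/premis/v3">
  <mets:amdSec ID="amdSec_1">
    <mets:techMD ID="techMD_1">
      <mets:mdWrap MDTYPE="PREMIS:OBJECT">
        <mets:xmlData>
          <premis:object/>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
    <mets:digiprovMD ID="digiprovMD_1">
      <mets:mdWrap MDTYPE="PREMIS:EVENT">
        <mets:xmlData>
          <premis:event/>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
</mets:mets>"""


def premis_assert_holds(root, premis_ns):
    """Mirror the failing assertion with `pre` bound to `premis_ns`."""
    ns = {"mets": METS_NS, "pre": premis_ns}

    def count(name):
        # Counts mets:xmlData/pre:<name> anywhere below the root element.
        return len(root.findall(".//mets:xmlData/pre:" + name, ns))

    entities = sum(count(n) for n in ("object", "agent", "rights", "event"))
    return entities > 0 or count("premis") == 1


root = ET.fromstring(SAMPLE_METS)
print("pre bound to PREMIS 2:", premis_assert_holds(root, PREMIS_V2_NS))
print("pre bound to PREMIS 3:", premis_assert_holds(root, PREMIS_V3_NS))
```

If a real METS file behaves the same way, that would point at the namespace binding in the schematron rather than at how the PREMIS entities are wrapped.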

Users can also recreate this issue via the Archivematica or PIM user interface, and they will see:

image

Your environment (version of Archivematica, OS version, etc)

qa/1.x


For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

evelynPM commented 4 years ago

I confirmed that the issue is still happening in PIM with METS files produced by AM 1.10.1-qa, but the METS files validate in other online validators such as http://xmlvalidator.new-studio.org/ and https://www.freeformatter.com/xml-validator-xsd.html. Unlike those validators, PIM also verifies that the way the PREMIS entities are embedded in METS follows guidelines. However, we haven't made any changes to the way we wrap PREMIS in METS, so that check shouldn't be what is failing.