DILCISBoard / E-ARK-CSIP

E-ARK Common Specification for Information Packages
http://earkcsip.dilcis.eu
Creative Commons Attribution 4.0 International

It should be clear which version is a CSIP/SIP/AIP/DIP compatible with #703

Open luis100 opened 1 year ago

luis100 commented 1 year ago

Related to comment https://github.com/keeps/commons-ip/issues/166#issuecomment-1618865639

When validating a CSIP/SIP/AIP/DIP, it is difficult to know which version of the specification to validate against, or at least which version of the specification the package was created for.

luis100 commented 1 year ago

Use cases:

  1. A SIP package was created for specification version 2.1.0, and it is not compatible with 2.0.4. This package is to be submitted into an archive which supports 2.0.4 but not 2.1.0.
  2. A SIP package was created for specification version 2.1.0 but it is also compatible with 2.0.4. This package is to be submitted into an archive which supports 2.0.4 but not 2.1.0.

karinbredenberg commented 1 year ago

That is usually achieved by having the profile pointer go to the correct version of the profile. So instead of pointing to the general one, point to the versioned one (https://earkcsip.dilcis.eu/profile/E-ARK-CSIP-v2-0-4.xml). The METS_Profile @ID attribute has been used to give the version number. For the different CITS there are version numbers included in the values used in CSIP4.
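As a rough illustration of reading the version from such a pointer, here is a minimal Python sketch. It assumes the file name always ends in `-v<major>-<minor>-<patch>.xml`, as in the URL above; the function name `version_from_profile_url` is hypothetical and not part of any E-ARK tooling.

```python
import re
from typing import Optional, Tuple

# Assumption: versioned profile file names end in "-v<major>-<minor>-<patch>.xml".
_VERSIONED_PROFILE = re.compile(r"-v(\d+)-(\d+)-(\d+)\.xml$", re.IGNORECASE)


def version_from_profile_url(url: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) for a versioned profile URL, else None."""
    match = _VERSIONED_PROFILE.search(url)
    if match is None:
        # The general, unversioned profile URL carries no version information.
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch
```

With the general profile URL the function returns `None`, which is exactly the ambiguity this issue is about.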

prettybits commented 1 year ago

While that would be one way to mark the intended version, it doesn't seem to be reliably possible at the moment, as I can't find an equivalent versioned document URL for the profiles of v2.1.0 of the specifications. Only "archived" profiles of previous specification versions get version numbers in their file names, but the current version would need one as well. It would be good to have that, plus a general requirement to use stable URLs as part of CSIP6.

On the side of the profile itself, is it documented anywhere that the @ID attribute of the METS_Profile root element should be used to hold the version? METS Profiles as currently documented don't say anything about intended values besides specifying it as a generic xml:id and there aren't any explicit provisions about versioning as stated in the section on Related profiles:

A profile may indicate its relationship with other METS profiles. METS profiles are not explicitly versioned, as implementations may exist that use older editions of METS profiles. Therefore a new version of a profile must be registered as a new profile. In this case, the RELATIONSHIP attribute should be used to indicate that a profile supersedes a profile already registered with the Library of Congress Network Development and MARC Standards Office. For each related profile, the profile should specify a URI for the related profile and the nature of the relationship between the current profile and the related profile.

The URI element in the profiles also currently doesn't use versioned URLs regardless of URL name, as a consequence of the aforementioned:

<URI LOCTYPE="URL" ASSIGNEDBY="local">https://earkcsip.dilcis.eu/profile/E-ARK-CSIP.xml</URI>

This is equally the case for the vocabularies which I think should be thought of as bound to particular versions as well?

For the different CITS there are version numbers included in the values used in CSIP4.

Some of the allowed values contain version strings; I assume the other entries at the beginning are all legacy and shouldn't be used for new packages? If additional (validation) resources are available (e.g. an XML schema for the CITS ERMS v2.1.0), the same considerations should apply, I believe.

luis100 commented 1 year ago

In my opinion, pointing to the patch version does not make much sense, but the specific version should be stated. Just noting that the PREMIS schema seems to do a good job with namespace and version.

https://www.loc.gov/standards/premis/v3/premis-v3-0.xsd

The namespace is "http://www.loc.gov/premis/v3", which identifies the major version, and we assume schemas will maintain backward compatibility within the major version. PREMIS also states the version in an attribute.

<xs:complexType name="premisComplexType">
  <xs:sequence>
    <xs:element ref="object" maxOccurs="unbounded"/>
    <xs:element ref="event" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="agent" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="rights" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute name="version" type="version3" use="required"/>
</xs:complexType>

<!-- 
************** version definition
 -->
<xs:simpleType name="version3">
  <xs:restriction base="xs:string">
    <xs:enumeration value="3.0"/>
  </xs:restriction>
</xs:simpleType>

The same could be accomplished in the METS profile, by pointing to the METS profile major version and adding an attribute to state the specific version used.

In the meantime @karinbredenberg's solution could be used, but as @prettybits states, we should publish the latest version under its own version name and add information about the intended use to the specification.

carlwilson commented 12 months ago

The specifications have a fair number of "moving parts" some of which are versioned, some of which are not. Even with the existing state of play, this causes problems as:

The official registry of METS profiles does not support profile versioning. See https://www.loc.gov/standards/mets/profile_docs/mets.profile.v2-0.html#related_profile. This means the profile has no convenient attribute to record the version.

E-ARK uses the optional ID attribute of the root element, which isn't used consistently:

Old versions of the IP specifications are archived and made available under versioned file names. Using the CSIP as an example:

This approach has been introduced as needed and is inconsistent, at least where the current version is concerned.

The versioning of METS Profiles is made trickier by their lack of support for version numbering. Creating an extension to the profile schema to do so seems an extreme solution. Continuing to use the ID attribute is the simplest solution, but it should be used consistently. The proposal is that all specifications record ONLY the version number in the root METS_Profile element ID attribute: <METS_Profile ID="V2.1.0">. Parsers of the version number should be agnostic to case, and indeed, any prefix due to the existing instances. The version number should be a valid XML ID value.

The URI/URL for the profiles should reflect the version number in the URL path, as should every other resource. This should be consistently applied across all URL forms unless there is a compelling reason not to do so. A proposed style for the URL is to version for the major and minor numbers only.

PROPOSED CHANGES

All profiles to use the root element ID attribute consistently:

Treatment of the leading v should not take case into account, so for the above examples v2.1.0 and V2.1.0 should both parse successfully. In practice it will be necessary to remove all alphabetic prefixes because of the existing longer ones that include the package type, e.g. SIPV2.1.0.
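The parsing rule described here could look roughly like the following Python sketch. It assumes the ID consists of an optional alphabetic prefix followed by a dotted `major.minor.patch` number; the function name `parse_profile_id` is hypothetical.

```python
import re
from typing import Tuple

# Assumption: an optional run of letters (e.g. "v", "V", "SIPV") precedes
# a dotted three-part version number.
_PROFILE_ID = re.compile(r"^[A-Za-z]*(\d+)\.(\d+)\.(\d+)$")


def parse_profile_id(profile_id: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from a METS_Profile ID attribute value."""
    match = _PROFILE_ID.match(profile_id.strip())
    if match is None:
        raise ValueError(f"Unrecognised profile ID: {profile_id!r}")
    # Any alphabetic prefix is discarded, so case never matters.
    return tuple(int(part) for part in match.groups())
```

Under this sketch `V2.1.0`, `v2.1.0` and `SIPV2.1.0` all yield `(2, 1, 0)`. Note that the leading letter is what keeps the value a valid XML ID, since an XML ID cannot start with a digit.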

The URI/URL for the profiles should reflect the version number in the URL path, as should every other resource. This should be consistently applied across all URL forms unless there is a compelling reason not to do so. A proposed style for the URL is to use the explicit version in the filename part of the path and add a major version to the path part, similar to PREMIS listed by @luis100 above.

https://earkcsip.dilcis.eu/profile/v2/ will resolve to a directory listing showing all the current version 2 profiles.

Each version of a profile will reside under the appropriate major version and include the full version number in the file name, e.g. below the v2 directory above:

ALL references to a profile must include a full version number to avoid ambiguous references.
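A minimal Python sketch of the proposed URL layout, with the major version as a path segment and the full version in the file name. The file name pattern is an assumption based on the archived profile name `E-ARK-CSIP-v2-0-4.xml` quoted earlier in this thread, and `versioned_profile_url` is a hypothetical helper.

```python
from typing import Tuple


def versioned_profile_url(base_url: str, profile_name: str,
                          version: Tuple[int, int, int]) -> str:
    """Build <base>/v<major>/<name>-v<major>-<minor>-<patch>.xml."""
    major, minor, patch = version
    # Assumed file name pattern, modelled on the archived profile names.
    file_name = f"{profile_name}-v{major}-{minor}-{patch}.xml"
    return f"{base_url.rstrip('/')}/v{major}/{file_name}"
```

For example, `versioned_profile_url("https://earkcsip.dilcis.eu/profile", "E-ARK-CSIP", (2, 0, 4))` gives `https://earkcsip.dilcis.eu/profile/v2/E-ARK-CSIP-v2-0-4.xml`.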

The supporting extension schema, vocabularies and vocabulary RNG schema will be versioned in an identical manner, e.g.

While we don't have different versions of these the changes will be made to retain consistency and open the possibility of future new versions. The extension schema namespace URI WILL NOT be versioned.

The vocabularies will be versioned as follows:

and so on. For completeness, the underlying vocabulary RNG schema will be versioned as well. There are currently no plans to amend these.

karinbredenberg commented 12 months ago

The issue is going to be discussed by the DILCIS Board

shsdev commented 9 months ago

Additional comments received from @prettybits during the Salamanca event.

It should be clear exactly which version of METS Profiles was used. For this purpose, it would be good to have a standalone attribute for the METS Profile ID. The profiles should also always state the exact version number; a "latest" URL should not be used in a METS XML, as it will then be unclear - especially after a while - which version was used.

jmaferreira commented 9 months ago

Hi everyone,

I read through all of the comments and I have to say that I'm a bit lost here. In my understanding this topic has to do with choosing the correct package validation procedure. When processing an information package (IP) we need to parse the METS file placed in the root folder of the IP, and to accomplish that we need to know which rules it should be validated against.

This means that the METS file should clearly identify the METS Profile it adheres to so that the right validator/parser is invoked by the validation process.

The way this is done is by specifying the value of the <mets @PROFILE> attribute on the METS file and not by making changes to the METS Profile itself.

In the following example found on the SIP GitHub repository (https://github.com/DILCISBoard/E-ARK-SIP/tree/rel/v2.1.1/examples) we can see that the profile that is identified to validate this METS file is http://www.dasboard.eu/specifications/sip/v03/METS.xml (obsolete BTW).

<mets 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns="http://www.loc.gov/METS/" 
    xmlns:xlink="http://www.w3.org/1999/xlink" 
    OBJID="a46ab3d0-c710-4d73-b58d-e93e30b53a80" 
    TYPE="ERMS" 
    xlink:CONTENTTYPESPECIFICATION="SMURFERMS" 
    PROFILE="http://www.dasboard.eu/specifications/sip/v03/METS.xml" 
    xsi:schemaLocation="http://www.loc.gov/METS/ schemas/mets.xsd" 
    LABEL="root level METS file for an IP">
    <metsHdr CREATEDATE="2017-01-31T13:07:22.6970809+02:00" RECORDSTATUS="NEW" LASTMODDATE="2017-01-31T13:07:22.6970809+02:00">
...

I would argue that this should be used as the token to identify the right METS parser. I don't think it should be necessary to get the linked METS profile and inspect its ID to get the token that we need.
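To make the idea concrete, here is a small Python sketch that reads the PROFILE attribute from the root element using only the standard library. It is an illustration rather than how any existing validator works, and `read_mets_profile` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Optional

METS_NS = "http://www.loc.gov/METS/"


def read_mets_profile(mets_xml: str) -> Optional[str]:
    """Return the PROFILE attribute of the root <mets> element, if present."""
    root = ET.fromstring(mets_xml)
    if root.tag != f"{{{METS_NS}}}mets":
        raise ValueError(f"Root element is not a METS <mets>: {root.tag}")
    # PROFILE is an unqualified attribute, so no namespace prefix is needed.
    return root.get("PROFILE")
```

For the example above this would return `http://www.dasboard.eu/specifications/sip/v03/METS.xml`, which the validation process could then use to pick a parser.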

That being said, I believe that having a well-defined @ID attribute on the METS profile is a good idea and something we should work on. I just don't think it is mandatory to fulfil the original goal of this issue.

The METS file generators out there should be enhanced to clearly identify the METS profile URL. The existing guides and specifications should stress this in their text.

@luis100 Am I missing something here?

luis100 commented 9 months ago

I think you are missing some aspects. One is that profiles were not being published with an explicit version; they were renamed with the version only when archived (i.e. when a new version came out), so the URL for the latest version was always versionless, which creates issues. This is the profile/spec publication part of the issue.

Also, the METS Profile version and the "METS parser", or more generally the specification version and the specification validator, may not be one-to-one. There are changes in the specification and in the METS Profile that should be versioned but that don't necessarily require a new validator implementation. Generally, PATCH versions should not require changes in the validator implementation. So we may point to the MINOR version: instead of pointing to https://earkcsip.dilcis.eu/profile/v2/E-ARK-CSIP-v2-0-4.xml we could point to https://earkcsip.dilcis.eu/profile/v2/E-ARK-CSIP-v2-0.xml and have the profile itself keep its PATCH version and changelog.
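The mapping from a PATCH-level URL to a MINOR-level one can be sketched in Python as below. It assumes the `-v<major>-<minor>-<patch>.xml` file name pattern from the URLs above; `minor_profile_url` is a hypothetical helper, not an agreed convention.

```python
import re

# Assumption: a patch-level file name ends in "-v<major>-<minor>-<patch>.xml".
_PATCH_LEVEL = re.compile(r"(-v\d+-\d+)-\d+(\.xml)$")


def minor_profile_url(url: str) -> str:
    """Drop the PATCH component from a versioned profile URL, if present."""
    # URLs that are already MINOR-level do not match and are returned unchanged.
    return _PATCH_LEVEL.sub(r"\1\2", url)
```

A validator could then be selected on the MINOR-level URL alone, so a PATCH release would not require a new validator entry.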

The other issue is the "flavour" we are validating: are we validating CSIP, SIP, AIP, CITS-SIARD, CITS-ERMS? Which versions of which? We want a hierarchical validation structure, but how do we identify the versions consistently? For example, CITS-SIARD needs to use the PROFILE https://citssiard.dilcis.eu/profile/E-ARK-SIARD-ROOT.xml, which does not have a version and identifies neither whether the package is a SIP/AIP/DIP nor which version of those specifications it follows.

Finally, a side note: all implementation suggestions here point to including this in the METS profile URL that is inside the METS file. Please note that knowing the profiles and versions is necessary to select the piece of code that will validate this file, so we will need to read the file twice: once to get the package type and spec version (although this could be an empty or malformed XML file), and again to actually do the validation. Any format detection tool (like DROID/PRONOM) will also need to use this information to identify the format and version of the file (if ZIPPED).
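The first of those two reads does not have to parse the whole document. A minimal Python sketch, using the standard library's incremental parser to stop at the root start tag; the function name `peek_root_attributes` is hypothetical, and returning `None` for unreadable input is just one possible design choice.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def peek_root_attributes(mets_bytes: bytes) -> Optional[Dict[str, str]]:
    """Return the root element's attributes without parsing the whole file.

    Returns None when the root start tag cannot be read, e.g. for an empty
    file or content that is not XML.
    """
    try:
        events = ET.iterparse(io.BytesIO(mets_bytes), events=("start",))
        for _event, element in events:
            # The first "start" event is the root element; stop right there.
            return dict(element.attrib)
    except ET.ParseError:
        return None
    return None
```

The returned dictionary would contain PROFILE and TYPE if they are present, which is enough to choose a validator before the second, full read.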

karinbredenberg commented 9 months ago

@carlwilson please look at this

jmaferreira commented 8 months ago

@karinbredenberg In order to vote for an approach on this topic, we really need to nail down the options available for voting. We will not have time during the DILCIS Board meeting to come up with ways of addressing this issue. The options available for voting should be clear to everyone.

karinbredenberg commented 8 months ago

The suggestion is in Carl's description:

The METS profile file names include the version number from the start, within a folder structure where the major version number is the folder for all its subversions. The ID attribute of the root METS_Profile element is used for all. The same is to be implemented for the extension schema and vocabularies.

karinbredenberg commented 7 months ago

The suggestion is:

Board members' acknowledgment of the issue: Tick the box in front of your name to indicate that you have looked at the suggestion.

Voting (Decision making will be carried out on the basis of majority voting by all eligible members of the Board. In the case of a tied vote, decisions will be made at the discretion of the Chair)

Tick the box in front of your name to say yes to the suggestion.

karinbredenberg commented 6 months ago

7 DILCIS Board members have acknowledged the issue. 6 DILCIS Board members agree with the solution.

The suggested solution will be included in the next version.