DILCISBoard / E-ARK-AIP

E-ARK AIP Specification
https://earkaip.dilcis.eu/
Creative Commons Attribution 4.0 International

Recommendation for storing and versioning AIPs without the use of BagIt #83

Open shsdev opened 2 months ago

shsdev commented 2 months ago

For the versioning of AIPs the plan is to recommend the use of OCFL.

Assuming the following structure for an original submission information package example.sip.001.tar stored as version v00000 and an AIP urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar stored as version v00001:

├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v00000
│   └── example.sip.001.tar
└── v00001
    └── urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar

The inventory.json could look as follows:

{
    "digestAlgorithm": "sha512",
    "fixity": {
        "md5": {
            "f97f90b429a84bdd0bfb88b6d037b351": [
                "v00000/example.sip.001.tar"
            ],
            "ed7f48df08c5c1f134c02dcfd9ff6098": [
                "v00001/urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar"
            ]
        },
        "sha256": {
            "47782fd210d3933bda9045d923c4370b2632c39826096bfa86b2860d07397742": [
                "v00000/example.sip.001.tar"
            ],
            "ac1d70378c2be8cf818b490292b51ccabba55b2192004eda67a7822f60072612": [
                "v00001/urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar"
            ]
        }
    },
    "head": "v00001",
    "id": "urn:uuid:81bd3aa2-7350-44f6-ad54-d8181858605a",
    "manifest": {
        "c676d28b0d0a5aa345aea7995fbdf36b06981923af85d2234de5157c11173c032435fc3da4f513c717bba4bb912d0f9c7165750c46ed821bffdc22def79606c7": [
            "v00000/example.sip.001.tar"
        ],
        "1aef284e408d991ba6abf9973c4bcb02c1a2a94c951cd119cd040249878ac2d2ce790d78c0416327fd92efaab8f0536be1b0fb0a2f17a8cbe0069b72d16f7988": [
            "v00001/urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar"
        ]
    },
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "versions": {
        "v00000": {
            "created": "2024-04-09T21:08:54Z",
            "message": "Original SIP",
            "state": {
                "c676d28b0d0a5aa345aea7995fbdf36b06981923af85d2234de5157c11173c032435fc3da4f513c717bba4bb912d0f9c7165750c46ed821bffdc22def79606c7": [
                    "v00000/example.sip.001.tar"
                ]
            }
        },
        "v00001": {
            "created": "2024-04-09T21:08:55Z",
            "message": "AIP (ingest)",
            "state": {
                "1aef284e408d991ba6abf9973c4bcb02c1a2a94c951cd119cd040249878ac2d2ce790d78c0416327fd92efaab8f0536be1b0fb0a2f17a8cbe0069b72d16f7988": [
                    "v00001/urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar"
                ]
            }
        }
    }
}
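As a rough illustration of how such an inventory could be checked, the following sketch recomputes the digest of every content path listed under "manifest" and reports mismatches. It assumes the inventory.json sits in the object root as in the layout above and that the value of "digestAlgorithm" is a name Python's hashlib accepts (such as sha512); the function names are hypothetical and not part of any OCFL tooling.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(object_root: Path) -> list[str]:
    """Return the content paths that are missing or whose digest differs from the manifest."""
    inventory = json.loads((object_root / "inventory.json").read_text(encoding="utf-8"))
    algorithm = inventory["digestAlgorithm"]  # e.g. "sha512", as in the example above
    failures = []
    for expected, content_paths in inventory["manifest"].items():
        for content_path in content_paths:
            target = object_root / content_path
            if not target.is_file() or file_digest(target, algorithm) != expected.lower():
                failures.append(content_path)
    return failures
```

An empty result means every file listed in the manifest is present and matches its recorded digest. The "fixity" block with md5 and sha256 could be checked the same way, one algorithm at a time.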

Note that this fixity information overlaps with what is already provided in the METS.

The question for voting is whether the container files example.sip.001.tar for the original SIP and urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar for the AIP should be wrapped in a BagIt container, for example:

├── bag-info.txt
├── bagit.txt
├── data
│   ├── metadata
│   │   ├── descriptive
│   │   │   └── ead.xml
│   │   ├── metadata.json
│   │   └── preservation
│   │       └── premis.xml
│   ├── METS.xml
│   ├── processing.log
│   ├── representations
│   │   └── 1710641a-bfa1-48cc-b41f-4220606679ae
│   │       ├── data
│   │       │   └── example.pdf
│   │       ├── metadata
│   │       │   └── preservation
│   │       │       └── premis.xml
│   │       └── METS.xml
│   └── state.json
├── manifest-sha256.txt
├── manifest-sha512.txt
├── tagmanifest-sha256.txt
└── tagmanifest-sha512.txt

Note that this way fixity information would possibly be provided in up to four layers:

To reduce complexity and redundancy, the proposal is to store the E-ARK information packages as TAR files instead of wrapping them as BagIt containers as shown in the example above.

The E-ARK AIP container file urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a.tar would then have the following form, for example:

urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a
├── metadata
│   ├── descriptive
│   │   └── ead.xml
│   ├── metadata.json
│   ├── other
│   │   ├── processing.log
│   │   └── state.json
│   └── preservation
│       ├── premis_20240409-230854Z_event_sipcreation.xml
│       └── premis_20240409-230854Z_event_ingest.xml
├── METS.xml
├── representations
│   └── 09502a26-f822-407c-ad0a-4d7e64052a91
│       ├── data
│       │   └── example.pdf
│       ├── metadata
│       │   └── preservation
│       │       └── premis.xml
│       └── METS.xml
└── schemas
    ├── csip.xsd
    ├── ead3.xsd
    ├── IP.xsd
    ├── mets_1_11.xsd
    ├── premis-v2-2.xsd
    └── xlink.xsd

The suggestion is:

As part of the general AIP recommendations, the proposal is to store the E-ARK information packages as TAR files instead of wrapping them as BagIt containers.
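A minimal sketch of what plain TAR packaging could look like, using Python's standard tarfile module. It assumes the package folder name (for example urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a) is reused as both the TAR file name and the single top-level entry, as in the example above; the function name is hypothetical.

```python
import tarfile
from pathlib import Path


def pack_information_package(package_dir: Path, output_dir: Path) -> Path:
    """Write <package folder name>.tar into output_dir and return its path.

    output_dir should live outside package_dir so the archive does not
    end up containing itself.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    tar_path = output_dir / f"{package_dir.name}.tar"
    # Mode "w" writes an uncompressed TAR; the package folder becomes
    # the single top-level entry inside the archive.
    with tarfile.open(tar_path, "w") as archive:
        archive.add(package_dir, arcname=package_dir.name)
    return tar_path
```

No BagIt tag files or manifests are produced here; fixity of the packaged files would rely on the METS inside the package and, where used, on the storage layer.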

luis100 commented 2 months ago

Hello,

I do not quite understand the use cases that are meant to be supported. In my view, AIPs live independently of SIPs but might be affected by them, just as they might be affected by other operations such as metadata enrichment, file format conversion, or even redaction or destruction by retention processes. Note also that AIPs might be created from SIPs or by operations, such as when creating an AIC (an AIP relative to a collection or a case, with only metadata).

Also, I am not very comfortable with this level of complexity. Versioning should be an accessory layer that you can take or leave without much change to the overall format of the AIP.

As such, I would like to suggest the following:

  1. Leave the AIP layout and storage as they were.
  2. Have versioning support nested in an optional sub-folder under the AIP (named, for example, "versions"), even at the cost of redundancy.
  3. Versions use the OCFL format and are always expanded (i.e. no use of TAR or other containers), so it is clear what changes occurred since the previous version.
  4. The usual AIP folders, such as metadata and representations, will always provide the "current" version of the AIP.
  5. SIPs or any other "submission" might be archived in another folder (e.g. named "submissions"), with the submission datetime (in ISO 8601 with Z timezone but without ':') recorded, as SIP updates tend to have the same file name. The SIP should be archived inside the AIP as it was received.
  6. Relationships between submissions, actions and AIP versions should be described in PREMIS. As such, it should be possible for PREMIS events to refer to a specific version of an AIP as the source or outcome object.
  7. Versions and Submissions are optional; their use will have a significant impact on storage and processing, so it should be defined by the implementation and easy to switch on and off.
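To illustrate point 5, here is a small sketch of how a submission folder name could be derived from the time a SIP was received. It assumes that "without ':'" means replacing each colon with a hyphen, as in the folder name 2024-04-10T11-57-00Z used in the layout example; the function name is hypothetical.

```python
from datetime import datetime, timezone


def submission_folder_name(received: datetime) -> str:
    """Format a timezone-aware datetime as UTC ISO 8601 with ':' replaced by '-'."""
    if received.tzinfo is None:
        # A naive datetime has no defined offset, so it cannot be converted to UTC safely.
        raise ValueError("received must be timezone-aware")
    return received.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")


# A SIP received at 11:57:00 UTC on 10 April 2024:
print(submission_folder_name(datetime(2024, 4, 10, 11, 57, 0, tzinfo=timezone.utc)))
# prints 2024-04-10T11-57-00Z
```

Converting to UTC before formatting keeps folder names comparable and sortable regardless of the local timezone of the receiving system.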

Given this, I would like to suggest the following layout of an AIP with two versions, where the first version was created from a SIP and the second version is just an update of the descriptive metadata.

urn+uuid+81bd3aa2-7350-44f6-ad54-d8181858605a
├── metadata
│   ├── descriptive
│   │   └── ead.xml
│   ├── metadata.json
│   ├── other
│   │   ├── processing.log
│   │   └── state.json
│   └── preservation
│       ├── premis_20240409-230854Z_event_sipcreation.xml
│       └── premis_20240409-230854Z_event_ingest.xml
├── METS.xml
├── representations
│   └── 09502a26-f822-407c-ad0a-4d7e64052a91
│       ├── data
│       │   └── example.pdf
│       ├── metadata
│       │   └── preservation
│       │       └── premis.xml
│       └── METS.xml
├── schemas
│   ├── csip.xsd
│   ├── ead3.xsd
│   ├── IP.xsd
│   ├── mets_1_11.xsd
│   ├── premis-v2-2.xsd
│   └── xlink.xsd
├── versions
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json
│   ├── inventory.json.sha512
│   ├── v00000
│   │   └── metadata/descriptive/ead.xml
│   │       (and all other files in the AIP as originally received from the SIP)
│   └── v00001
│       └── metadata/descriptive/ead.xml
│           (descriptive metadata updated in the digital preservation archive)
└── submissions
    └── 2024-04-10T11-57-00Z
        └── example.sip.001.zip

carlwilson commented 2 months ago

I don't have strong feelings, and I don't have skin in the game. I'm not quite able to make a like-for-like comparison because one or two points in @luis100's response aren't clear to me. I think you're suggesting no BagIt and no use of TAR for OCFL versions. I'm inclined to agree about BagIt; I'm not sure we gain much from its use, and much of the metadata is redundant. I agree that using TAR archives in versions is perhaps a bit messy and obscures which content/metadata changed.

"Versions and Submissions are optional, their use will have a significant impact in storage and processing and their use should be defined by implementation and easy to switch on and off."

Storage and processing impact IS an implementation detail to some degree. Institutional policy, budget and choice will also be factors. Making them optional appears to be a sensible decision.

shsdev commented 2 months ago

Note that the OCFL format does not belong to the AIP. It is just one possible way to store the original SIP (in this example as v00000) if you want to keep it; the versioned AIPs are separate instances of AIPs. Since E-ARK3, we have moved away from integrating the versions into the AIP.

Packaging as TAR/ZIP/etc. is a technical implementation detail that depends on storage system and requirements. It is adequate if the packages need to be transferred and may be a good approach if you have a tape system where the AIPs are stored for the long-term. However, if the AIPs are still being updated, the continuous re-packaging causes a lot of processing and redundancy.

The question here was about the use of BagIt to wrap E-ARK AIPs. In E-ARK, the manifest is included in the METS, but BagIt has a simpler, non-XML format (the payload manifest) for this purpose.
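For comparison, a rough sketch of how the lines of a BagIt payload manifest (such as manifest-sha256.txt) could be generated: one line per file under data/, consisting of the checksum, whitespace, and the file path relative to the bag root. The function name is hypothetical, and the sketch does not handle file names that contain special characters such as line breaks.

```python
import hashlib
from pathlib import Path


def payload_manifest_lines(bag_root: Path, algorithm: str = "sha256") -> list[str]:
    """Return '<digest>  <path>' lines for every file under bag_root/data, sorted by path."""
    lines = []
    for path in sorted((bag_root / "data").rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; large payloads would need chunked reads.
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            # Paths are relative to the bag root and use '/' separators.
            lines.append(f"{digest}  {path.relative_to(bag_root).as_posix()}")
    return lines
```

Writing the returned lines to manifest-sha256.txt in the bag root would give the plain-text counterpart of the file checksums that E-ARK records in the METS.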

shsdev commented 4 weeks ago

The AIP working group discussed the use of BagIt and recommends taking it out of the main recommendation for AIP packaging. Instead, it would be moved to an appendix explaining how to wrap E-ARK information packages using BagIt. As optional BagIt packaging is also relevant for the SIP, the appendix should be added to the CSIP rather than to the AIP. This decision is independent of the use of OCFL, which will be dealt with in a separate issue.

The suggestion is:

Board members' acknowledgment of the issue: Tick the box in front of your name to indicate that you have looked at the suggestion.

Voting (Decision making will be carried out on the basis of majority voting by all eligible members of the Board. In the case of a tied vote, decisions will be made at the discretion of the Chair)

Tick the box in front of your name to say yes to the suggestion.