OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

EPrints-Archivematica Export Structure Compatibility #29

Closed photomedia closed 5 years ago

photomedia commented 6 years ago

I apologize in advance if this is not the best place to raise this question. If that is the case, please direct me to a more appropriate place (I noticed that this group has a number of different email lists and a Slack channel).

I was introduced to OCFL at OR2018, and I immediately saw its potential both to inform something that I am working on and to act as a bridge across repository systems. At the same OR2018, I co-presented a proposal for an export format for EPrints-to-Archivematica, for preservation. This format uses a folder structure, and ideally this folder structure would be compatible with OCFL.

Here are the details of the proposal: https://spectrum.library.concordia.ca/983933/

Right away, I see two places where there is a divergence between that and OCFL, and I want to explore/discuss it:

1) The last modified date is placed right into the folder name of the top level object in our proposal. This also means that the entire object is replicated whenever any modification is made. This is not efficient in terms of storage space, but it has its own advantages of clarity and ease of retrieval later on. The OCFL uses a sequential "version 1...x" folder with changed files only.

2) In our proposal, BagIt is used for creating manifests, whereas OCFL uses the inventory.jsonld format for this.

I suppose that I am looking to understand the reasoning behind OCFL's choices, and if these are compelling, possibly modify my proposal/plan.

ahankinson commented 6 years ago

Hi Tomasz!

OCFL joins the lineage of BagIt-inspired specs. It is most like the Moab spec developed by Richard Anderson (http://journal.code4lib.org/articles/8482), but is currently being developed to address some of the problems and potential optimizations identified by Moab's implementers. The inventory.jsonld file that sits in the root is envisioned as a means of tracking the contents of an object, in much the same way as the manifest.txt file does in a BagIt bag.
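To make the manifest comparison concrete, here is a minimal sketch of the BagIt-style idea of tracking an object's contents as checksum and path pairs. It is not the OCFL inventory format. The function name `manifest_lines` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def manifest_lines(root: str) -> list[str]:
    """Return one 'checksum  relative/path' line per file under root.

    Hypothetical helper: SHA-256 is an assumed digest choice, and the
    output only imitates the shape of a BagIt payload manifest.
    """
    base = Path(root)
    lines = []
    # Sort the paths so the listing is stable across runs.
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        checksum = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{checksum}  {path.relative_to(base).as_posix()}")
    return lines
```

Calling `manifest_lines("some/object/dir")` on a directory of files yields a listing that can be recomputed later to check that nothing has changed. An inventory plays a similar role, with the added job of describing more than one version.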

The main advantage of OCFL over BagIt is the ability to store versioned contents. We are trying to bake versioning into the spec so that the changes to an object over time can be programmatically determined. As you might imagine, however, the addition of versioning brings with it a host of issues that BagIt didn't have to deal with.

The problem of file modification and content duplication is a tricky one, and one that we are currently trying to figure out (see #26, for example). Since the files and data collections we consider in scope may (potentially) be petabytes in size, storage efficiency is a high priority. To this end, and combined with our work on versioning, we are looking at methods of forward-versioning and content addressability (through hashes) to address this.
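As a rough illustration of how content addressability can avoid duplicating unchanged files across versions, the sketch below keeps each distinct file content once, keyed by its hash, and records every version as a map from path to hash. This is an in-memory toy under stated assumptions, not the OCFL design: the names `add_version` and `read_version`, the dictionary-based store, and the use of SHA-256 are all hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address of a file: its SHA-256 hex digest (assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def add_version(store: dict, versions: list, files: dict) -> list:
    """Record a new version of an object.

    store    maps digest -> bytes (each distinct content kept once)
    versions is a list of {path: digest} maps, one per version
    files    maps path -> bytes for the complete new state

    Returns the digests that had to be newly stored for this version.
    """
    state = {}
    newly_stored = []
    for path, data in sorted(files.items()):
        d = digest(data)
        if d not in store:
            # Only content not seen in any earlier version costs storage.
            store[d] = data
            newly_stored.append(d)
        state[path] = d
    versions.append(state)
    return newly_stored


def read_version(store: dict, versions: list, number: int) -> dict:
    """Rebuild the full path -> bytes state of version `number` (1-based)."""
    return {path: store[d] for path, d in versions[number - 1].items()}
```

With this layout, a second version that changes one file out of two stores only the changed content, while `read_version` can still reconstruct either version in full. The trade-off against copying the whole object on every change is that retrieval needs the per-version map rather than a plain directory listing.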

If you are looking for further discussion, I would encourage you to join the OCFL Community Google Group (https://groups.google.com/forum/#!search/ocfl-community) and our monthly community calls, or hop onto the Slack channel.

zimeon commented 5 years ago

We are using Archivematica at Cornell, and my hope is that we will take the Archivematica AIP produced and make that a version of an OCFL object in archival storage. (If we update the content and reprocess, that would then become v2, etc.)

ahankinson commented 5 years ago

F2F 2018.09.05: Discussion of this issue resulted in a decision that the spec would support this, but this does not have a direct bearing on the shape of the spec.