OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

Package per version storage #33

Open zimeon opened 5 years ago

zimeon commented 5 years ago

In cases where there are many small files in an object or where the storage infrastructure is not efficient at handling many files, it is useful to package files using a technology such as ZIP. This is addressed for the whole object in #10. However, packaging the whole object as a ZIP/Tar etc. breaks the idea of immutability of version data. One could instead package the inventory and content for each new version as a new ZIP file.

zimeon commented 5 years ago

This could be something along the lines of:

[object root]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1.zip
    ├── v2.zip
    └── v3.zip

This still leaves three potentially small files per object (though inventory.json might not be small), but it avoids storing any of the object's small content files individually, and each v#.zip is immutable.
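
A minimal Python sketch of the packaging step in this layout. It assumes a version directory `v1/` already exists under the object root and is packed into `v1.zip` beside it. The helper name `package_version` is hypothetical, and OCFL 1.x does not define any packaged-version layout.

```python
import zipfile
from pathlib import Path


def package_version(object_root: Path, version: str) -> Path:
    """Pack <object_root>/<version>/ into <object_root>/<version>.zip.

    Hypothetical helper for illustration; not part of any OCFL specification.
    """
    version_dir = object_root / version
    archive_path = object_root / f"{version}.zip"
    # Mode "x" raises FileExistsError if the archive is already there,
    # so an existing version package is never overwritten.
    with zipfile.ZipFile(archive_path, mode="x", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(version_dir.rglob("*")):
            if path.is_file():
                # Member names are relative to the object root,
                # e.g. "v1/content/foo/bar.xml".
                zf.write(path, arcname=path.relative_to(object_root).as_posix())
    return archive_path
```

Opening the archive in exclusive mode is one simple way to keep each `v#.zip` immutable at the application level; the storage layer may offer stronger guarantees.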

rosy1280 commented 3 years ago

potentially a sub-use case of #39

ThomasEdvardsen commented 3 years ago

Hello everybody! The National Library of Norway is in the process of installing a new bit repository (HPSS) that can hold 44 PB of data. In this context, we are considering using OCFL to organize our data packages.

So far, OCFL looks very good, but we are dependent on ZIP per version storage #33 being resolved to be able to use OCFL. This is because we want to limit the number of files so that it becomes more efficient to store/retrieve data from HPSS.

I reckon this needs to be solved using an object extension? Do you have any thoughts on how this can be implemented?

ThomasEdvardsen commented 3 years ago

We have begun to think about how this can be implemented based on our needs. This is a very immature first proposal for a new object extension.

We would like to discuss the following:

  1. Whether or not to include a full path to the archived files.
  2. Whether the archive files should be placed on the object's root, or in separate version folders.
  3. Whether a version can consist of more than one archive file.

Arguments for allowing more than one file for each version:

What are your initial thoughts?

[object root]
├── 0=ocfl_object_1.0
├── extensions/
│   └── nnnn-archived-versions/
│       ├── archived-versions.json
│       └── archived-versions.json.sha512
├── inventory.json
├── inventory.json.sha512
├── v1/
│   ├── v1-1.zip
│   ├── v1-2.zip
│   └── v1-3.zip
├── v2/
│   └── v2-1.zip
└── v3/
    ├── v3-1.zip
    └── v3-2.zip  

Example content of archived-versions.json

{
  "id": "zipped_updates_three_versions_one_file",
  "versions": {
    "v1": {
      "created": "2019-01-01T02:03:04.000Z",
      "archiveAlgorithm": {
        "mime": "application/zip",
        "pronomId": "x-fmt/263"
      },
      "digestAlgorithm": "sha512",
      "files": {
        "0675bdf376e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-1.zip",
          "fileSize": 133410430
        },
        "0675b1ff76e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-2.zip",
          "fileSize": 520430330
        },
        "067ab1f376e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-3.zip",
          "fileSize": 8353634100
        }
      }
    },
    "v2": {
      "created": "2020-02-02T02:03:04.000Z",
      "archiveAlgorithm": {
        "mime": "application/zip",
        "pronomId": "x-fmt/263"
      },
      "digestAlgorithm": "sha512",
      "files": {
        "5b23ffdf2709bf393a7d8883fcdf583980": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v2/v2-1.zip",
          "fileSize": 42644244
        }
      }
    },
    "v3": {
      "created": "2021-03-03T02:03:04.000Z",
      "archiveAlgorithm": {
        "mime": "application/zip",
        "pronomId": "x-fmt/263"
      },
      "digestAlgorithm": "sha512",
      "files": {
        "88492082026f1a3a1c0637d6bd02214dd6": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v3/v3-1.zip",
          "fileSize": 8743244
        },
        "3a1c0637d6bd02214dd62c5c19ee8d4bbf": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v3/v3-2.zip",
          "fileSize": 892345
        }
      }
    }
  }
}
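
A minimal Python sketch of how a client might check archive files against a file shaped like the example above. It assumes that each key under `files` is the hex digest of the archive, computed with that version's `digestAlgorithm`, and that every entry carries its location under `filePath`. The function names are hypothetical; the format itself is only a proposal.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in 1 MiB chunks so large archives are not read into memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def check_archived_versions(manifest_path: Path) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems: List[str] = []
    for version, info in manifest["versions"].items():
        algorithm = info["digestAlgorithm"]
        for expected_digest, entry in info["files"].items():
            path = Path(entry["filePath"])  # assumed key, see lead-in
            if not path.is_file():
                problems.append(f"{version}: missing {path}")
                continue
            if path.stat().st_size != entry["fileSize"]:
                problems.append(f"{version}: size mismatch for {path}")
            if file_digest(path, algorithm) != expected_digest.lower():
                problems.append(f"{version}: digest mismatch for {path}")
    return problems
```

Checking the size before hashing gives a cheap early signal, which may matter on tape-backed storage where reading a whole archive is expensive.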

zimeon commented 3 years ago

I support the idea that any solution for packaged content should include support for multiple packages in a version (v1-1.zip, v1-2.zip, etc.) so that it could address right-sizing both groups of small files and segmenting large files (#40).

I think the biggest question is where one describes the logical files vs the physical files (packages). I lean toward having the inventory describe the physical files and thus providing the infrastructure for preservation/fixity/transfer, and then create some new way to describe the logical object content in a way that doesn't make those other processes too cumbersome in the case of objects with large numbers of files. This would potentially mean significant changes in the state ideas that currently map physical to logical files in an object version.

pwinckles commented 3 years ago

I support the idea that any solution for packaged content should include support for multiple packages in a version (v1-1.zip, v1-2.zip, etc.) so that it could address right-sizing both groups of small files and segmenting large files (#40).

If the spec adds support for zipped versions, does it necessarily need to make special mention of split zips, which are already part of the zip spec?

julianmorley commented 3 years ago

If the spec adds support for zipped versions, does it necessarily need to make special mention of split zips, which are already part of the zip spec?

I'd lean towards 'yes', based on our experiences doing something similar with Preservation Catalog. At the end of the day OCFL tracks files and their checksums. It doesn't know, for example, that a .zip file contains information that points to other zip segments, and we want a human reading the manifest to be able to see that the version directory should contain 10 files (file.zip, file.z01, file.z02, ... etc) without having to wonder if the single file.zip in the directory is meant to be just one file or the first in a series of zip segments.

My early guess is that, in OCFL v2, we'll expand inventory.json to be able to say "this version of this physical representation of this object is stored as a zip archive with these parameters", and list out all the zip parts and their checksums, together with a sidecar file that lists all the files in those zips (and their checksums).

ThomasEdvardsen commented 3 years ago

I just want to point out that we at NLN do not necessarily want to use split-zips to package small files. We may choose to package them in independent individual zip files. Then they are perhaps a little less prone to problems if one of the zip files should become corrupt. For splitting very large files, split-zips may be appropriate.

I therefore see it as an advantage if we do not lock the specification to only support split-zips.

julianmorley commented 3 years ago

We'll be sure to not mandate split-zips. We (Stanford) only split on versions greater than 10GB in our (non-OCFL) implementation of archival objects. Anything less than that goes into a single zip file. We'll probably include a way to specify a per-repo or per-object size at which the object-version would be split into multiple zips.
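
A minimal Python sketch of a size threshold like the one described above, applied to grouping whole files into independent packages. It is a simple greedy planner, not Stanford's or NLN's implementation. The function name `plan_packages` and the `max_bytes` parameter are hypothetical, and a file larger than the threshold simply gets a package of its own (segmenting it is out of scope here).

```python
from typing import Dict, List


def plan_packages(file_sizes: Dict[str, int], max_bytes: int) -> List[List[str]]:
    """Group logical paths into packages whose total size stays within max_bytes.

    Files are taken in sorted path order. A new package is started whenever
    adding the next file would exceed the threshold.
    """
    packages: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path in sorted(file_sizes):
        size = file_sizes[path]
        if current and current_size + size > max_bytes:
            packages.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        packages.append(current)
    return packages
```

Each resulting group could then be written to its own archive (for example v1-1.zip, v1-2.zip), so that corruption of one archive does not affect the others.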

qqmyers commented 1 year ago

+1 from the Dataverse community. We're using Bags (1 per version, versions created and archived independently over time) today and are interested in OCFL as a way to reduce storage size (via deduplication/forward deltas) but we'd like to retain the write-only, ~one-file-per-version paradigm we have today. I think that is this use case, although the archived-versions.json file discussed above, where info about all versions is one file, would not be write-only (when versions are added over time.)

neilsjefferies commented 1 year ago

@qqmyers I think we can have an analogous mechanism to the way we treat inventories. Each version could contain a (by definition write-only) copy of the archived-versions.json but there is a separate copy elsewhere that contains the current state.

zimeon commented 11 months ago

Editors' discussion 2023-09-22:

ThomasEdvardsen commented 11 months ago

I think this suggestion could be really good, and solve how our organization can use OCFL.

So to be sure - is the new suggested block at top level or at version level? I have made a proposal where the new package block is at the version level. The only drawback I can think of is that it is only possible to have one checksum for each package file. But that might not be a problem.

So, using the example from the OCFL specification:

[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── empty.txt
│       ├── foo
│       │   └── bar.xml
│       └── image.tiff
├── v2
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       └── foo
│           └── bar.xml
└── v3
    ├── inventory.json
    └── inventory.json.sha512

The same object packed with TAR:

[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1.tar
├── v2.tar
└── v3.tar

v1.tar will unpack to:

v1
├── inventory.json
├── inventory.json.sha512
└── content
    ├── empty.txt
    ├── foo
    │   └── bar.xml
    └── image.tiff
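
A minimal Python sketch of producing such a TAR, assuming the version directory exists unpacked under the object root before packing. The helper name `tar_version` is hypothetical. The archive member names start with the version directory name, so extracting `v1.tar` recreates `v1/` as shown above.

```python
import tarfile
from pathlib import Path


def tar_version(object_root: Path, version: str) -> Path:
    """Pack <object_root>/<version>/ into an uncompressed <version>.tar.

    Hypothetical helper for illustration; not part of any OCFL specification.
    """
    archive_path = object_root / f"{version}.tar"
    # Mode "x" creates the archive exclusively and fails if it already exists.
    with tarfile.open(archive_path, mode="x") as tar:
        # arcname makes members appear as "v1/...", not as absolute paths.
        tar.add(object_root / version, arcname=version)
    return archive_path
```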

Example of inventory.json with new packageManifest blocks added:

{
  "digestAlgorithm": "sha512",
  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": [ "v1/content/foo/bar.xml" ],
      "2673a7b11a70bc7ff960ad8127b4adeb": [ "v2/content/foo/bar.xml" ],
      "c289c8ccd4bab6e385f5afdd89b5bda2": [ "v1/content/image.tiff" ],
      "d41d8cd98f00b204e9800998ecf8427e": [ "v1/content/empty.txt" ]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": [ "v1/content/foo/bar.xml" ],
      "a6357c99ecc5752931e133227581e914968f3b9c": [ "v2/content/foo/bar.xml" ],
      "b9c7ccc6154974288132b63c15db8d2750716b49": [ "v1/content/image.tiff" ],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": [ "v1/content/empty.txt" ]
    }
  },
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
    "7dcc35...c31": [ "v1/content/foo/bar.xml" ],
    "cf83e1...a3e": [ "v1/content/empty.txt" ],
    "ffccf6...62e": [ "v1/content/image.tiff" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "packageManifest": {
        "a2b5f8...d97": [ "v1.tar" ]
      },      
      "user": {
        "address": "mailto:alice@example.com",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff, add empty2.txt",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt", "empty2.txt" ]
      },
      "packageManifest": {
        "c1e4d3...f82": [ "v2.tar" ]
      },      
      "user": {
        "address": "mailto:bob@example.com",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff, delete empty.txt",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty2.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "packageManifest": {
        "6f4a1d...b58": [ "v3-1.tar" ],
        "9e7c2f...a61": [ "v3-2.tar" ]
      },      
      "user": {
        "address": "mailto:cecilia@example.com",
        "name": "Cecilia"
      }
    }
  }
}
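
A minimal Python sketch of how the proposed `packageManifest` block for one version could be computed. It assumes the block maps the digest of each package file (using the inventory's `digestAlgorithm`) to a list of package paths relative to the object root, mirroring the shape of `manifest`. The function name `build_package_manifest` is hypothetical, and `packageManifest` is a proposal, not part of OCFL 1.1.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def build_package_manifest(
    object_root: Path, package_names: Iterable[str], algorithm: str = "sha512"
) -> Dict[str, List[str]]:
    """Map each package's hex digest to the package paths that have that digest."""
    manifest: Dict[str, List[str]] = {}
    for name in package_names:
        h = hashlib.new(algorithm)
        with (object_root / name).open("rb") as f:
            # Read in 1 MiB chunks so large packages are not loaded into memory.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        manifest.setdefault(h.hexdigest(), []).append(name)
    return manifest
```

For the v3 entry above, this would be called with the two package names listed there and the result stored under `versions.v3.packageManifest`.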

je4 commented 11 months ago

[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1.tar
├── v2.tar
└── v3.tar

This variant could be problematic because the inventory.json of the most recent version (if present) MUST be identical to the inventory.json in the object root (https://ocfl.io/1.1/spec/#version-inventory).

One solution would be to drop that MUST from the standard. Another would be to pack only the content folder of each version, which means that every inventory.json is aware of the package.

neilsjefferies commented 11 months ago

I'm coming round to the idea of a separate package-inventory.json file. Then we can decide to zip or unzip a version at any time without having a new inventory.json. Its presence/absence would also be an easy indicator of the existence of packaged versions.

ThomasEdvardsen commented 11 months ago

I'm coming round to the idea of a separate package-inventory.json file. Then we can decide to zip or unzip a version at any time without having a new inventory.json. Its presence/absence would also be an easy indicator of the existence of packaged versions.

My original thought was to create this as an extension, as I suggested with the archived-versions.json file. I think including this as part of the standard implementation is even better. What are your thoughts on an extension versus including it in the standard implementation, @neilsjefferies?

neilsjefferies commented 11 months ago

@ThomasEdvardsen Editors decided there was enough interest and enough use cases that this was in scope for OCFL v2 discussions.

rosy1280 commented 10 months ago

Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:

👍🏼 In favor of the use case
👎🏼 Against the use case
👀 Neutral on the use case

The poll will remain open through the end of February 2024.

MormonJesus69420 commented 10 months ago

We (@ThomasEdvardsen, @je4, and I) have worked on a set of proposals for this use case, along with some questions. You can find them here: OCFL Package Per Version Workgroup Notes

zimeon commented 6 months ago

2024-02-29: Editors agree that this should be in scope for v2. Voting at this point is +9 in favor.