Open zimeon opened 5 years ago
This could be something along the lines of:
[object root]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1.zip
├── v2.zip
└── v3.zip
This still leaves three potentially small files per object (though inventory.json might not be) but avoids any small files in the object's contents appearing alone in storage, while each v#.zip is immutable.
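A minimal sketch of how one version directory could be packed into a `v#.zip` next to the root inventory. The function name `package_version` and the choice to store paths relative to the object root are assumptions made for illustration; nothing in the proposal fixes them.

```python
import zipfile
from pathlib import Path


def package_version(object_root: Path, version: str) -> Path:
    """Pack object_root/<version>/ into object_root/<version>.zip (hypothetical helper)."""
    version_dir = object_root / version
    archive_path = object_root / f"{version}.zip"
    # Mode "x" fails if the archive already exists, so a package is never rewritten.
    with zipfile.ZipFile(archive_path, mode="x", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(version_dir.rglob("*")):
            if path.is_file():
                # Keep paths relative to the object root, e.g. "v1/content/foo/bar.xml".
                zf.write(path, arcname=path.relative_to(object_root).as_posix())
    return archive_path
```

Opening the archive in exclusive-create mode is one way to reflect the immutability goal: a second attempt to package the same version raises `FileExistsError` instead of silently replacing the file.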
potentially a sub-use case of #39
Hello everybody! The National Library of Norway is in the process of installing a new bit repository (HPSS) that can hold 44 PB of data. In this context, we are considering using OCFL to organize our data packages.
So far, OCFL looks very good, but we are dependent on ZIP per version storage #33 being resolved to be able to use OCFL. This is because we want to limit the number of files so that it becomes more efficient to store/retrieve data from HPSS.
I reckon this needs to be solved using an object extension? Do you have any thoughts on how this can be implemented?
We have begun to think about how this can be implemented based on our needs. This is a very immature first proposal for a new object extension.
We would like to discuss the following:
Arguments for allowing more than one file for each version:
What are your initial thoughts?
[object root]
├── 0=ocfl_object_1.0
├── extensions/
│   └── nnnn-archived-versions/
│       ├── archived-versions.json
│       └── archived-versions.json.sha512
├── inventory.json
├── inventory.json.sha512
├── v1/
│   ├── v1-1.zip
│   ├── v1-2.zip
│   └── v1-3.zip
├── v2/
│   └── v2-1.zip
└── v3/
    ├── v3-1.zip
    └── v3-2.zip
Example content of archived-versions.json
{
  "id": "zipped_updates_three_versions_one_file",
  "versions": {
    "v1": {
      "created": "2019-01-01T02:03:04.000Z",
      "archiveAlgorithm": {
        "mime": "application/zip",
        "pronomId": "x-fmt/263"
      },
      "digestAlgorithm": "sha512",
      "files": {
        "0675bdf376e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-1.zip",
          "fileSize": 133410430
        },
        "0675b1ff76e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-2.zip",
          "fileSize": 520430330
        },
        "067ab1f376e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-3.zip",
          "fileSize": 8353634100
        }
      }
    },
    "v2": {
      "created": "2020-02-02T02:03:04.000Z",
      "archiveAlgorithm": {
        "mime": "application/zip",
        "pronomId": "x-fmt/263"
      },
      "digestAlgorithm": "sha512",
      "files": {
        "5b23ffdf2709bf393a7d8883fcdf583980": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v2/v2-1.zip",
          "fileSize": 42644244
        }
      }
    },
    "v3": {
      "created": "2021-03-03T02:03:04.000Z",
      "archiveAlgorithm": {
        "mime": "application/zip",
        "pronomId": "x-fmt/263"
      },
      "digestAlgorithm": "sha512",
      "files": {
        "88492082026f1a3a1c0637d6bd02214dd6": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v3/v3-1.zip",
          "fileSize": 8743244
        },
        "3a1c0637d6bd02214dd62c5c19ee8d4bbf": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v3/v3-2.zip",
          "fileSize": 892345
        }
      }
    }
  }
}
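As a rough illustration of how a file like archived-versions.json could be used for fixity checks, the sketch below walks every package entry and compares size and digest. It assumes each entry carries a `filePath` and a `fileSize` and is keyed by the full hex digest named in `digestAlgorithm`; the helper names `file_digest` and `check_archived_versions` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, algorithm: str) -> str:
    """Stream a file through hashlib so large packages are not loaded into memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_archived_versions(manifest_path: Path) -> list:
    """Return a list of problem descriptions; an empty list means every package matched."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    for version, info in manifest["versions"].items():
        algorithm = info["digestAlgorithm"]
        for expected_digest, entry in info["files"].items():
            package = Path(entry["filePath"])  # assumed key, as in the first entry above
            if not package.is_file():
                problems.append(f"{version}: missing {package}")
                continue
            if package.stat().st_size != entry["fileSize"]:
                problems.append(f"{version}: size mismatch for {package}")
            if file_digest(package, algorithm) != expected_digest.lower():
                problems.append(f"{version}: digest mismatch for {package}")
    return problems
```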
I support the idea that any solution for packaged content should include support for multiple packages in a version (v1-1.zip, v1-2.zip, etc.) so that it could address right-sizing both groups of small files and segmenting large files (#40).
I think the biggest question is where one describes the logical files vs the physical files (packages). I lean toward having the inventory describe the physical files and thus providing the infrastructure for preservation/fixity/transfer, and then create some new way to describe the logical object content in a way that doesn't make those other processes too cumbersome in the case of objects with large numbers of files. This would potentially mean significant changes in the state ideas that currently map physical to logical files in an object version.
If the spec adds support for zipped versions, does it necessarily need to make special mention of split zips, which are already part of the zip spec?
I'd lean towards 'yes', based on our experiences doing something similar with Preservation Catalog. At the end of the day OCFL tracks files and their checksums. It doesn't know, for example, that a .zip file contains information that points to other zip segments, and we want a human reading the manifest to be able to see that the version directory should contain 10 files (file.zip, file.z01, file.z02, ... etc.) without having to wonder if the single file.zip in the directory is meant to be just one file or the first in a series of zip segments.
My early guess is that, in OCFL v2, we'll expand inventory.json to be able to say "this version of this physical representation of this object is stored as a zip archive with these parameters", and list out all the zip parts and their checksums, together with a sidecar file that lists all the files in those zips (and their checksums).
I just want to point out that we at NLN do not necessarily want to use split-zips to package small files. We may choose to package them in independent individual zip files. Then they are perhaps a little less prone to problems if one of the zip files should become corrupt. For splitting very large files, split-zips may be appropriate.
I therefore see it as an advantage if we do not lock the specification to only support split-zips.
We'll be sure to not mandate split-zips. We (Stanford) only split on versions greater than 10GB in our (non-OCFL) implementation of archival objects. Anything less than that goes into a single zip file. We'll probably include a way to specify a per-repo or per-object size at which the object-version would be split into multiple zips.
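To make the idea of a configurable size limit concrete, here is a small sketch that groups a version's files into independent packages, starting a new package whenever the next file would push the current one over the limit. It is a simple greedy grouping, not split-zip segmentation: a single file larger than the limit still ends up alone in its own oversized package. The function names and the `v1-1.zip` style naming (borrowed from the examples earlier in this thread) are illustrative assumptions.

```python
def plan_packages(file_sizes: dict, max_bytes: int) -> list:
    """Group file paths into packages whose combined size stays within max_bytes.

    file_sizes maps a logical path to its size in bytes. Files are taken in
    sorted path order so the plan is deterministic.
    """
    packages = []
    current, current_size = [], 0
    for path, size in sorted(file_sizes.items()):
        # Close the current package if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            packages.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        packages.append(current)
    return packages


def package_names(version: str, count: int) -> list:
    """Name packages v1-1.zip, v1-2.zip, ... (hypothetical naming convention)."""
    return [f"{version}-{i}.zip" for i in range(1, count + 1)]
```

For example, with a limit of 10 bytes and files `a`, `b`, `c` of 4 bytes each plus `d` of 20 bytes, the plan is `[["a", "b"], ["c"], ["d"]]`.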
+1 from the Dataverse community. We're using Bags (1 per version, versions created and archived independently over time) today and are interested in OCFL as a way to reduce storage size (via deduplication/forward deltas) but we'd like to retain the write-only, ~one-file-per-version paradigm we have today. I think that is this use case, although the archived-versions.json file discussed above, where info about all versions is one file, would not be write-only (when versions are added over time.)
@qqmyers I think we can have an analogous mechanism to the way we treat inventories. Each version could contain a (by definition write-only) copy of the archived-versions.json but there is a separate copy elsewhere that contains the current state.
Editors' discussion 2023-09-22:
I think this suggestion could be really good, and solve how our organization can use OCFL.
So to be sure - is the new suggested block at top level or at version level? I have made a proposal where the new package block is at the version level. The only drawback I can think of is that it is only possible to have one checksum for each package file. But that might not be a problem.
So, using the example from the OCFL specification:
[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── empty.txt
│       ├── foo
│       │   └── bar.xml
│       └── image.tiff
├── v2
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       └── foo
│           └── bar.xml
└── v3
    ├── inventory.json
    └── inventory.json.sha512
The same object packed with TAR:
[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1.tar
├── v2.tar
└── v3.tar
v1.tar will unpack to:
v1
├── inventory.json
├── inventory.json.sha512
└── content
    ├── empty.txt
    ├── foo
    │   └── bar.xml
    └── image.tiff
Example of inventory.json with new packageManifest blocks added:
{
  "digestAlgorithm": "sha512",
  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": [ "v1/content/foo/bar.xml" ],
      "2673a7b11a70bc7ff960ad8127b4adeb": [ "v2/content/foo/bar.xml" ],
      "c289c8ccd4bab6e385f5afdd89b5bda2": [ "v1/content/image.tiff" ],
      "d41d8cd98f00b204e9800998ecf8427e": [ "v1/content/empty.txt" ]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": [ "v1/content/foo/bar.xml" ],
      "a6357c99ecc5752931e133227581e914968f3b9c": [ "v2/content/foo/bar.xml" ],
      "b9c7ccc6154974288132b63c15db8d2750716b49": [ "v1/content/image.tiff" ],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": [ "v1/content/empty.txt" ]
    }
  },
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
    "7dcc35...c31": [ "v1/content/foo/bar.xml" ],
    "cf83e1...a3e": [ "v1/content/empty.txt" ],
    "ffccf6...62e": [ "v1/content/image.tiff" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "packageManifest": {
        "a2b5f8...d97": [ "v1.tar" ]
      },
      "user": {
        "address": "mailto:alice@example.com",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff, add empty2.txt",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt", "empty2.txt" ]
      },
      "packageManifest": {
        "c1e4d3...f82": [ "v2.tar" ]
      },
      "user": {
        "address": "mailto:bob@example.com",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff, delete empty.txt",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty2.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "packageManifest": {
        "6f4a1d...b58": [ "v3-1.tar" ],
        "9e7c2f...a61": [ "v3-2.tar" ]
      },
      "user": {
        "address": "mailto:cecilia@example.com",
        "name": "Cecilia"
      }
    }
  }
}
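One way a client might produce the proposed `packageManifest` block for a version is to digest each package file and map the digest to the package path, mirroring how `manifest` maps digests to content paths. This is only a sketch: `packageManifest` is a proposal from this thread, not part of OCFL 1.1, and the helper name `build_package_manifest` is made up for the example.

```python
import hashlib
from pathlib import Path


def build_package_manifest(object_root: Path, package_paths: list, algorithm: str = "sha512") -> dict:
    """Return {digest: [package path, ...]} for packages given relative to the object root."""
    manifest = {}
    for rel_path in package_paths:
        h = hashlib.new(algorithm)
        with (object_root / rel_path).open("rb") as f:
            # Read in 1 MiB chunks so multi-gigabyte packages do not exhaust memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        # Identical packages share one digest key, as content files do in "manifest".
        manifest.setdefault(h.hexdigest(), []).append(rel_path)
    return manifest
```

The returned dictionary can be serialized directly into the version block, for example `{"<sha512 of v3-1.tar>": ["v3-1.tar"], "<sha512 of v3-2.tar>": ["v3-2.tar"]}`.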
[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1.tar
├── v2.tar
└── v3.tar
This variant could be a bit problematic because the inventory.json of the latest version (if present) MUST be the same as the inventory.json in the object root (https://ocfl.io/1.1/spec/#version-inventory).
One solution would be to drop that MUST from the standard. Another would be to pack only the content folder of each version, which means that every inventory.json is aware of the package.
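The second option could look roughly like the sketch below, which packs only a version's content folder and leaves the version inventory outside the archive. The resulting layout (`v1/inventory.json` next to `v1/content.tar`) and the name `pack_content_only` are assumptions for illustration; the thread does not settle on a layout.

```python
import tarfile
from pathlib import Path


def pack_content_only(object_root: Path, version: str) -> Path:
    """Create <version>/content.tar from <version>/content, leaving inventories untouched."""
    version_dir = object_root / version
    archive_path = version_dir / "content.tar"
    # Mode "x" creates an uncompressed tar and fails if the archive already exists.
    with tarfile.open(archive_path, mode="x") as tar:
        tar.add(version_dir / "content", arcname="content")
    return archive_path
```

Removing the unpacked content directory afterwards is deliberately left to the caller, since that step should only happen once the archive has been verified.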
I'm coming round to the idea of a separate package-inventory.json file. Then we can decide to zip or unzip a version at any time without having a new inventory.json. Its presence/absence would also be an easy indicator of the existence of packaged versions.
My original thought was to create this as an extension, as I suggested with the archived-versions.json file. I think including this as part of the standard implementation is even better. What are your thoughts on expansion or including it in the standard implementation @neilsjefferies ?
@ThomasEdvardsen Editors decided there was enough interest and use cases that this was in scope for OCFL V2 discussions.
In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.
In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:
In favor of the use case | Against the use case | Neutral on the use case |
---|---|---|
👍🏼 | 👎🏼 | 👀 |
The poll will remain open through the end of February 2024.
We (@ThomasEdvardsen, @je4, and I) have worked on a set of proposals for this use case, along with some questions. You can find them here: OCFL Package Per Version Workgroup Notes
2024-02-29 Editors agree that this should be in scope for v2. Voting at this point is +9 in favor.
In cases where there are many small files in an object or where the storage infrastructure is not efficient at handling many files, it is useful to package files using a technology such as ZIP. This is addressed for the whole object in #10. However, packaging the whole object as a ZIP/Tar etc. breaks the idea of immutability of version data. One could instead package the inventory and content for each new version as a new ZIP file.