OCFL / Use-Cases

A repository to help capture, track, and discuss use cases for OCFL. Issues-only, please.

Single-file OCFL object storage (e.g., Tar, Zip) #10

Closed by ahankinson 11 months ago

ahankinson commented 6 years ago

A multinational astronomical research initiative has several terabyte-sized datasets that it wishes to make available to researchers around the world. These datasets are published as files of roughly 1 TB each, so the server filesystem is optimized for storing very large files. Their OCFL Objects are stored as ZIP files to reduce the number of small files on the storage system. They implement an OCFL server that uses the ZIP file's central directory (its index of entries) to seek within an archive and extract a particular file with low overhead, effectively providing 'directory-like' lookups.
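The lookup described in this use case can be sketched with Python's standard `zipfile` module. This is only a minimal illustration under stated assumptions, not the initiative's server: the helper names `build_stored_zip` and `read_member` and the example paths are hypothetical, and the archive is held in memory for brevity.

```python
import io
import zipfile


def build_stored_zip(files):
    """Pack {path: bytes} into an uncompressed (stored) ZIP held in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()


def read_member(zip_bytes, path):
    """Return one member's bytes without unpacking the rest of the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # getinfo() consults the central directory that zipfile parsed on
        # open; it raises KeyError if the path is not in the archive.
        info = zf.getinfo(path)
        return zf.read(info)


# Hypothetical object layout, for illustration only.
example = build_stored_zip({
    "obj1/inventory.json": b"{}",
    "obj1/v1/content/data.bin": b"abc",
})
payload = read_member(example, "obj1/v1/content/data.bin")
```

Because the members are stored rather than compressed, the bytes of each file sit contiguously in the archive, which is what makes a seek-and-read lookup cheap.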

ahankinson commented 6 years ago

Notes from LDCX: Uncompressed ZIP is preferred over TAR due to a more deterministic approach to header reading and better support for path names.

Uncompressed ZIP is ISO/IEC 21320-1:2015

zimeon commented 6 years ago

I think this should be in-scope because the idea of a self-contained object as one resource is potentially useful for storage (and will help us think about transfer). To me this speaks to the split between the object-location part of the spec and the object-structure part of the spec. I can imagine object-location having {root}/{id-based-pairtree} under which we have either a folder called {id} or a file {id}.zip.

Are there utilities that will use byte-range requests to effectively access ZIPs in an HTTP object store?
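As a rough sketch of how byte-range access could work in principle, Python's `zipfile` only needs a seekable file-like object, so a thin wrapper that turns each read into a range fetch is enough. Everything named here is an assumption for illustration: `RangeReader`, `read_member_via_ranges`, and the `fetch(start, end)` callback are hypothetical, and in a real deployment `fetch` would issue an HTTP request with a `Range` header instead of slicing bytes in memory.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Seekable read-only file backed by a range-fetch callback.

    fetch(start, end) must return bytes [start, end) of the remote object.
    """

    def __init__(self, size, fetch):
        super().__init__()
        self._size = size
        self._fetch = fetch
        self._pos = 0
        self.bytes_fetched = 0  # total bytes requested so far

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("unsupported whence value")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, b):
        start = self._pos
        end = min(self._size, start + len(b))
        if start >= end:
            return 0  # end of file
        data = self._fetch(start, end)
        b[:len(data)] = data
        self._pos += len(data)
        self.bytes_fetched += len(data)
        return len(data)


def read_member_via_ranges(size, fetch, path):
    """Read one ZIP member through range fetches; return (data, bytes_fetched)."""
    reader = RangeReader(size, fetch)
    with zipfile.ZipFile(reader) as zf:
        data = zf.read(path)
    return data, reader.bytes_fetched
```

With an uncompressed archive, the fetches are limited to the end-of-archive records, the central directory, and the requested member, so the cost scales with the member size and the number of entries rather than with the total archive size.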

julianmorley commented 6 years ago

https://gist.github.com/julianmorley/fbcff1f33a1113fb2ec6ea51fc06e46c I've sketched out a definition for an inventory-archive.json that could track large, archive-file objects. Combined with the regular inventory.json there should be enough info to be able to locate a desired file within the archives.

Practically, we should plan on any one version of an OCFL object being stored in one or more archive files. For example, we plan to segment any one version of our large objects into 10GB zip segments.
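One way to picture the segmentation is a greedy plan that assigns each file of a version to a numbered segment, giving an index from content path to segment. This sketch is an assumption for illustration and is not the format of the `inventory-archive.json` in the gist: the function name `plan_segments` and the path-to-segment-number mapping are hypothetical.

```python
def plan_segments(file_sizes, max_segment_bytes):
    """Assign files to numbered segments without exceeding the size limit.

    file_sizes maps content path -> size in bytes. Files are taken in sorted
    path order; a file larger than the limit gets a segment of its own.
    Returns a dict mapping content path -> segment number (starting at 0).
    """
    index = {}
    segment = 0
    used = 0  # bytes already assigned to the current segment
    for path in sorted(file_sizes):
        size = file_sizes[path]
        if used > 0 and used + size > max_segment_bytes:
            # Current segment is full: start the next one.
            segment += 1
            used = 0
        index[path] = segment
        used += size
    return index


# Example with a 10 GB limit, as in the plan described above.
TEN_GB = 10 * 10**9
example_index = plan_segments(
    {"v1/content/a.fits": 6 * 10**9, "v1/content/b.fits": 5 * 10**9},
    TEN_GB,
)
```

A server holding such an index would open only the segment that contains the requested path, then look the file up inside that archive.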

ahankinson commented 6 years ago

The original use case is slightly different from the one assumed in your solution, @julianmorley. It was that an entire OCFL Object can be stored as an uncompressed ZIP file, which could then be treated as a writeable object. (I've edited the text above and clarified this a bit)

I believe you are assuming individual zipped-up version directories. I think this would be a separate valid use case, so I will file one and reference this one.

neilsjefferies commented 5 years ago

Treating a zip as a writable object is not smart: updates will result in in-situ temp file writing of equivalent size to the zip, which breaks many OCFL assumptions. A zip can, however, be mounted as a file system for reading. FWIW, Sun Honeycomb object stores had the code to do that, but it was never in a release version.

rosy1280 commented 3 years ago

potentially a sub-use case of #39

zimeon commented 11 months ago

Editors' discussion 2023-09-22: We have not heard of an implementation where zip-per-object is desired. Treating ZIPs as writeable objects is not a good idea because the implementation will need a temp file the size of the uncompressed ZIP. See instead the zip-per-version use case, #33.